In This Chapter
- Introduction: When Batch is Not Fast Enough
- Project Assessment
- Phase 1: Message Infrastructure
- Phase 2: MedClaim Event Producer
- Phase 3: GlobalBank Event Consumer
- Understanding IBM MQ for COBOL Developers
- Phase 4: Parallel Running
- Phase 5: Monitoring and Alerting
- Understanding the Reconciliation in Depth
- Phase 6: Cutover Planning
- Phase 7: The Cutover
- Post-Cutover Operations
- Lessons from the Migration
- Working with the Student Mainframe Lab
- Architectural Alternatives: What Else Could They Have Built?
- Operational Runbooks
- Understanding the End-to-End Transaction Flow
- The Final Architecture
- Summary: The Complete Journey
- The Three Capstones: A Retrospective
- Closing Thoughts
Chapter 45: Capstone 3 — From Batch to Real-Time: A Full Migration Project
"The batch window is not a technical limitation. It's a business agreement — and the business just changed the terms." — Priya Kapoor, in the project kickoff meeting
Introduction: When Batch is Not Fast Enough
Capstone 1 taught you to build a COBOL system from scratch. Capstone 2 taught you to modernize a legacy system incrementally. This final capstone brings together everything you have learned in this textbook — every technique, every pattern, every design principle — to tackle the most complex challenge in enterprise COBOL: migrating a batch system to real-time event-driven processing while keeping the business running.
This is not a hypothetical exercise. Across the financial services and healthcare industries, organizations are facing the same pressure: customers, partners, and regulators want information now, not tomorrow morning. The nightly batch run that was acceptable in 2010 is a competitive disadvantage in 2025. But the batch system works. It is reliable, auditable, and understood. Replacing it with something faster cannot come at the cost of those qualities.
💡 The Fundamental Tension. Batch processing is inherently reliable because it is simple: read a file, process each record, write the results. There is no concurrency, no race conditions, no distributed state. Real-time processing introduces all of these complexities. The challenge is to gain the speed of real-time without losing the reliability of batch. This capstone shows you how.
The Business Problem
GlobalBank and MedClaim Health Services have entered a partnership. GlobalBank will process medical expense claims for MedClaim's members who hold GlobalBank health savings accounts (HSAs). When a MedClaim claim is adjudicated, the payment should be deducted from the member's GlobalBank HSA automatically.
Currently, this works through a nightly batch process:
- MedClaim's nightly batch produces a flat file of adjudicated claims
- The file is transmitted to GlobalBank via SFTP at 2 AM
- GlobalBank's morning batch reads the file and processes HSA debits
- Results are transmitted back to MedClaim in the afternoon
- MedClaim's next nightly batch posts the payment confirmations
Total elapsed time from claim adjudication to HSA debit: 24-48 hours.
The business wants this reduced to under 30 seconds.
"Thirty seconds," Derek Washington repeats when Priya Kapoor presents the requirement. "From claim adjudication to money moving?"
"Twenty-nine, if you want to underpromise," Priya replies. "The business case is simple: real-time HSA processing improves member satisfaction and reduces MedClaim's float. Both organizations benefit."
"And both organizations' batch systems need to keep running while we build this?"
"Obviously."
⚖️ The Stakes. This migration affects two production systems at two different organizations. A failure in the real-time system could cause incorrect HSA debits (taking money from members' accounts erroneously), duplicate payments, or lost transactions. The parallel-run period must prove that the real-time system produces exactly the same results as the batch system before batch is decommissioned.
Project Assessment
Current Architecture
Priya begins with a thorough assessment of both systems' current architecture. She has spent two months embedded with both teams — Maria Chen's at GlobalBank and James Okafor's at MedClaim — learning how the systems work.
MedClaim Side (Current):
Claims → [CLM-ADJUD] → Adjudicated Claims File
│
[CLM-EXTRACT] → HSA Payment File
│
SFTP Transfer (2 AM)
│
Flat file on GlobalBank LPAR
GlobalBank Side (Current):
HSA Payment File → [HSA-PROC] → VSAM Account Master Updated
│
[HSA-CONFIRM] → Confirmation File
│
SFTP Transfer (4 PM)
│
Flat file on MedClaim LPAR
│
[CLM-CONFIRM] → Claim Status Updated
Six programs, two SFTP transfers, and a minimum of 24 hours.
Target Architecture:
[CLM-ADJUD] ──→ MQ Message ──→ [HSA-EVENTS] ──→ HSA Updated
│
MQ Confirmation
│
[CLM-EVENTS] ──→ Claim Updated
Two new programs, one message queue, under 30 seconds.
Risk Assessment
Priya's risk assessment identifies the following concerns:
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| MQ message loss | Low | Critical | Persistent messages with dead-letter queue |
| Duplicate processing | Medium | High | Idempotent design with transaction IDs |
| Network failure between LPARs | Medium | Medium | MQ guaranteed delivery with retry |
| Real-time slower than 30 seconds | Medium | Medium | Performance testing with production volumes |
| Data inconsistency during parallel run | High | Medium | Reconciliation program runs daily |
| Rollback to batch needed | Low | Low | Batch infrastructure preserved for 6 months |
📊 The Parallel-Run Requirement. Both organizations require a parallel-run period where the batch and real-time systems process the same transactions simultaneously. A reconciliation program compares the results daily. Only when the reconciliation shows zero discrepancies for 30 consecutive business days will batch be decommissioned.
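The comparison at the heart of that daily reconciliation can be sketched in a few lines. This is a language-neutral Python sketch, not the actual COBOL reconciliation program; the record layout and field names are illustrative:

```python
def reconcile(batch_records, realtime_records):
    """Compare one day's batch output against the real-time transaction
    table, keyed by claim ID. Returns a list of discrepancies; an empty
    list is the 'zero discrepancies' result the cutover criterion needs."""
    batch = {r["claim_id"]: r for r in batch_records}
    realtime = {r["claim_id"]: r for r in realtime_records}
    discrepancies = []
    for claim_id, rec in batch.items():
        if claim_id not in realtime:
            discrepancies.append((claim_id, "BATCH ONLY"))
        elif rec["amount"] != realtime[claim_id]["amount"]:
            discrepancies.append((claim_id, "AMOUNT MISMATCH"))
    for claim_id in realtime:
        if claim_id not in batch:
            discrepancies.append((claim_id, "REALTIME ONLY"))
    return discrepancies
```

A "BATCH ONLY" entry means the real-time path missed a transaction (for example, an MQ outage); a "REALTIME ONLY" entry means the batch extract missed one. Both directions must be checked.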
Phase 1: Message Infrastructure
IBM MQ Design
The message queue is the backbone of the real-time system. Priya designs a two-queue architecture:
Queue 1: MEDCLAIM.HSA.PAYMENTS — Carries adjudicated claim messages from MedClaim to GlobalBank.
Queue 2: GLOBALBANK.HSA.CONFIRMS — Carries payment confirmations from GlobalBank to MedClaim.
Both queues use persistent messages (messages survive queue manager restart), which ensures that no transaction is lost even if the MQ infrastructure fails.
MQ Object Definitions:
DEFINE QLOCAL('MEDCLAIM.HSA.PAYMENTS') +
DEFPSIST(YES) +
MAXDEPTH(100000) +
MAXMSGL(4096) +
BOTHRESH(5) +
BOQNAME('MEDCLAIM.HSA.PAYMENTS.DLQ') +
DESCR('HSA payment requests from MedClaim')
DEFINE QLOCAL('MEDCLAIM.HSA.PAYMENTS.DLQ') +
DEFPSIST(YES) +
MAXDEPTH(10000) +
DESCR('Dead letter queue for failed payment messages')
DEFINE QLOCAL('GLOBALBANK.HSA.CONFIRMS') +
DEFPSIST(YES) +
MAXDEPTH(100000) +
MAXMSGL(4096) +
BOTHRESH(5) +
BOQNAME('GLOBALBANK.HSA.CONFIRMS.DLQ') +
DESCR('HSA payment confirmations to MedClaim')
DEFINE QLOCAL('GLOBALBANK.HSA.CONFIRMS.DLQ') +
DEFPSIST(YES) +
MAXDEPTH(10000) +
DESCR('Dead letter queue for failed confirm messages')
Key design decisions:
DEFPSIST(YES): All messages are persistent by default. This means MQ writes them to disk before acknowledging the PUT. It is slower than non-persistent messaging but guarantees no data loss.
BOTHRESH(5): The backout threshold. If a message is read and rolled back 5 times (indicating the consuming program keeps failing), MQ moves it to the dead-letter queue (BOQNAME) instead of returning it to the queue for a 6th attempt. This prevents a "poison message" from blocking the queue.
MAXDEPTH(100000): If GlobalBank's consumer program is down, MQ can hold up to 100,000 messages before rejecting new ones. At 500,000 claims per month (approximately 20,000 per business day, of which maybe 5,000 involve HSAs), this provides nearly a full month of buffer.
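The "nearly a full month" claim is simple arithmetic, sketched here with the volume estimates from the text:

```python
# Illustrative volumes from the text: roughly 5,000 HSA-eligible claims
# per business day against a MAXDEPTH of 100,000 messages.
MAX_DEPTH = 100_000        # MAXDEPTH on MEDCLAIM.HSA.PAYMENTS
HSA_MSGS_PER_DAY = 5_000   # estimated HSA-eligible claims per business day

# Business days of consumer outage the queue can absorb
# before MQ starts rejecting new PUTs
buffer_days = MAX_DEPTH // HSA_MSGS_PER_DAY   # 20 business days
```

Twenty business days is about four calendar weeks, which is why the text describes the queue depth as nearly a month of buffer.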
Message Format
The message payload uses JSON. This is a deliberate choice: JSON is human-readable, widely supported, and can be generated and parsed natively in Enterprise COBOL v6+.
Payment Request Message:
{
"messageId": "MSG20240315143022001",
"messageType": "HSA_PAYMENT",
"timestamp": "2024-03-15T14:30:22.001",
"claimId": "CLM000098765",
"memberId": "MBR100045678",
"hsaAccountId": "HSA0045678",
"paymentAmount": 1250.00,
"diagnosisCode": "J06.9",
"procedureCode": "99213",
"serviceDate": "2024-03-10",
"providerName": "City Medical Center",
"adjudicationDate": "2024-03-15"
}
Payment Confirmation Message:
{
"messageId": "CFM20240315143023456",
"correlationId": "MSG20240315143022001",
"messageType": "HSA_CONFIRM",
"timestamp": "2024-03-15T14:30:23.456",
"claimId": "CLM000098765",
"hsaAccountId": "HSA0045678",
"status": "SUCCESS",
"newBalance": 3750.00,
"transactionRef": "TXN20240315001234"
}
The correlationId in the confirmation links back to the original payment request, enabling end-to-end traceability.
⚠️ Idempotent Design. Every message includes a unique messageId. The consuming program checks whether it has already processed a message with this ID before applying it. This makes the system idempotent — processing the same message twice produces the same result as processing it once. This is critical because MQ's guaranteed delivery means messages may be delivered more than once in failure scenarios.
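The duplicate-check idea can be illustrated outside COBOL. A minimal Python sketch of an idempotent consumer, with an in-memory set standing in for a durable store (such as a database table keyed on the message ID):

```python
class IdempotentConsumer:
    """Sketch: skip any message whose messageId was already processed.
    The in-memory 'seen' set stands in for a durable lookup table."""

    def __init__(self):
        self.seen = set()
        self.applied = []   # debits actually applied, exactly once each

    def process(self, message):
        msg_id = message["messageId"]
        if msg_id in self.seen:
            return "DUPLICATE"      # redelivery: apply nothing
        self.seen.add(msg_id)
        self.applied.append(message)
        return "SUCCESS"
```

Processing the same message twice leaves `applied` unchanged the second time; that is exactly the safety property needed under at-least-once delivery.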
Phase 2: MedClaim Event Producer
Modifying CLM-ADJUD
The existing CLM-ADJUD program processes claims and writes results to a flat file. To enable real-time processing, James Okafor modifies CLM-ADJUD to also put a message on the MQ queue for each HSA-eligible claim.
The modification follows the "and" pattern — the program does everything it did before AND puts a message on the queue. During the parallel-run period, both the flat file and the MQ message carry the same data. This allows the batch and real-time paths to process the same transactions.
IDENTIFICATION DIVISION.
PROGRAM-ID. CLM-ADJUD.
*================================================================*
* Program: CLM-ADJUD (Modified for real-time HSA processing) *
* Purpose: Adjudicate claims; send HSA events via MQ *
* Author: James Okafor (real-time additions) *
* Date: Modified 2024-07-01 *
*================================================================*
* MQ additions use EXEC CICS commands for MQ access via *
* the CICS-MQ adapter. *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
*--- Existing working storage preserved ---
COPY CLMREC.
01 WS-ADJUD-WORK PIC X(500).
*--- New MQ-related fields ---
01 WS-MQ-MESSAGE PIC X(2000).
01 WS-MQ-MSG-LENGTH PIC S9(08) COMP VALUE 0.
01 WS-MQ-QUEUE-NAME PIC X(48)
VALUE 'MEDCLAIM.HSA.PAYMENTS'.
01 WS-MQ-RESP PIC S9(08) COMP.
01 WS-MQ-REASON PIC S9(08) COMP.
01 WS-HSA-FLAG PIC X(01).
88 WS-IS-HSA-ELIGIBLE VALUE 'Y'.
88 WS-NOT-HSA-ELIGIBLE VALUE 'N'.
01 WS-MSG-ID-WORK.
05 WS-MSG-PREFIX PIC X(03) VALUE 'MSG'.
05 WS-MSG-DATE PIC 9(08).
05 WS-MSG-TIME PIC 9(06).
05 WS-MSG-SEQ PIC 9(03).
01 WS-MSG-SEQUENCE PIC 9(03) VALUE 0.
*--- JSON data structure for MQ message ---
01 WS-HSA-PAYMENT-MSG.
05 HSA-MSG-ID PIC X(20).
05 HSA-MSG-TYPE PIC X(15)
VALUE 'HSA_PAYMENT'.
05 HSA-TIMESTAMP PIC X(23).
05 HSA-CLAIM-ID PIC X(15).
05 HSA-MEMBER-ID PIC X(12).
05 HSA-ACCOUNT-ID PIC X(10).
05 HSA-PAYMENT-AMOUNT PIC 9(07)V99.
05 HSA-DIAG-CODE PIC X(07).
05 HSA-PROC-CODE PIC X(05).
05 HSA-SERVICE-DATE PIC X(10).
05 HSA-PROVIDER-NAME PIC X(30).
05 HSA-ADJUD-DATE PIC X(10).
01 WS-CURRENT-TIMESTAMP PIC X(23).
PROCEDURE DIVISION.
* ... (existing adjudication logic preserved) ...
*================================================================*
* NEW PARAGRAPH: Send HSA payment event via MQ *
* Called after successful claim adjudication if HSA-eligible *
*================================================================*
5000-SEND-HSA-EVENT.
* Generate unique message ID
ACCEPT WS-MSG-DATE FROM DATE YYYYMMDD
ACCEPT WS-MSG-TIME FROM TIME
ADD 1 TO WS-MSG-SEQUENCE
MOVE WS-MSG-SEQUENCE TO WS-MSG-SEQ
STRING WS-MSG-PREFIX DELIMITED BY SIZE
WS-MSG-DATE DELIMITED BY SIZE
WS-MSG-TIME DELIMITED BY SIZE
WS-MSG-SEQ DELIMITED BY SIZE
INTO HSA-MSG-ID
END-STRING
* Build message payload
MOVE CLM-CLAIM-ID TO HSA-CLAIM-ID
MOVE CLM-MEMBER-ID TO HSA-MEMBER-ID
MOVE CLM-PAID-AMOUNT TO HSA-PAYMENT-AMOUNT
MOVE CLM-DIAGNOSIS-CODE TO HSA-DIAG-CODE
MOVE CLM-PROCEDURE-CODE TO HSA-PROC-CODE
* Generate JSON from COBOL data structure
JSON GENERATE WS-MQ-MESSAGE
FROM WS-HSA-PAYMENT-MSG
COUNT WS-MQ-MSG-LENGTH
END-JSON
* Put message on MQ queue via CICS
EXEC CICS WRITEQ TD
QUEUE(WS-MQ-QUEUE-NAME)
FROM(WS-MQ-MESSAGE)
LENGTH(WS-MQ-MSG-LENGTH)
RESP(WS-MQ-RESP)
END-EXEC
IF WS-MQ-RESP NOT = DFHRESP(NORMAL)
DISPLAY 'MQ PUT FAILED FOR CLAIM: '
CLM-CLAIM-ID
' RESP: ' WS-MQ-RESP
* Log failure but DO NOT fail the adjudication
* The batch path will still process this claim
PERFORM 5100-LOG-MQ-FAILURE
END-IF
.
5100-LOG-MQ-FAILURE.
* Write to error log - batch will handle this claim
DISPLAY 'MQ-FAIL: ' CLM-CLAIM-ID
' AMOUNT: ' CLM-PAID-AMOUNT
' RESP: ' WS-MQ-RESP
.
Critical Design Decision: MQ Failure Does Not Fail Adjudication.
Notice paragraph 5000-SEND-HSA-EVENT: if the MQ PUT fails, the program logs the failure but does NOT reject the claim or abort processing. The claim is still adjudicated and written to the flat file. The batch path will process it normally. This is essential during the parallel-run period — the real-time path is additive, not replacing batch.
🔗 Theme: Defensive Programming. The "belt and suspenders" approach — sending the message AND writing the flat file — ensures that no transaction is lost even if the real-time path fails completely. During the parallel-run period, both paths process every transaction. After cutover, the flat file path is disabled, but the error handling in paragraph 5100 ensures that MQ failures are always logged and can trigger fallback processing.
The HSA Account Lookup Problem
One detail that consumed more design time than expected was determining the HSA account ID for a given claim. The CLM-ADJUD program knows the member ID and the claim details, but it does not know which GlobalBank HSA account corresponds to that member. This information lives on GlobalBank's side, not MedClaim's.
Priya's team considered three approaches:
Option A: Include the HSA Account ID in the MQ message. This requires MedClaim to maintain a cross-reference table mapping member IDs to HSA account IDs. The table would need to be synchronized whenever GlobalBank creates or closes an HSA account.
Option B: Let GlobalBank look up the HSA account. The MQ message includes only the member ID, and GlobalBank's consumer program looks up the corresponding HSA account using a DB2 query.
Option C: Include the HSA Account ID in MedClaim's member file. Add a field to MedClaim's member record that stores the GlobalBank HSA account ID. This requires a copybook change and a one-time data migration.
The team chose Option B — letting GlobalBank's consumer perform the lookup. The reasoning:
- Data ownership. The HSA account ID belongs to GlobalBank. MedClaim should not maintain a copy that could become stale.
- Simplicity. No cross-reference table to maintain, no synchronization process to build.
- Performance. The lookup is a simple indexed DB2 query — less than 1 millisecond.
- Isolation. If GlobalBank changes their account numbering scheme, only their consumer program changes. MedClaim is unaffected.
This decision exemplifies a core principle of event-driven design: the message should contain what the producer knows, not what the consumer needs. The consumer is responsible for enriching the message with data from its own domain.
Message Sequencing and Ordering
An important question for any messaging system is whether message ordering matters. For HSA payments, it does — but not in the way you might expect.
If the same member has two claims adjudicated within seconds, MedClaim puts two messages on the queue. GlobalBank might process them in any order. Does this matter?
For HSA debits, order does not affect the arithmetic when both debits succeed — $100 deducted then $200 deducted produces the same final balance as $200 then $100. Order does matter when funds run short: if the account holds only $150, processing the $100 debit first succeeds and the $200 debit then fails for insufficient funds, while processing the $200 debit first fails and the $100 debit then succeeds. Which claim is rejected, and therefore the final balance, depends on the order.
The team decided that message ordering is not guaranteed and not required. Each message is processed independently. If an HSA account has insufficient funds, the debit fails and the message is handled as an error — regardless of whether other messages for the same account are waiting on the queue.
This decision simplifies the architecture enormously. Guaranteeing message ordering across two LPARs connected by MQ would require single-threaded processing, eliminating the scalability benefits of message queuing. By designing each message to be independently processable, the team can run multiple consumer instances if volume grows.
💡 The Independence Principle. When designing event-driven systems, strive for messages that can be processed independently. If Message B can only be processed after Message A, your system has an implicit ordering dependency that will cause problems under load, during recovery, and when scaling. Design your messages so that each one carries enough context to be processed in isolation.
Testing the Producer in Isolation
Before connecting to GlobalBank's consumer, James tests the producer in isolation. He configures a test queue on MedClaim's LPAR and runs CLM-ADJUD with test claims that have HSA-eligible flags.
The test plan:
| Test Case | Input | Expected Result |
|---|---|---|
| Normal HSA claim | HSA-eligible claim, $500 | Message on queue with correct JSON |
| Non-HSA claim | Non-HSA claim | No message on queue |
| Large amount | HSA-eligible, $99,999.99 | Message with correct amount formatting |
| Zero amount | HSA-eligible, $0.00 | No message (zero amounts filtered) |
| MQ down | HSA-eligible claim, queue unavailable | Claim adjudicated, MQ failure logged |
| Rapid fire | 100 HSA claims in quick succession | 100 unique messages, no duplicates |
| Special characters | Provider name with apostrophe | JSON properly escaped |
James runs each test case and examines the messages on the queue using the MQ Explorer utility. He verifies that:
- Each message is valid JSON
- The message ID is unique for every message
- The claim amount matches the adjudicated amount exactly (COMP-3 to display conversion)
- The timestamp reflects the actual time of adjudication, not some default value
- No message is generated when the MQ PUT fails (the failure is logged instead)
The "MQ down" test is particularly important. James stops the queue manager, submits a batch of HSA-eligible claims, and verifies that every claim is still adjudicated correctly and written to the flat file. The only difference is the MQ failure messages in the job log. When he restarts the queue manager, the missed claims will be caught by the reconciliation process — they appear in the batch output but not in the real-time DB2 table, and the reconciliation report flags them as "batch only."
📊 Testing Philosophy: Trust But Verify. Testing the producer in isolation before connecting it to the consumer follows the same principle as unit testing before integration testing. If the producer generates malformed messages, debugging will be much harder when the consumer is involved. By verifying message format, uniqueness, and error handling before the consumer exists, James eliminates an entire class of integration problems.
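The message-ID scheme that the "rapid fire" test exercises is easy to model. A Python sketch of the MSG + date + time + sequence format used by paragraph 5000-SEND-HSA-EVENT; like the PIC 9(03) sequence field, the counter wraps after 999:

```python
from datetime import datetime

class MessageIdGenerator:
    """Sketch of the 20-character MSG + YYYYMMDD + HHMMSS + NNN scheme.
    The 3-digit sequence makes IDs unique within one second and, like
    a PIC 9(03) counter, wraps after 999."""

    def __init__(self):
        self.seq = 0

    def next_id(self, now=None):
        now = now or datetime.now()
        self.seq = (self.seq + 1) % 1000
        return f"MSG{now:%Y%m%d%H%M%S}{self.seq:03d}"
```

One hundred rapid-fire calls within the same second yield one hundred unique IDs, which is what the test verifies. Note the scheme's limit: more than 999 messages in a single second would collide, a constraint worth revisiting before volumes grow.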
Phase 3: GlobalBank Event Consumer
The HSA-EVENTS Program
At GlobalBank, Derek Washington builds the event consumer under Maria Chen's supervision. HSA-EVENTS is a CICS program that is triggered when messages arrive on the MQ queue. It reads the message, parses the JSON, validates the HSA account, applies the debit, and sends a confirmation message back.
IDENTIFICATION DIVISION.
PROGRAM-ID. HSA-EVENTS.
*================================================================*
* Program: HSA-EVENTS *
* Purpose: Process real-time HSA payment events from MedClaim *
* Trigger: MQ message arrival on MEDCLAIM.HSA.PAYMENTS *
* Author: Derek Washington (supervised by Maria Chen) *
* Date: 2024-07-15 *
* System: GlobalBank Core Banking *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01 WS-MQ-MESSAGE PIC X(2000).
01 WS-MQ-MSG-LENGTH PIC S9(08) COMP.
01 WS-PAYMENT-QUEUE PIC X(48)
VALUE 'MEDCLAIM.HSA.PAYMENTS'.
01 WS-CONFIRM-QUEUE PIC X(48)
VALUE 'GLOBALBANK.HSA.CONFIRMS'.
01 WS-RESP-CODE PIC S9(08) COMP.
*--- Parsed payment request ---
01 WS-PAYMENT-REQUEST.
05 WS-REQ-MSG-ID PIC X(20).
05 WS-REQ-MSG-TYPE PIC X(15).
05 WS-REQ-TIMESTAMP PIC X(23).
05 WS-REQ-CLAIM-ID PIC X(15).
05 WS-REQ-MEMBER-ID PIC X(12).
05 WS-REQ-HSA-ACCT PIC X(10).
05 WS-REQ-PAY-AMOUNT PIC S9(7)V99 COMP-3.
05 WS-REQ-DIAG-CODE PIC X(07).
05 WS-REQ-PROC-CODE PIC X(05).
05 WS-REQ-SVC-DATE PIC X(10).
05 WS-REQ-PROVIDER PIC X(30).
05 WS-REQ-ADJUD-DATE PIC X(10).
*--- Confirmation response ---
01 WS-CONFIRM-RESPONSE.
05 WS-CFM-MSG-ID PIC X(20).
05 WS-CFM-CORREL-ID PIC X(20).
05 WS-CFM-MSG-TYPE PIC X(15)
VALUE 'HSA_CONFIRM'.
05 WS-CFM-TIMESTAMP PIC X(23).
05 WS-CFM-CLAIM-ID PIC X(15).
05 WS-CFM-HSA-ACCT PIC X(10).
05 WS-CFM-STATUS PIC X(10).
05 WS-CFM-NEW-BALANCE PIC S9(9)V99 COMP-3.
05 WS-CFM-TXN-REF PIC X(20).
01 WS-CONFIRM-JSON PIC X(2000).
01 WS-CONFIRM-LENGTH PIC S9(08) COMP.
*--- HSA Account fields (from DB2) ---
01 WS-HSA-FIELDS.
05 WS-HSA-BALANCE PIC S9(9)V99 COMP-3.
05 WS-HSA-STATUS PIC X(01).
88 WS-HSA-ACTIVE VALUE 'A'.
05 WS-HSA-MEMBER-ID PIC X(12).
05 WS-HSA-NEW-BALANCE PIC S9(9)V99 COMP-3.
*--- Duplicate check ---
01 WS-DUP-CHECK-COUNT PIC S9(08) COMP.
*--- Work fields ---
01 WS-TXN-REF PIC X(20).
01 WS-CURRENT-TS PIC X(23).
01 WS-PROCESS-STATUS PIC X(10).
EXEC SQL INCLUDE SQLCA END-EXEC.
PROCEDURE DIVISION.
0000-MAIN.
PERFORM 1000-RECEIVE-MESSAGE
PERFORM 2000-PARSE-REQUEST
PERFORM 3000-VALIDATE-REQUEST THRU 3000-EXIT
PERFORM 4000-PROCESS-PAYMENT THRU 4000-EXIT
PERFORM 5000-SEND-CONFIRMATION
EXEC CICS RETURN END-EXEC
.
1000-RECEIVE-MESSAGE.
EXEC CICS READQ TD
QUEUE(WS-PAYMENT-QUEUE)
INTO(WS-MQ-MESSAGE)
LENGTH(WS-MQ-MSG-LENGTH)
RESP(WS-RESP-CODE)
END-EXEC
IF WS-RESP-CODE NOT = DFHRESP(NORMAL)
DISPLAY 'HSA-EVENTS: MQ READ FAILED, RESP='
WS-RESP-CODE
EXEC CICS RETURN END-EXEC
END-IF
.
2000-PARSE-REQUEST.
JSON PARSE WS-MQ-MESSAGE
INTO WS-PAYMENT-REQUEST
END-JSON
.
3000-VALIDATE-REQUEST.
* Check for duplicate message (idempotent processing)
EXEC SQL
SELECT COUNT(*)
INTO :WS-DUP-CHECK-COUNT
FROM GLOBALBANK.HSA_TRANSACTIONS
WHERE MESSAGE_ID = :WS-REQ-MSG-ID
END-EXEC
IF WS-DUP-CHECK-COUNT > 0
MOVE 'DUPLICATE' TO WS-PROCESS-STATUS
DISPLAY 'HSA-EVENTS: DUPLICATE MESSAGE '
WS-REQ-MSG-ID
GO TO 3000-EXIT
END-IF
* Validate HSA account exists and is active
EXEC SQL
SELECT HSA_BALANCE,
HSA_STATUS,
MEMBER_ID
INTO :WS-HSA-BALANCE,
:WS-HSA-STATUS,
:WS-HSA-MEMBER-ID
FROM GLOBALBANK.HSA_ACCOUNTS
WHERE HSA_ACCOUNT_ID = :WS-REQ-HSA-ACCT
END-EXEC
EVALUATE SQLCODE
WHEN 0
IF NOT WS-HSA-ACTIVE
MOVE 'INACTIVE' TO WS-PROCESS-STATUS
ELSE
IF WS-REQ-PAY-AMOUNT > WS-HSA-BALANCE
MOVE 'NSF' TO WS-PROCESS-STATUS
ELSE
MOVE 'VALIDATED' TO WS-PROCESS-STATUS
END-IF
END-IF
WHEN +100
MOVE 'ACCT_NF' TO WS-PROCESS-STATUS
WHEN OTHER
MOVE 'DB2_ERROR' TO WS-PROCESS-STATUS
END-EVALUATE
.
3000-EXIT.
EXIT
.
4000-PROCESS-PAYMENT.
IF WS-PROCESS-STATUS NOT = 'VALIDATED'
GO TO 4000-EXIT
END-IF
* Debit the HSA account
COMPUTE WS-HSA-NEW-BALANCE =
WS-HSA-BALANCE - WS-REQ-PAY-AMOUNT
END-COMPUTE
EXEC SQL
UPDATE GLOBALBANK.HSA_ACCOUNTS
SET HSA_BALANCE = :WS-HSA-NEW-BALANCE,
LAST_ACTIVITY_TS = CURRENT TIMESTAMP
WHERE HSA_ACCOUNT_ID = :WS-REQ-HSA-ACCT
AND HSA_BALANCE = :WS-HSA-BALANCE
END-EXEC
IF SQLCODE = 0 AND SQLERRD(3) = 1
* Exactly one row updated - success
MOVE 'SUCCESS' TO WS-PROCESS-STATUS
MOVE WS-HSA-NEW-BALANCE TO WS-CFM-NEW-BALANCE
* Record the transaction
PERFORM 4100-INSERT-TRANSACTION
ELSE
* Optimistic lock failure or error - back out the
* entire CICS unit of work (DB2 included)
MOVE 'CONFLICT' TO WS-PROCESS-STATUS
EXEC CICS SYNCPOINT ROLLBACK END-EXEC
END-IF
.
4000-EXIT.
EXIT
.
4100-INSERT-TRANSACTION.
EXEC SQL
INSERT INTO GLOBALBANK.HSA_TRANSACTIONS
(MESSAGE_ID, CLAIM_ID, HSA_ACCOUNT_ID,
PAYMENT_AMOUNT, NEW_BALANCE, PROCESS_STATUS,
PROCESS_TS)
VALUES
(:WS-REQ-MSG-ID, :WS-REQ-CLAIM-ID,
:WS-REQ-HSA-ACCT, :WS-REQ-PAY-AMOUNT,
:WS-HSA-NEW-BALANCE, 'SUCCESS',
CURRENT TIMESTAMP)
END-EXEC
IF SQLCODE = 0
* Under CICS, commit via SYNCPOINT (SQL COMMIT is
* not allowed); DB2 and MQ commit together
EXEC CICS SYNCPOINT END-EXEC
ELSE
EXEC CICS SYNCPOINT ROLLBACK END-EXEC
MOVE 'LOG_FAIL' TO WS-PROCESS-STATUS
END-IF
.
5000-SEND-CONFIRMATION.
* Build confirmation message
MOVE WS-REQ-MSG-ID TO WS-CFM-CORREL-ID
MOVE WS-REQ-CLAIM-ID TO WS-CFM-CLAIM-ID
MOVE WS-REQ-HSA-ACCT TO WS-CFM-HSA-ACCT
MOVE WS-PROCESS-STATUS TO WS-CFM-STATUS
* Generate unique confirmation message ID
STRING 'CFM' DELIMITED BY SIZE
WS-REQ-MSG-ID(4:17) DELIMITED BY SIZE
INTO WS-CFM-MSG-ID
END-STRING
* Generate JSON confirmation
JSON GENERATE WS-CONFIRM-JSON
FROM WS-CONFIRM-RESPONSE
COUNT WS-CONFIRM-LENGTH
END-JSON
* Put confirmation on return queue
EXEC CICS WRITEQ TD
QUEUE(WS-CONFIRM-QUEUE)
FROM(WS-CONFIRM-JSON)
LENGTH(WS-CONFIRM-LENGTH)
RESP(WS-RESP-CODE)
END-EXEC
IF WS-RESP-CODE NOT = DFHRESP(NORMAL)
DISPLAY 'HSA-EVENTS: CONFIRM PUT FAILED, RESP='
WS-RESP-CODE
* Log failure - reconciliation will catch this
END-IF
.
Key Design Patterns in HSA-EVENTS:
Idempotent processing with duplicate check. Before processing any message, the program checks whether a message with the same ID has already been processed (by querying the HSA_TRANSACTIONS table). If it has, the message is skipped. This makes the system safe against duplicate message delivery.
Optimistic locking on the UPDATE. The UPDATE statement includes AND HSA_BALANCE = :WS-HSA-BALANCE — a condition that fails if another transaction modified the balance between the SELECT and the UPDATE. This is optimistic locking: instead of taking a database lock during the SELECT, the program assumes it will succeed and verifies at UPDATE time. If the balance changed, the UPDATE matches no rows (DB2 returns SQLCODE +100, and SQLERRD(3) is 0), and the program handles it as a conflict.
Commit after transaction recording. The payment and its transaction record are committed together in a single CICS unit of work. If either fails, both are rolled back. This ensures that the transaction log is always consistent with the account balance.
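The optimistic-locking pattern is worth seeing in miniature. A Python sketch under simplified assumptions (one account row; `conditional_update` stands in for the guarded SQL UPDATE):

```python
class HsaAccount:
    """Toy stand-in for one row of an account table."""

    def __init__(self, balance):
        self.balance = balance

    def conditional_update(self, expected, new):
        # Mirrors: UPDATE ... SET balance = :new
        #          WHERE ... AND balance = :expected
        # Returns True iff exactly one row was updated.
        if self.balance == expected:
            self.balance = new
            return True
        return False

def debit(account, amount):
    observed = account.balance               # the SELECT
    if amount > observed:
        return "NSF"
    if account.conditional_update(observed, observed - amount):
        return "SUCCESS"                     # exactly one row updated
    return "CONFLICT"                        # balance moved; caller retries
```

If another transaction changes the balance between the read and the guarded update, `conditional_update` returns False and the caller treats it as a conflict, exactly as HSA-EVENTS does.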
🔵 Why Not VSAM? Notice that HSA-EVENTS uses DB2, not VSAM, for account data. This is a deliberate choice for a real-time system. DB2 provides row-level locking (VSAM locks at the CI level), SQL access for ad-hoc queries, and built-in recovery logging. For a high-concurrency, real-time system, DB2's capabilities are essential.
Understanding IBM MQ for COBOL Developers
What is Message Queuing?
If you come from a batch background (as most COBOL developers do), message queuing requires a mental shift. In batch, programs communicate through files: Program A writes a file, and Program B reads it. The file sits on disk until the next job step runs. Communication is synchronous with the job stream — Program B cannot run until Program A completes.
Message queuing is different. Program A puts a message on a queue and continues processing immediately. Program B reads from the queue whenever it is ready — which might be milliseconds later or hours later. The queue manager (IBM MQ) guarantees that the message is delivered, even if Program B is temporarily unavailable.
This decoupling is the fundamental advantage of message queuing:
- Program A does not need to know if Program B is running. If B is down, messages accumulate on the queue and are processed when B comes back up.
- Program A does not wait for Program B to finish. A puts the message and moves on to the next claim.
- Multiple instances of Program B can read from the same queue. If processing is slow, you can start additional consumers to handle the load.
MQ from COBOL: Two Approaches
COBOL programs can interact with MQ in two ways:
1. MQ API (MQI — Message Queue Interface): Direct calls to MQ using CALL statements. This provides full control over message options (persistence, priority, expiry, correlation) but requires managing connection handles, queue handles, and message descriptors.
* MQI approach - full control, more code
CALL 'MQCONN' USING WS-QM-NAME
WS-HCONN
WS-COMP-CODE
WS-REASON
CALL 'MQOPEN' USING WS-HCONN
WS-OBJ-DESC
WS-OPEN-OPTIONS
WS-HOBJ
WS-COMP-CODE
WS-REASON
CALL 'MQPUT' USING WS-HCONN
WS-HOBJ
WS-MSG-DESC
WS-PUT-OPTIONS
WS-MSG-LENGTH
WS-MSG-BUFFER
WS-COMP-CODE
WS-REASON
2. CICS-MQ adapter: In a CICS environment, the program issues the same MQI calls, but the adapter manages the connection and enlists MQ in the CICS unit of work, so a single SYNCPOINT commits MQ and DB2 work together. (Strictly speaking, the EXEC CICS WRITEQ TD and READQ TD commands used in this chapter's listings operate on CICS transient data queues, not MQ queues; the listings use them as a simplified stand-in for the adapter's put and get calls.) This is simpler but offers less control.
* CICS approach - simpler, CICS manages the connection
EXEC CICS WRITEQ TD
QUEUE(WS-QUEUE-NAME)
FROM(WS-MESSAGE)
LENGTH(WS-MSG-LENGTH)
RESP(WS-RESP-CODE)
END-EXEC
For Derek's HSA-EVENTS program, the CICS approach is appropriate because the program runs in a CICS region. For a batch program that puts messages on a queue, the MQI approach is typically used.
Message Design Principles
Priya establishes three message design principles for the HSA system:
Principle 1: Messages are self-contained. Each message contains all information needed to process it. The consumer should not need to make additional calls to get missing data. This is why the payment request includes the diagnosis code, procedure code, and provider name — even though the consumer could look them up. Self-contained messages reduce coupling between systems and improve reliability.
Principle 2: Messages are immutable. Once a message is put on the queue, its content does not change. If an error is discovered, a new corrective message is sent — the original message is never modified. This ensures audit trail integrity and simplifies debugging.
Principle 3: Messages are versioned. Here the version is implicit in the message type ("HSA_PAYMENT"); if the format changes in the future, a new type value (such as a hypothetical "HSA_PAYMENT_V2") lets the consumer detect the version and process accordingly. This allows rolling upgrades where producers and consumers are updated at different times.
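Principle 3 implies that the consumer inspects the message type before committing to a record layout. A hypothetical Python sketch of type-based dispatch (handler names and the "DLQ" return value are illustrative, not from the actual programs):

```python
import json

HANDLERS = {}

def handles(msg_type):
    """Register a handler for one messageType value."""
    def register(fn):
        HANDLERS[msg_type] = fn
        return fn
    return register

@handles("HSA_PAYMENT")
def handle_payment(msg):
    # Illustrative handler: format the debit instruction
    return f"debit {msg['hsaAccountId']} {msg['paymentAmount']}"

def dispatch(raw):
    """Route on messageType; an unknown (future-version) type is routed
    to a dead-letter path instead of crashing the consumer."""
    msg = json.loads(raw)
    handler = HANDLERS.get(msg.get("messageType"))
    return handler(msg) if handler else "DLQ"
```

Because the type check happens before the body is interpreted, a producer can start emitting a new message version without breaking consumers that have not yet been upgraded.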
Dead-Letter Queue Processing
The dead-letter queue (DLQ) is where messages go when they cannot be processed. This can happen for several reasons:
- The consuming program ABENDs repeatedly while processing the message (BOTHRESH exceeded)
- The message format is invalid (the JSON PARSE fails)
- The target queue is full (MAXDEPTH reached)
- The message has expired (EXPIRY time exceeded)
Dead-letter queue processing is an operational concern, not an application concern. But the development team must design for it:
* In HSA-EVENTS: handle JSON parse failure gracefully
JSON PARSE WS-MQ-MESSAGE
INTO WS-PAYMENT-REQUEST
END-JSON
* Check for parse errors
IF JSON-STATUS NOT = 0
DISPLAY 'HSA-EVENTS: JSON PARSE FAILED FOR MSG'
* Do NOT process this message
* Allow MQ backout threshold to move it to DLQ
EXEC CICS ABEND ABCODE('JSNP') END-EXEC
END-IF
By ABENDing with a specific code ('JSNP'), the program tells MQ that this message could not be processed. After BOTHRESH attempts, MQ moves it to the DLQ. Operations can then examine the DLQ, investigate the malformed message, and take corrective action.
📊 DLQ Monitoring. In production, the DLQ depth should always be zero. Any message on the DLQ represents a transaction that was not processed — which in a financial system means money that was not moved, a payment that was not made, or an account that was not updated. Priya configures the HSA-MONITOR program to alert immediately if the DLQ contains any messages.
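The backout-threshold mechanism can be simulated directly. A Python sketch, with a dict of counts standing in for MQ's per-message backout count:

```python
from collections import deque

def consume(queue, dlq, handler, bothresh=5):
    """Simulate MQ backout handling: a message whose processing fails is
    redelivered until its backout count reaches bothresh, then moved to
    the dead-letter queue so one poison message cannot block the queue."""
    backout = {}
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)                      # normal commit path
        except Exception:
            n = backout.get(msg["messageId"], 0) + 1
            backout[msg["messageId"]] = n
            if n >= bothresh:
                dlq.append(msg)               # BOTHRESH reached -> DLQ
            else:
                queue.append(msg)             # rollback -> redeliver later
```

Note that the healthy message behind the poison one is still processed: redelivery puts the failing message back on the queue rather than retrying it in place, so throughput continues while the poison message burns through its backout count.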
Transactional Messaging
In the HSA-EVENTS program, the DB2 update and the MQ confirmation PUT must be coordinated. If the DB2 update succeeds but the MQ PUT fails, the account has been debited but MedClaim does not know it. If the MQ PUT succeeds but the DB2 update fails, MedClaim thinks the payment was made but it was not.
The solution is transactional messaging: MQ and DB2 participate in the same CICS unit of work. When CICS commits, both the DB2 change and the MQ PUT are committed atomically. If either fails, both are rolled back.
* Both DB2 and MQ participate in the CICS unit of work
* When we issue EXEC CICS SYNCPOINT, both commit together
* Step 1: Update DB2 (within CICS UOW)
EXEC SQL UPDATE GLOBALBANK.HSA_ACCOUNTS ... END-EXEC
* Step 2: Put the confirmation message (within same CICS UOW)
* (shown as a TD queue write for simplicity; with native MQ
*  this would be an MQPUT call through the CICS-MQ adapter)
EXEC CICS WRITEQ TD
QUEUE(WS-CONFIRM-QUEUE) ...
END-EXEC
* Step 3: Commit both atomically
EXEC CICS SYNCPOINT
RESP(WS-RESP-CODE)
END-EXEC
* If SYNCPOINT fails, both DB2 and MQ are rolled back
IF WS-RESP-CODE NOT = DFHRESP(NORMAL)
DISPLAY 'HSA-EVENTS: SYNCPOINT FAILED'
* Both the DB2 update and the MQ message are
* rolled back - data integrity is preserved
END-IF
This is the mainframe's answer to the distributed transaction problem. CICS acts as a transaction coordinator, and both DB2 and MQ are resource managers that participate in the two-phase commit protocol. The COBOL programmer does not need to understand the protocol details — they just use EXEC CICS SYNCPOINT and CICS handles the rest.
⚠️ The Two-Phase Commit Overhead. Transactional messaging adds overhead: each commit requires coordination between CICS, DB2, and MQ. For the HSA system, this overhead is negligible (a few milliseconds per transaction). But for high-volume systems processing thousands of messages per second, the overhead can be significant. In such cases, careful design — batching commits, using non-persistent messages for intermediate steps — can reduce the impact.
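The all-or-nothing behavior can be modeled off-mainframe. Here is a Python sketch of the syncpoint *semantics* only — the real protocol involves prepare/commit voting between CICS, DB2, and MQ, which this toy deliberately omits, and all names are illustrative:

```python
# Toy unit of work: the balance update and the outbound confirmation
# commit together or not at all.
class UnitOfWork:
    def __init__(self, balances, outbox):
        self.balances, self.outbox = balances, outbox
        self.pending_debits, self.pending_msgs = [], []

    def debit(self, acct, amount):
        if self.balances[acct] < amount:
            raise ValueError("insufficient funds")
        self.pending_debits.append((acct, amount))

    def put_message(self, msg):
        self.pending_msgs.append(msg)

    def syncpoint(self):
        # Apply every pending change atomically, then clear the UOW.
        for acct, amount in self.pending_debits:
            self.balances[acct] -= amount
        self.outbox.extend(self.pending_msgs)
        self.pending_debits, self.pending_msgs = [], []

balances, outbox = {"HSA001": 500.00}, []
uow = UnitOfWork(balances, outbox)
try:
    uow.debit("HSA001", 900.00)        # fails validation
    uow.put_message("CONFIRM 900.00")  # never reached
    uow.syncpoint()
except ValueError:
    pass  # rollback: pending changes are simply discarded

uow.debit("HSA001", 120.00)
uow.put_message("CONFIRM HSA001 120.00")
uow.syncpoint()  # debit and confirmation commit together
```

The failed transaction leaves no trace; the successful one applies both the debit and the confirmation in a single step. That is the guarantee EXEC CICS SYNCPOINT provides across DB2 and MQ.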
Error Handling Patterns in Event-Driven COBOL
Event-driven systems introduce error handling patterns that batch COBOL programmers may not have encountered. In batch, an error typically means writing an error record and continuing to the next input record, or in severe cases, ABENDing the job. In event-driven processing, errors must be handled with more nuance because messages are independent and the system must continue processing other messages even when one fails.
Pattern 1: Retry with Backoff.
Some errors are transient — a DB2 deadlock, a temporary resource contention, a network glitch. The correct response is to retry the operation after a brief delay. In CICS, this is accomplished by rolling back the current message (allowing MQ to redeliver it) and using the backout threshold to limit retries:
* Pattern: Detect transient error and allow retry
EVALUATE SQLCODE
WHEN -911
* Deadlock or timeout - transient error
* Rollback and let MQ redeliver
EXEC CICS SYNCPOINT ROLLBACK
RESP(WS-RESP-CODE)
END-EXEC
* The message goes back to the queue
* MQ will redeliver after a brief delay
* After BOTHRESH attempts, DLQ
WHEN -904
* Resource unavailable - also transient
EXEC CICS SYNCPOINT ROLLBACK
RESP(WS-RESP-CODE)
END-EXEC
WHEN OTHER
* Permanent error - do not retry
PERFORM 9100-LOG-PERMANENT-ERROR
* Commit the GET (remove message from queue)
* The error is logged; reconciliation will catch it
EXEC CICS SYNCPOINT
RESP(WS-RESP-CODE)
END-EXEC
END-EVALUATE
The key distinction is between transient and permanent errors. Retrying a permanent error (like a member not found, or an invalid claim ID) will never succeed and wastes resources. The program must classify each error and respond appropriately.
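The classification itself is a small decision table. A Python sketch of the same logic (the SQLCODEs come from the EVALUATE above; the disposition names are illustrative):

```python
# Transient errors roll back so the queue redelivers; permanent
# errors are logged and the message is consumed.
TRANSIENT_SQLCODES = {-911, -904}   # deadlock/timeout, resource unavailable

def dispose(sqlcode):
    """Decide what to do with a message after a DB2 call."""
    if sqlcode == 0:
        return "commit"                  # success
    if sqlcode in TRANSIENT_SQLCODES:
        return "rollback-and-retry"      # MQ redelivers; BOTHRESH caps retries
    return "log-and-commit"              # permanent: never retry, log it
```

Centralizing the classification in one place (a table, not scattered IFs) makes it easy to audit which errors are retried and to add new transient codes as they are discovered in production.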
Pattern 2: Compensating Transactions.
What happens if the HSA debit succeeds but the confirmation message fails to send? The account has been debited, but MedClaim does not know it. A compensating transaction reverses the debit:
* Pattern: Compensating transaction
* DB2 update succeeded but MQ confirm failed
* Reverse the DB2 update
EXEC SQL
UPDATE GLOBALBANK.HSA_ACCOUNTS
SET HSA_BALANCE = HSA_BALANCE
+ :WS-PAYMENT-AMOUNT
WHERE HSA_ACCOUNT_ID = :WS-HSA-ACCOUNT-ID
END-EXEC
* Log the compensation
DISPLAY 'HSA-EVENTS: COMPENSATING TXN FOR '
WS-HSA-ACCOUNT-ID
' AMOUNT: ' WS-PAYMENT-AMOUNT
In practice, the HSA-EVENTS program uses transactional messaging (SYNCPOINT) to avoid this scenario. But compensating transactions are important in systems where the two resources (DB2 and MQ) cannot participate in the same unit of work — for example, when communicating with an external system via HTTP.
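A minimal Python sketch of the pattern for that non-transactional case (the account, amounts, and partner call are hypothetical):

```python
# Compensating-transaction pattern: the debit commits locally, and a
# failed confirmation is undone by posting the opposite entry.
def pay_with_compensation(balances, acct, amount, send_confirm):
    balances[acct] -= amount          # step 1: debit, committed locally
    try:
        send_confirm(acct, amount)    # step 2: tell the partner
    except ConnectionError:
        balances[acct] += amount      # compensate: reverse the debit
        return "compensated"
    return "confirmed"

balances = {"HSA001": 500.00}

def unreachable_partner(acct, amount):
    raise ConnectionError("partner system down")

result = pay_with_compensation(balances, "HSA001", 75.00, unreachable_partner)
# The debit was applied, then reversed - the balance is unchanged.
```

Note that the compensation is itself a transaction that can fail; production systems log every compensation so reconciliation can verify the reversal actually landed.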
Pattern 3: The Error Event.
Instead of silently swallowing errors or ABENDing, event-driven systems can publish error events. These are messages placed on a dedicated error queue that describe what went wrong, when, and for which transaction:
* Pattern: Publish error event
9200-PUBLISH-ERROR-EVENT.
INITIALIZE WS-ERROR-EVENT
MOVE WS-ORIGINAL-MSG-ID TO ERR-ORIGINAL-MSG-ID
MOVE WS-CLAIM-ID TO ERR-CLAIM-ID
MOVE WS-ERROR-CODE TO ERR-ERROR-CODE
MOVE WS-ERROR-DESC TO ERR-ERROR-DESC
* CURRENT-DATE supplies date and time; ACCEPT FROM TIME
* would give time-of-day only
MOVE FUNCTION CURRENT-DATE TO ERR-TIMESTAMP
JSON GENERATE WS-ERROR-JSON
FROM WS-ERROR-EVENT
COUNT WS-ERROR-JSON-LEN
END-JSON
EXEC CICS WRITEQ TD
QUEUE('GLOBALBANK.HSA.ERRORS')
FROM(WS-ERROR-JSON)
LENGTH(WS-ERROR-JSON-LEN)
RESP(WS-RESP-CODE)
END-EXEC
.
Error events are invaluable for operational monitoring. Instead of grepping through CICS logs for error messages, the operations team monitors the error queue. Automated tooling can consume error events and create tickets, send alerts, or trigger corrective processes.
Performance Considerations for Real-Time COBOL
Moving from batch to real-time changes the performance characteristics of a COBOL program in fundamental ways.
Batch Performance: Measured in throughput — records per second. A batch program processes millions of records over hours. Individual record processing time does not matter as long as the total batch window is met.
Real-Time Performance: Measured in latency — milliseconds per transaction. Every millisecond counts because the end user or partner system is waiting for a response. A batch program that processes 10,000 records per second (0.1ms each) is fast. A real-time program that takes 100ms per transaction may be too slow.
The HSA-EVENTS program targets a per-transaction latency of under 50ms. Priya's performance analysis breaks this down:
| Component | Target (ms) | Notes |
|---|---|---|
| MQ GET | 2-5 | Network + disk I/O for persistent message |
| JSON PARSE | 1-2 | CPU-bound, depends on message size |
| Duplicate check (DB2) | 3-8 | Index lookup on MESSAGE_ID |
| Account SELECT (DB2) | 3-8 | Primary key lookup |
| Account UPDATE (DB2) | 5-10 | Row lock + log write |
| Transaction INSERT (DB2) | 3-8 | Index maintenance |
| JSON GENERATE | 1-2 | CPU-bound |
| MQ PUT (confirm) | 2-5 | Network + disk I/O |
| SYNCPOINT | 5-15 | Two-phase commit (DB2 + MQ) |
| Total | 25-63 | Target: < 50ms average |
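The totals row can be sanity-checked by summing the component ranges, a quick cross-check in Python:

```python
# Per-component latency ranges from the table (milliseconds).
budget_ms = {
    "MQ GET": (2, 5), "JSON PARSE": (1, 2), "Duplicate check": (3, 8),
    "Account SELECT": (3, 8), "Account UPDATE": (5, 10),
    "Transaction INSERT": (3, 8), "JSON GENERATE": (1, 2),
    "MQ PUT": (2, 5), "SYNCPOINT": (5, 15),
}
best_case = sum(lo for lo, hi in budget_ms.values())
worst_case = sum(hi for lo, hi in budget_ms.values())
# best_case = 25, worst_case = 63: the 50ms average target holds only
# if the DB2 and SYNCPOINT steps stay near their lower bounds.
```

The spread shows where tuning effort pays off: the three DB2 steps plus the syncpoint account for most of the worst-case budget.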
Tuning Strategies:
- DB2 Buffer Pool Sizing. The HSA_ACCOUNTS table is small enough to fit entirely in the DB2 buffer pool. If every row is cached in memory, the SELECT avoids physical I/O entirely. Priya works with GlobalBank's DBA to allocate a dedicated buffer pool (BP1) for the HSA tables with enough pages to hold the entire dataset.
- MQ Non-Persistent for Confirmations. While payment requests must be persistent (losing a payment message is unacceptable), confirmation messages could theoretically be non-persistent. If a confirmation is lost, the reconciliation process will detect the missing confirmation and regenerate it. However, the team decides to keep confirmations persistent — the performance difference (2-3ms) is not worth the operational complexity of missing confirmations.
- DB2 Static SQL. The HSA-EVENTS program uses static SQL (embedded in EXEC SQL blocks), not dynamic SQL. Static SQL is precompiled and optimized at BIND time, avoiding the overhead of runtime SQL parsing. For a program that executes the same queries millions of times, static SQL can be 2-5x faster than dynamic SQL.
- CICS Storage Management. Each invocation of HSA-EVENTS allocates working storage. By keeping working storage small and avoiding unnecessary GETMAIN/FREEMAIN calls, the program minimizes CICS storage overhead. The WS-MQ-MESSAGE field is sized at 2000 bytes — large enough for the largest expected message, small enough to avoid waste.
Capacity Planning
Priya builds a capacity model to ensure the real-time system can handle current and projected volumes:
Current volumes:
- 20,000 claims adjudicated per business day
- ~5,000 are HSA-eligible (25%)
- Processing concentrated in business hours (10 hours)
- Average rate: 500 HSA messages per hour = ~8.3 per minute

Projected growth (3 years):
- Claims volume expected to grow 15% annually
- New partner integrations may add 30% more HSA-eligible claims
- Projected peak: ~15,000 HSA messages per day = 25 per minute

Capacity analysis:
- At 50ms per transaction, one CICS transaction instance can process 20 per second = 1,200 per minute
- Current volume (8.3/minute) uses less than 1% of capacity
- Even at projected peak (25/minute), utilization is approximately 2%
- The system has enormous headroom — a factor of 50x between projected peak and single-instance capacity
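The capacity arithmetic is reproducible in a few lines of Python, using only the figures quoted above:

```python
# Reproduce the capacity model (all figures from the text).
latency_s = 0.050                        # 50 ms per transaction
per_minute = (1 / latency_s) * 60        # 1,200/min for one instance

current_rate = 5_000 / (10 * 60)         # ~8.3 messages/minute today
projected_peak = 15_000 / (10 * 60)      # 25/minute in 3 years

current_util = current_rate / per_minute     # under 1%
peak_util = projected_peak / per_minute      # about 2%
headroom = per_minute / projected_peak       # ~48x ("a factor of 50")
```

Working the numbers this way also makes the model easy to re-run when the inputs change, for example if the latency target or the partner volume forecast is revised.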
"This is one of the advantages of the mainframe," Priya explains to the project steering committee. "The z/OS hardware and CICS infrastructure can handle transaction rates that would require a cluster of distributed servers. We have capacity for growth that goes well beyond our 3-year horizon."
Burst capacity is the more relevant concern. Claims are not adjudicated evenly throughout the day. The peak hour may see 3x the average rate. And if MedClaim runs a re-adjudication batch (reprocessing previously denied claims), thousands of messages may arrive in minutes.
The team establishes monitoring thresholds:
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Queue depth | < 100 | 100-1000 | > 1000 |
| Processing latency (avg) | < 30ms | 30-100ms | > 100ms |
| Error rate | 0% | > 0% and ≤ 0.1% | > 0.1% |
| DLQ depth | 0 | 1-5 | > 5 |
The HSA-MONITOR program checks these thresholds every 5 minutes and sends alerts when warning or critical levels are reached.
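The threshold table maps naturally to a small classification function, sketched here in Python (the threshold values are the chapter's; the function itself is an illustration — the real checks live in HSA-MONITOR):

```python
# Classify a metric reading against the monitoring thresholds.
THRESHOLDS = {                 # metric: (warning above, critical above)
    "queue_depth": (100, 1000),
    "latency_ms": (30, 100),
    "error_rate": (0.0, 0.001),   # any error warns; above 0.1% is critical
    "dlq_depth": (0, 5),          # any DLQ message warns; above 5 is critical
}

def classify(metric, value):
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "NORMAL"
```

Keeping the thresholds in one table (rather than hard-coded in each check) means operations can tune alert sensitivity without touching the check logic.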
Phase 4: Parallel Running
The Reconciliation Architecture
During the parallel-run period, both the batch and real-time paths process every HSA payment. The reconciliation program compares the results nightly.
MedClaim Adjudication
│ │
[MQ Message] [Flat File]
│ │
[HSA-EVENTS] [HSA-PROC Batch]
│ │
DB2: HSA_TXN VSAM: Audit Trail
│ │
└──── [HSA-RECON] ────┘
│
Reconciliation Report
The Reconciliation Program
HSA-RECON reads both the DB2 transaction log (from the real-time path) and the VSAM audit trail (from the batch path) and compares them claim-by-claim.
IDENTIFICATION DIVISION.
PROGRAM-ID. HSA-RECON.
*================================================================*
* Program: HSA-RECON *
* Purpose: Reconcile real-time and batch HSA processing *
* Author: Priya Kapoor *
* Date: 2024-08-01 *
*================================================================*
* Compares DB2 transaction log (real-time) with VSAM audit *
* trail (batch) to verify both paths produce identical results. *
*================================================================*
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT BATCH-AUDIT
ASSIGN TO BCHAUDIT
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-BATCH-STATUS.
SELECT RECON-REPORT
ASSIGN TO RECONRPT
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-RPT-STATUS.
DATA DIVISION.
FILE SECTION.
FD BATCH-AUDIT
RECORDING MODE IS F
RECORD CONTAINS 150 CHARACTERS.
01 BATCH-AUDIT-REC.
05 BA-CLAIM-ID PIC X(15).
05 BA-HSA-ACCT PIC X(10).
05 BA-AMOUNT PIC S9(7)V99 COMP-3.
05 BA-STATUS PIC X(01).
05 BA-PROCESS-DATE PIC 9(08).
05 FILLER PIC X(111).
FD RECON-REPORT
RECORDING MODE IS F
RECORD CONTAINS 132 CHARACTERS.
01 RECON-LINE PIC X(132).
WORKING-STORAGE SECTION.
01 WS-FILE-STATUSES.
05 WS-BATCH-STATUS PIC X(02).
88 WS-BATCH-OK VALUE '00'.
88 WS-BATCH-EOF VALUE '10'.
05 WS-RPT-STATUS PIC X(02).
*--- DB2 cursor for real-time transactions ---
01 WS-RT-FIELDS.
05 WS-RT-CLAIM-ID PIC X(15).
05 WS-RT-HSA-ACCT PIC X(10).
05 WS-RT-AMOUNT PIC S9(7)V99 COMP-3.
05 WS-RT-STATUS PIC X(10).
05 WS-RT-PROCESS-DATE PIC X(10).
01 WS-COUNTERS.
05 WS-BATCH-COUNT PIC 9(07) VALUE ZERO.
05 WS-REALTIME-COUNT PIC 9(07) VALUE ZERO.
05 WS-MATCH-COUNT PIC 9(07) VALUE ZERO.
05 WS-MISMATCH-COUNT PIC 9(07) VALUE ZERO.
05 WS-BATCH-ONLY PIC 9(07) VALUE ZERO.
05 WS-REALTIME-ONLY PIC 9(07) VALUE ZERO.
*--- Recon date as character yyyymmdd (valid DB2 date string) ---
01 WS-RECON-DATE PIC X(08).
01 WS-FLAGS.
05 WS-EOF-FLAG PIC X(01) VALUE 'N'.
88 WS-END-OF-BATCH VALUE 'Y'.
88 WS-MORE-BATCH VALUE 'N'.
05 WS-CURSOR-FLAG PIC X(01) VALUE 'N'.
88 WS-END-OF-CURSOR VALUE 'Y'.
88 WS-MORE-CURSOR VALUE 'N'.
01 WS-RPT-HEADER.
05 FILLER PIC X(01) VALUE SPACES.
05 FILLER PIC X(50)
VALUE 'HSA RECONCILIATION REPORT - BATCH VS REAL-TIME'.
05 FILLER PIC X(30) VALUE SPACES.
05 FILLER PIC X(06) VALUE 'DATE: '.
05 RH-DATE PIC X(10).
05 FILLER PIC X(35) VALUE SPACES.
01 WS-RPT-DETAIL.
05 FILLER PIC X(01) VALUE SPACES.
05 RD-TYPE PIC X(10).
05 FILLER PIC X(02) VALUE SPACES.
05 RD-CLAIM PIC X(15).
05 FILLER PIC X(02) VALUE SPACES.
05 RD-ACCT PIC X(10).
05 FILLER PIC X(02) VALUE SPACES.
05 RD-B-AMT PIC -(7)9.99.
05 FILLER PIC X(02) VALUE SPACES.
05 RD-R-AMT PIC -(7)9.99.
05 FILLER PIC X(02) VALUE SPACES.
05 RD-B-STAT PIC X(01).
05 FILLER PIC X(02) VALUE SPACES.
05 RD-R-STAT PIC X(10).
05 FILLER PIC X(02) VALUE SPACES.
05 RD-RESULT PIC X(10).
05 FILLER PIC X(40) VALUE SPACES.
01 WS-RPT-SUMMARY.
05 FILLER PIC X(01) VALUE SPACES.
05 FILLER PIC X(50)
VALUE '================================================'.
05 FILLER PIC X(81) VALUE SPACES.
EXEC SQL INCLUDE SQLCA END-EXEC.
EXEC SQL DECLARE RT-CURSOR CURSOR FOR
SELECT CLAIM_ID,
HSA_ACCOUNT_ID,
PAYMENT_AMOUNT,
PROCESS_STATUS,
CHAR(DATE(PROCESS_TS), ISO)
FROM GLOBALBANK.HSA_TRANSACTIONS
WHERE DATE(PROCESS_TS) = :WS-RECON-DATE
ORDER BY CLAIM_ID
END-EXEC.
PROCEDURE DIVISION.
0000-MAIN.
PERFORM 1000-INITIALIZE
PERFORM 2000-RECONCILE
PERFORM 3000-TERMINATE
STOP RUN
.
1000-INITIALIZE.
OPEN INPUT BATCH-AUDIT
OPEN OUTPUT RECON-REPORT
ACCEPT WS-RECON-DATE FROM DATE YYYYMMDD
MOVE WS-RECON-DATE TO RH-DATE
WRITE RECON-LINE FROM WS-RPT-HEADER
AFTER ADVANCING PAGE
EXEC SQL OPEN RT-CURSOR END-EXEC
PERFORM 2100-READ-BATCH
PERFORM 2200-FETCH-REALTIME
.
2000-RECONCILE.
PERFORM UNTIL WS-END-OF-BATCH AND WS-END-OF-CURSOR
EVALUATE TRUE
WHEN WS-END-OF-BATCH AND WS-END-OF-CURSOR
CONTINUE
WHEN WS-END-OF-CURSOR
* Batch has records, real-time doesn't
PERFORM 2300-BATCH-ONLY
PERFORM 2100-READ-BATCH
WHEN WS-END-OF-BATCH
* Real-time has records, batch doesn't
PERFORM 2400-REALTIME-ONLY
PERFORM 2200-FETCH-REALTIME
WHEN BA-CLAIM-ID = WS-RT-CLAIM-ID
* Both have this claim - compare
PERFORM 2500-COMPARE-RECORDS
PERFORM 2100-READ-BATCH
PERFORM 2200-FETCH-REALTIME
WHEN BA-CLAIM-ID < WS-RT-CLAIM-ID
* Batch has a claim that real-time doesn't
PERFORM 2300-BATCH-ONLY
PERFORM 2100-READ-BATCH
WHEN BA-CLAIM-ID > WS-RT-CLAIM-ID
* Real-time has a claim that batch doesn't
PERFORM 2400-REALTIME-ONLY
PERFORM 2200-FETCH-REALTIME
END-EVALUATE
END-PERFORM
.
2100-READ-BATCH.
READ BATCH-AUDIT
EVALUATE TRUE
WHEN WS-BATCH-OK
ADD 1 TO WS-BATCH-COUNT
WHEN WS-BATCH-EOF
SET WS-END-OF-BATCH TO TRUE
WHEN OTHER
DISPLAY 'BATCH READ ERROR: ' WS-BATCH-STATUS
SET WS-END-OF-BATCH TO TRUE
END-EVALUATE
.
2200-FETCH-REALTIME.
EXEC SQL
FETCH RT-CURSOR
INTO :WS-RT-CLAIM-ID,
:WS-RT-HSA-ACCT,
:WS-RT-AMOUNT,
:WS-RT-STATUS,
:WS-RT-PROCESS-DATE
END-EXEC
EVALUATE SQLCODE
WHEN 0
ADD 1 TO WS-REALTIME-COUNT
WHEN +100
SET WS-END-OF-CURSOR TO TRUE
WHEN OTHER
DISPLAY 'DB2 FETCH ERROR: ' SQLCODE
SET WS-END-OF-CURSOR TO TRUE
END-EVALUATE
.
2300-BATCH-ONLY.
ADD 1 TO WS-BATCH-ONLY
MOVE 'BATCH-ONLY' TO RD-TYPE
MOVE BA-CLAIM-ID TO RD-CLAIM
MOVE BA-HSA-ACCT TO RD-ACCT
MOVE BA-AMOUNT TO RD-B-AMT
MOVE ZERO TO RD-R-AMT
MOVE BA-STATUS TO RD-B-STAT
MOVE SPACES TO RD-R-STAT
MOVE 'MISMATCH' TO RD-RESULT
WRITE RECON-LINE FROM WS-RPT-DETAIL
AFTER ADVANCING 1 LINE
.
2400-REALTIME-ONLY.
ADD 1 TO WS-REALTIME-ONLY
MOVE 'RT-ONLY' TO RD-TYPE
MOVE WS-RT-CLAIM-ID TO RD-CLAIM
MOVE WS-RT-HSA-ACCT TO RD-ACCT
MOVE ZERO TO RD-B-AMT
MOVE WS-RT-AMOUNT TO RD-R-AMT
MOVE SPACES TO RD-B-STAT
MOVE WS-RT-STATUS TO RD-R-STAT
MOVE 'MISMATCH' TO RD-RESULT
WRITE RECON-LINE FROM WS-RPT-DETAIL
AFTER ADVANCING 1 LINE
.
2500-COMPARE-RECORDS.
MOVE 'BOTH' TO RD-TYPE
MOVE BA-CLAIM-ID TO RD-CLAIM
MOVE BA-HSA-ACCT TO RD-ACCT
MOVE BA-AMOUNT TO RD-B-AMT
MOVE WS-RT-AMOUNT TO RD-R-AMT
MOVE BA-STATUS TO RD-B-STAT
MOVE WS-RT-STATUS TO RD-R-STAT
IF BA-AMOUNT = WS-RT-AMOUNT
AND BA-HSA-ACCT = WS-RT-HSA-ACCT
MOVE 'MATCH' TO RD-RESULT
ADD 1 TO WS-MATCH-COUNT
ELSE
MOVE 'MISMATCH' TO RD-RESULT
ADD 1 TO WS-MISMATCH-COUNT
* Only mismatches produce detail lines; matches are
* counted and reported in the summary only
WRITE RECON-LINE FROM WS-RPT-DETAIL
AFTER ADVANCING 1 LINE
END-IF
.
3000-TERMINATE.
WRITE RECON-LINE FROM WS-RPT-SUMMARY
AFTER ADVANCING 3 LINES
DISPLAY '======================================='
DISPLAY 'HSA RECONCILIATION SUMMARY'
DISPLAY '======================================='
DISPLAY 'BATCH TRANSACTIONS: ' WS-BATCH-COUNT
DISPLAY 'REAL-TIME TRANSACTIONS: ' WS-REALTIME-COUNT
DISPLAY 'MATCHES: ' WS-MATCH-COUNT
DISPLAY 'MISMATCHES: ' WS-MISMATCH-COUNT
DISPLAY 'BATCH-ONLY: ' WS-BATCH-ONLY
DISPLAY 'REAL-TIME-ONLY: ' WS-REALTIME-ONLY
DISPLAY '======================================='
EXEC SQL CLOSE RT-CURSOR END-EXEC
CLOSE BATCH-AUDIT
RECON-REPORT
IF WS-MISMATCH-COUNT > ZERO OR
WS-BATCH-ONLY > ZERO OR
WS-REALTIME-ONLY > ZERO
MOVE 4 TO RETURN-CODE
ELSE
MOVE 0 TO RETURN-CODE
END-IF
.
The Reconciliation Algorithm:
HSA-RECON uses the classic merge-compare pattern — the same pattern used in sequential file matching throughout the COBOL world. Both sources are sorted by claim ID. The program advances through both sources simultaneously, comparing claim IDs at each step:
- If both have the same claim ID, compare amounts and statuses (match or mismatch)
- If batch has a claim that real-time does not, it is a "batch-only" record
- If real-time has a claim that batch does not, it is a "real-time-only" record
This merge pattern is linear — O(n + m) in the sizes of the two sources — because it processes both in a single pass, regardless of volume.
📊 The 30-Day Gate. The reconciliation runs every night during the parallel-run period. The go-live criterion is 30 consecutive business days with zero mismatches, zero batch-only records, and zero real-time-only records. If any day shows a discrepancy, the counter resets to zero and investigation begins. This is a demanding criterion, but Priya insists: "We are moving money. Zero is the only acceptable error rate."
Phase 5: Monitoring and Alerting
The HSA-MONITOR Program
In production, real-time systems require active monitoring. Unlike batch, where you check the results in the morning, real-time problems must be detected and addressed immediately.
Priya designs a monitoring CICS transaction that runs every 5 minutes and checks system health:
IDENTIFICATION DIVISION.
PROGRAM-ID. HSA-MONITOR.
*================================================================*
* Program: HSA-MONITOR *
* Purpose: Real-time health monitoring for HSA event system *
* Schedule: Every 5 minutes via CICS interval control *
* Author: Priya Kapoor *
* Date: 2024-08-15 *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01 WS-MONITOR-RESULTS.
05 WS-QUEUE-DEPTH PIC S9(08) COMP.
05 WS-DLQ-DEPTH PIC S9(08) COMP.
05 WS-OLDEST-MSG-AGE PIC S9(08) COMP.
05 WS-PROCESSED-LAST-5M PIC S9(08) COMP.
05 WS-ERRORS-LAST-5M PIC S9(08) COMP.
01 WS-THRESHOLDS.
05 WS-MAX-QUEUE-DEPTH PIC S9(08) COMP
VALUE 1000.
05 WS-MAX-DLQ-DEPTH PIC S9(08) COMP
VALUE 0.
05 WS-MAX-MSG-AGE-SEC PIC S9(08) COMP
VALUE 300.
05 WS-MIN-THROUGHPUT PIC S9(08) COMP
VALUE 10.
01 WS-ALERT-FLAG PIC X(01).
88 WS-ALERT-NEEDED VALUE 'Y'.
88 WS-NO-ALERT VALUE 'N'.
01 WS-ALERT-MESSAGE PIC X(200).
01 WS-RESP-CODE PIC S9(08) COMP.
EXEC SQL INCLUDE SQLCA END-EXEC.
PROCEDURE DIVISION.
0000-MAIN.
SET WS-NO-ALERT TO TRUE
PERFORM 1000-CHECK-QUEUE-DEPTH
PERFORM 2000-CHECK-DLQ
PERFORM 3000-CHECK-PROCESSING-RATE
PERFORM 4000-CHECK-ERROR-RATE
IF WS-ALERT-NEEDED
PERFORM 5000-SEND-ALERT
END-IF
* Reschedule for next interval
EXEC CICS START
TRANSID('HMON')
INTERVAL(000500)
RESP(WS-RESP-CODE)
END-EXEC
EXEC CICS RETURN END-EXEC
.
1000-CHECK-QUEUE-DEPTH.
* Check how many messages are waiting
* If depth > threshold, consumer may be down
* (NUMITEMS reports the depth of an intrapartition TD queue;
*  with native MQ queues this would be an MQINQ call for the
*  MQIA_CURRENT_Q_DEPTH attribute)
EXEC CICS INQUIRE TDQUEUE('MEDCLAIM.HSA.PAYMENTS')
NUMITEMS(WS-QUEUE-DEPTH)
RESP(WS-RESP-CODE)
END-EXEC
IF WS-QUEUE-DEPTH > WS-MAX-QUEUE-DEPTH
SET WS-ALERT-NEEDED TO TRUE
* COMP items are not valid STRING operands, so keep the
* alert text fixed and DISPLAY the binary count separately
MOVE 'ALERT: PAYMENT QUEUE DEPTH OVER THRESHOLD'
TO WS-ALERT-MESSAGE
DISPLAY 'HSA-MONITOR: QUEUE DEPTH = ' WS-QUEUE-DEPTH
END-IF
.
2000-CHECK-DLQ.
* Any messages on the dead-letter queue need attention
EXEC CICS INQUIRE TDQUEUE('MEDCLAIM.HSA.PAYMENTS.DLQ')
NUMITEMS(WS-DLQ-DEPTH)
RESP(WS-RESP-CODE)
END-EXEC
IF WS-DLQ-DEPTH > WS-MAX-DLQ-DEPTH
SET WS-ALERT-NEEDED TO TRUE
MOVE 'CRITICAL: DLQ NOT EMPTY - INVESTIGATE IMMEDIATELY'
TO WS-ALERT-MESSAGE
DISPLAY 'HSA-MONITOR: DLQ DEPTH = ' WS-DLQ-DEPTH
END-IF
.
3000-CHECK-PROCESSING-RATE.
* Query DB2 for transactions in the last 5 minutes
EXEC SQL
SELECT COUNT(*)
INTO :WS-PROCESSED-LAST-5M
FROM GLOBALBANK.HSA_TRANSACTIONS
WHERE PROCESS_TS > CURRENT TIMESTAMP
- 5 MINUTES
END-EXEC
* A healthy daytime system clears at least WS-MIN-THROUGHPUT
* messages per interval; overnight lulls may need suppression
IF WS-PROCESSED-LAST-5M < WS-MIN-THROUGHPUT
SET WS-ALERT-NEEDED TO TRUE
MOVE 'WARNING: PROCESSING RATE BELOW MINIMUM'
TO WS-ALERT-MESSAGE
DISPLAY 'HSA-MONITOR: PROCESSED = ' WS-PROCESSED-LAST-5M
END-IF
4000-CHECK-ERROR-RATE.
* Query DB2 for failed transactions in the last 5 minutes
EXEC SQL
SELECT COUNT(*)
INTO :WS-ERRORS-LAST-5M
FROM GLOBALBANK.HSA_TRANSACTIONS
WHERE PROCESS_TS > CURRENT TIMESTAMP
- 5 MINUTES
AND PROCESS_STATUS <> 'SUCCESS'
END-EXEC
IF WS-ERRORS-LAST-5M > 0
SET WS-ALERT-NEEDED TO TRUE
MOVE 'WARNING: FAILED TRANSACTIONS IN LAST 5 MINUTES'
TO WS-ALERT-MESSAGE
DISPLAY 'HSA-MONITOR: ERROR COUNT = ' WS-ERRORS-LAST-5M
END-IF
.
5000-SEND-ALERT.
DISPLAY 'HSA-MONITOR: ' WS-ALERT-MESSAGE
* In production, this would send to an alerting system
* (email, Slack, PagerDuty, etc.) via MQ or API
.
Monitoring Metrics:
The monitoring program checks four health indicators every 5 minutes:
- Queue depth: If the payment queue has more than 1,000 messages waiting, the consumer may be down or overloaded.
- Dead-letter queue depth: Any messages on the DLQ indicate processing failures that need manual investigation.
- Processing rate: The number of successful transactions in the last 5 minutes. A sudden drop indicates a problem.
- Error rate: The number of failed transactions in the last 5 minutes. Any errors warrant investigation.
⚠️ Monitoring is Not Optional. Batch systems are self-monitoring in a sense: if the job fails, the operator sees it in the morning. Real-time systems can fail silently — messages accumulate on queues, error rates climb, and nobody notices until a customer complains. Active monitoring with automated alerting is essential for any real-time system.
Understanding the Reconciliation in Depth
Why Reconciliation is Hard
On the surface, reconciliation seems simple: compare two lists of transactions and report differences. In practice, it is one of the most challenging programs to write correctly, because of edge cases that do not appear in textbooks:
Timing mismatches. The batch path and real-time path may process the same claim at different times. A claim adjudicated at 11:55 PM might appear in the real-time log with a timestamp of 11:55 PM but in the batch extract for the following day. The reconciliation must handle cross-date matching.
Rounding differences. COBOL COMP-3 and DB2 DECIMAL both store packed-decimal values, so the stored amounts are identical; what differs is the intermediate precision and rounding each environment applies during computation. For most amounts the two paths produce identical results, but certain calculations (especially those involving division) may differ by one cent. The reconciliation must decide whether a one-cent difference constitutes a mismatch.
Order of operations. If two claims for the same member arrive simultaneously, the real-time path might process them in a different order than the batch path. The final balance should be the same, but the intermediate states differ. The reconciliation compares final results, not intermediate states.
Partial processing. If the real-time path processes 4,999 of 5,000 claims before MQ goes down, the reconciliation will show one "batch-only" record. This is not a bug — it is an expected consequence of the channel outage. The reconciliation report must distinguish between expected timing differences and actual processing errors.
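A matching policy that absorbs these edge cases can be sketched in Python. The one-cent and one-day tolerances below are a policy choice shown for illustration — HSA-RECON itself compares exactly, which is why every timing or rounding difference must be investigated:

```python
from datetime import date
from decimal import Decimal

def records_match(batch, realtime, tolerance=Decimal("0.01")):
    """Match claims across paths, tolerating one cent of rounding
    and one day of cross-midnight timestamp skew."""
    if batch["claim_id"] != realtime["claim_id"]:
        return False
    if abs(batch["amount"] - realtime["amount"]) > tolerance:
        return False                     # beyond rounding noise
    return abs((realtime["date"] - batch["date"]).days) <= 1

b = {"claim_id": "CLM001", "amount": Decimal("125.00"), "date": date(2024, 9, 2)}
r = {"claim_id": "CLM001", "amount": Decimal("125.01"), "date": date(2024, 9, 3)}
```

Using Decimal rather than binary floating point mirrors the packed-decimal arithmetic of the mainframe paths: money comparisons should never depend on float representation.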
The Reconciliation Algorithm in Detail
HSA-RECON uses the merge-compare algorithm. This algorithm requires both sources to be sorted by the same key (claim ID). The algorithm processes both sources in a single pass:
Initialize: Read first record from both sources
LOOP until both sources exhausted:
IF both sources have current record:
IF batch_claim_id = realtime_claim_id:
COMPARE amounts and statuses
ADVANCE both sources
ELSE IF batch_claim_id < realtime_claim_id:
REPORT "batch only" for batch record
ADVANCE batch source
ELSE:
REPORT "realtime only" for realtime record
ADVANCE realtime source
ELSE IF only batch has records:
REPORT "batch only" for remaining batch records
ADVANCE batch source
ELSE IF only realtime has records:
REPORT "realtime only" for remaining realtime records
ADVANCE realtime source
END LOOP
This is the same algorithm used in sequential file matching throughout COBOL — the merge pattern. It is O(n + m) where n and m are the sizes of the two sources. For 5,000 batch records and 5,000 real-time records, it requires at most 10,000 comparisons — far more efficient than the naive O(n * m) approach of searching one source for every record in the other.
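The pseudocode translates almost line-for-line into any language. A Python rendering, with illustrative (claim_id, amount) tuples standing in for the batch and DB2 records:

```python
# Merge-compare over two claim-ID-sorted sources in one linear pass.
def reconcile(batch, realtime):
    results = {"match": [], "mismatch": [],
               "batch_only": [], "realtime_only": []}
    i = j = 0
    while i < len(batch) or j < len(realtime):
        if j == len(realtime):                     # real-time exhausted
            results["batch_only"].append(batch[i]); i += 1
        elif i == len(batch):                      # batch exhausted
            results["realtime_only"].append(realtime[j]); j += 1
        elif batch[i][0] == realtime[j][0]:        # same claim ID
            bucket = "match" if batch[i][1] == realtime[j][1] else "mismatch"
            results[bucket].append(batch[i][0]); i += 1; j += 1
        elif batch[i][0] < realtime[j][0]:         # batch is behind
            results["batch_only"].append(batch[i]); i += 1
        else:                                      # real-time is behind
            results["realtime_only"].append(realtime[j]); j += 1
    return results

res = reconcile(
    [("C001", 100.00), ("C002", 50.00), ("C004", 75.00)],
    [("C001", 100.00), ("C003", 20.00), ("C004", 75.50)],
)
```

Note that the algorithm's correctness depends entirely on both inputs being sorted by the same key — exactly why the DB2 cursor carries an ORDER BY CLAIM_ID and the batch audit file must be sorted before the run.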
Reconciliation Reporting
The reconciliation report serves multiple audiences:
Operations team: Needs a summary — how many matched, how many mismatched, are we clean today?
Development team: Needs details — which specific claims mismatched, what were the batch and real-time values, when were they processed?
Management: Needs trends — are mismatches decreasing over time? Are we on track for the 30-day gate?
Auditors: Need a complete history — every reconciliation result for every day of the parallel run, with the ability to drill into any specific mismatch.
The HSA-RECON report is designed for all four audiences: the summary at the bottom serves operations and management, the detail lines serve the development team, and the complete output file (written to a GDG for retention) serves auditors.
Handling Reconciliation Failures
When the reconciliation shows a mismatch, the investigation follows a standard procedure:
- Identify the claim. Use the claim ID from the reconciliation report.
- Check the batch audit trail. Find the claim in the batch audit file. Note the processing date, amount, and status.
- Check the DB2 transaction log. Find the claim in GLOBALBANK.HSA_TRANSACTIONS. Note the processing timestamp, amount, and status.
- Compare. Determine what differs: amount, status, or presence.
- Root cause. Common causes include: (a) timezone mismatch, (b) message delivery delay, (c) claim amended between batch extract and real-time processing, (d) bug in either path.
- Resolve. Fix the root cause. If necessary, manually adjust the affected account and update the transaction log.
During the parallel-run period, each mismatch resets the 30-day counter. This creates urgency: every mismatch must be investigated and resolved quickly, or the cutover date slips.
The 30-Day Gate: Statistical Confidence
Why 30 consecutive clean days? The number is not arbitrary. With 5,000 HSA transactions per day and a 30-day window, the parallel run processes approximately 150,000 transactions. If all 150,000 match, the team can say with high confidence that the real-time system produces identical results to the batch system.
Specifically, if the true error rate were 0.01% (one error per 10,000 transactions), the probability of seeing zero errors in 150,000 transactions is (1 − 0.0001)^150,000 ≈ e^-15 ≈ 0.0000003 (about three in ten million) — essentially zero. So 30 clean days with 5,000 daily transactions effectively proves the error rate is well below 0.01%.
If the daily volume were lower — say 100 transactions per day — 30 days would only test 3,000 transactions, and the confidence interval would be wider. For low-volume systems, a longer parallel-run period (60 or 90 days) would be appropriate.
Priya explains this to management using a simple analogy: "If you flip a coin 150,000 times and it comes up heads every time, you can be pretty confident it's not a fair coin. If our system processes 150,000 transactions with zero mismatches, we can be pretty confident it's working correctly."
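The arithmetic behind the gate is easy to verify off-mainframe:

```python
# If the true per-transaction error rate were p, the chance of a
# perfectly clean run of n independent transactions is (1 - p)**n.
def p_clean(error_rate, n):
    return (1.0 - error_rate) ** n

n = 5_000 * 30                  # 30 business days at 5,000 txns/day
prob = p_clean(0.0001, n)       # p = 0.01%: roughly e**-15, ~3e-7
low_volume = p_clean(0.0001, 100 * 30)   # 30 days at 100 txns/day
# low_volume is about 74%: a clean month at low volume proves far less,
# which is why low-volume systems need a longer parallel run.
```

The two results quantify the chapter's point: at 5,000 transactions a day, a clean month is overwhelming evidence; at 100 a day, it is barely evidence at all.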
Phase 6: Cutover Planning
The Cutover Sequence
After 30 consecutive clean reconciliation days, the team prepares for cutover — the moment when the batch path is disabled and the real-time path becomes the sole processing method.
Cutover Plan:
| Step | Time | Action | Responsibility | Rollback |
|---|---|---|---|---|
| 1 | T-7 days | Final parallel-run review | Priya | N/A |
| 2 | T-1 day | Notify all stakeholders | Sarah Kim | N/A |
| 3 | T-0, 6 PM | Disable batch flat file generation | James Okafor | Re-enable flat file |
| 4 | T-0, 6:15 PM | Verify MQ messages flowing | Derek Washington | Rollback to step 3 |
| 5 | T-0, 6:30 PM | Run 10-transaction smoke test | Maria Chen | Rollback to step 3 |
| 6 | T-0, 7 PM | Monitor first real-time batch | Priya | Full rollback |
| 7 | T+1, 8 AM | Morning reconciliation | Priya | Full rollback |
| 8 | T+1, 9 AM | Go/no-go decision | All leads | Full rollback |
| 9 | T+7 | Decommission batch flat file JCL | James Okafor | N/A |
| 10 | T+30 | Remove batch code from production | All | N/A |
Rollback Strategy:
The key to a safe cutover is a fast rollback. If anything goes wrong during or after cutover, the team can:
- Re-enable the batch flat file generation in CLM-ADJUD (a one-line JCL change)
- The next nightly batch will process all claims through the old path
- Real-time MQ messages continue to flow and are processed by HSA-EVENTS, so each claim is debited by both paths and the reconciliation will show duplicates
- A cleanup program reverses the duplicate debits and removes the duplicate transactions from the DB2 table
🧪 Theme: Defensive Programming at the System Level. The cutover plan embodies defensive programming applied to an entire system. Every step has a defined rollback action. The rollback is tested before the cutover (a "rollback rehearsal" runs the previous weekend). The batch infrastructure is preserved for 30 days after cutover, not immediately decommissioned. This means the team can roll back to batch at any point during the first month if the real-time system proves unreliable.
Phase 7: The Cutover
Friday, 6:00 PM
The cutover begins. Both teams are on a conference bridge. James Okafor is at the MedClaim data center. Derek Washington and Maria Chen are at GlobalBank. Priya Kapoor is monitoring from her laptop, connected to both environments.
"Step 3: Disabling batch flat file generation," James announces. He modifies the CLM-ADJUD JCL to point the flat file DD statement to a DUMMY dataset. The program still runs — it still adjudicates claims — but it no longer writes the file that GlobalBank's batch reads.
"Step 4: Verifying MQ flow." Derek checks the queue depth. "I see 47 messages on MEDCLAIM.HSA.PAYMENTS. Messages are arriving."
"Step 5: Smoke test." Maria Chen submits 10 test claims through MedClaim's CICS interface. Within seconds, she sees the corresponding HSA debits in GlobalBank's DB2.
"All 10 processed. Amounts match. Confirmations received."
Priya checks the monitoring dashboard. Queue depth is stable. DLQ is empty. Processing rate is nominal.
"Step 6: Monitoring." Over the next two hours, 1,247 real-time HSA payments are processed. Zero errors. Average latency: 14 seconds from adjudication to HSA debit — well within the 30-second target.
Saturday, 8:00 AM
The morning reconciliation runs. Priya checks the results:
=======================================
HSA RECONCILIATION SUMMARY
=======================================
BATCH TRANSACTIONS: 0
REAL-TIME TRANSACTIONS: 4,891
MATCHES: 0
MISMATCHES: 0
BATCH-ONLY: 0
REAL-TIME-ONLY: 4,891
=======================================
All 4,891 transactions from the previous evening were processed exclusively through the real-time path. Zero batch transactions (as expected, since batch was disabled). Zero mismatches.
"We're clean," Priya tells the group.
Monday, 9:00 AM — Go/No-Go
After monitoring the weekend processing (12,340 real-time transactions, zero errors), the team meets for the go/no-go decision.
"The system processed 17,231 transactions over the weekend with zero errors, zero DLQ messages, and average latency of 12 seconds," Priya reports. "I recommend we proceed."
"Agreed," says James.
"Agreed," says Maria.
The cutover is declared successful. The batch flat file JCL will be decommissioned in 7 days. The batch programs will remain available (but unused) for 30 days.
Post-Cutover Operations
The First Week
The first week after cutover is the most critical. Even though the parallel run proved the system is correct under normal conditions, production inevitably brings conditions that testing did not anticipate.
Day 1 (Saturday): Low volume. 4,891 transactions processed. Zero errors. The team monitors continuously but finds nothing unusual.
Day 2 (Sunday): Even lower volume. 2,449 transactions. Zero errors. Derek notices that the average processing latency is 8 seconds — faster than during the parallel run because the batch system is no longer competing for DB2 locks.
Day 3 (Monday): First full business day. Volume spikes to 7,823 transactions — 50% higher than the average during parallel run. Peak hour (11 AM - 12 PM) processes 1,247 transactions. Queue depth briefly reaches 45 messages before the consumer catches up. Zero errors.
Day 4 (Tuesday): A new edge case appears. A partner hospital submits a claim with a diagnosis code that includes a special character (an en-dash instead of a hyphen). The JSON PARSE fails, and the message goes to the DLQ. HSA-MONITOR detects the DLQ message within 5 minutes and alerts the team. James traces the problem to the partner hospital's system generating non-standard characters. Fix: add a character validation step in the MedClaim producer before the JSON GENERATE.
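One way to implement that validation step, sketched with illustrative data names (the chapter does not show the actual fix): scan the diagnosis code and normalize any byte outside the expected character set before JSON GENERATE runs.

```cobol
      *    Replace any byte that is not a letter, digit, or period
      *    with a plain hyphen. Existing hyphens are rewritten to
      *    themselves, so the scan is harmless to clean data.
           MOVE CLAIM-DIAG-CODE TO WS-CLEAN-DIAG
           PERFORM VARYING WS-I FROM 1 BY 1
                   UNTIL WS-I > LENGTH OF WS-CLEAN-DIAG
               IF WS-CLEAN-DIAG(WS-I:1) IS NOT ALPHABETIC
                  AND WS-CLEAN-DIAG(WS-I:1) IS NOT NUMERIC
                  AND WS-CLEAN-DIAG(WS-I:1) NOT = '.'
                   MOVE '-' TO WS-CLEAN-DIAG(WS-I:1)
               END-IF
           END-PERFORM
```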
Day 5 (Wednesday): MedClaim deploys the character validation fix. The DLQ message is manually reprocessed after correcting the diagnosis code. Zero new errors.
Day 6 (Thursday): Normal operations. 5,412 transactions. Zero errors. The team begins to relax slightly. Derek runs a DB2 performance report and confirms that buffer pool hit ratios are at 99.8% — the HSA tables are entirely cached in memory.
Day 7 (Friday): End of the first full business week. Total transactions for the week: 31,247. Total errors: 1 (the en-dash character issue, now fixed). Priya writes the first weekly operations report and distributes it to both management chains. The report includes performance metrics, error summaries, capacity utilization, and recommendations.
This is exactly why the batch infrastructure is preserved — the character validation bug would have been invisible in batch (the flat file would have contained the same character, and COBOL would have processed it without complaint). It only appeared because JSON has stricter character encoding rules than fixed-format flat files.
Capacity Planning for Growth
With the real-time system in production, Priya turns her attention to capacity planning. The system must handle projected growth without degradation.
Current capacity:
- MQ: 100,000-message maximum queue depth (roughly 20 days of buffer at current volume)
- DB2: 500 transactions per second theoretical maximum (current peak: 3.5 per second)
- CICS: 200 concurrent tasks (current peak: 12)
Projected growth:
- MedClaim expects to add 3 new partner insurers in the next 12 months, each contributing ~2,000 transactions per day
- Total projected daily volume in 12 months: 13,000 transactions
- Total projected daily volume in 24 months: 20,000 transactions
At 20,000 transactions per day (peak hour ~2,500), the system is still well within capacity. The bottleneck, if one emerges, will be DB2 I/O — not MQ or CICS. Priya recommends monitoring DB2 buffer pool hit ratios and adjusting buffer pools if they fall below 95%.
Batch Decommissioning
After 30 days of clean production operation, the batch infrastructure is scheduled for decommissioning. But "decommission" does not mean "delete":
T+30 days: Remove the batch JCL steps from the production schedule. The JCL is archived (not deleted) in a "decommissioned" library.
T+60 days: Remove the batch program load modules from the production load library. They are archived in a backup library.
T+90 days: Review the decommissioned batch programs. If no issues have arisen in 90 days, the archive can be considered cold storage. It remains available but is no longer maintained.
T+365 days: Final review. If the real-time system has operated without reverting to batch for one full year, the batch archive can be formally retired. Even at this point, the code is not deleted — it is moved to long-term archive. In regulated industries like healthcare and banking, source code may need to be retained for 7+ years for audit purposes.
Maria Chen insists on this conservative timeline. "I've seen systems that ran perfectly for six months and then failed on year-end processing — because year-end volumes are three times normal, and the real-time system had never been tested at that scale. Keep the batch safety net until you've been through every seasonal peak."
Lessons from the Migration
Lesson 1: The Parallel Run is Everything
The 30-day parallel run was the most expensive phase of the project — it required running both systems simultaneously, building a reconciliation program, and investigating every discrepancy. It was also the most valuable phase. The parallel run found three bugs in the real-time system that would have caused production failures:
- A timezone issue where MedClaim's timestamp used Eastern time but GlobalBank's expected UTC
- A rounding difference where COMP-3 arithmetic and DB2 DECIMAL arithmetic produced slightly different results for certain amounts
- A message ordering issue where rapid-fire claims for the same member arrived out of order, causing optimistic lock failures
All three were found and fixed during the parallel run, before they could affect production.
Lesson 2: Idempotent Design Saves Lives
During the parallel run, there were several instances where MQ delivered duplicate messages (typically after a queue manager restart). Because HSA-EVENTS checked for duplicate message IDs before processing, these duplicates were silently ignored. Without idempotent design, each duplicate would have caused a double debit — taking money from a member's HSA account twice.
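The duplicate check itself is a single indexed lookup. A sketch, assuming the HSA_TRANSACTIONS table described in this chapter and an illustrative column and paragraph name:

```cobol
           EXEC SQL
               SELECT COUNT(*)
                 INTO :WS-DUP-COUNT
                 FROM HSA_TRANSACTIONS
                WHERE MESSAGE_ID = :WS-MSG-ID
           END-EXEC
           IF WS-DUP-COUNT > 0
      *        Duplicate delivery (for example, after a queue manager
      *        restart): consume the message without applying a
      *        second debit.
               PERFORM 9100-DISCARD-DUPLICATE
           END-IF
```

The cost is one probe against an index on the message ID column; the benefit is that "at least once" delivery from MQ becomes "exactly once" processing in the application.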
Lesson 3: Monitoring Must Be Built-In, Not Bolted On
The monitoring program (HSA-MONITOR) was built in Phase 5, before the parallel run. This meant the team had monitoring data from day one. They could see processing rates, error rates, and queue depths in real time, which made investigating reconciliation discrepancies much faster.
Lesson 4: Both Teams Must Understand Both Systems
Priya insisted that Derek Washington spend time learning MedClaim's system and that James Okafor spend time learning GlobalBank's. "You cannot debug a message that crosses organizational boundaries if you only understand half the path." This cross-training proved invaluable during the parallel run when the timezone bug required understanding both systems' date handling.
Lesson 5: Batch is Not the Enemy
The migration replaced batch only for time-sensitive HSA payments. Batch remains the right tool for reconciliation, reporting, and archival, and the preserved batch path is the rollback plan. Real-time and batch are complements, not rivals.
💡 A Note on Distributed Transactions. The two-phase commit that CICS provides between DB2 and MQ is a luxury that many distributed systems do not have. If GlobalBank's consumer were a microservice running in a cloud environment, coordinating the database update and the message send would require the Outbox Pattern (write the message to a database table, then have a separate process read the table and send the message) or the Saga Pattern (a sequence of local transactions with compensating actions for rollback). The mainframe's integrated transaction manager makes this much simpler — but understanding the distributed alternatives helps you appreciate what CICS does behind the scenes.
Lesson 6: Design for Operations, Not Just Development
The monitoring program, the runbooks, the reconciliation reports — these are not afterthoughts. They are as important as the core processing programs. A system that works perfectly but cannot be monitored, debugged, or rolled back is a system that will eventually cause a production crisis.
Priya estimates that 30% of the project effort went into "operational infrastructure" — monitoring, reconciliation, alerting, runbook creation, and capacity planning. This ratio is typical for production-grade real-time systems. Development teams that allocate 100% of their time to "the application" and 0% to operations invariably pay for it later, usually at 3 AM on a Saturday.
Lesson 7: Cross-Training is a Risk Mitigation Strategy
At the end of the project, Derek Washington understands MedClaim's claim adjudication system nearly as well as James Okafor does. Maria Chen understands GlobalBank's HSA processing as well as Derek. This cross-training was not a nice-to-have — it was an explicit project deliverable.
Consider the alternative: if only James understands the MedClaim side and only Derek understands the GlobalBank side, a problem that spans both systems (like the timezone bug) requires both people to be available simultaneously. Cross-training reduces this dependency and improves the team's resilience.
Working with the Student Mainframe Lab
Simulating Message Queuing Without MQ
The Student Mainframe Lab does not have IBM MQ installed. But the core concepts of this capstone can be practiced using simulated message passing through sequential files or VSAM queues.
Approach 1: File-Based Message Simulation.
Replace the MQ PUT with a sequential file WRITE and the MQ GET with a sequential file READ. The "queue" is a sequential file. The "producer" writes JSON-formatted records; the "consumer" reads them.
* Simulated MQ PUT (producer side)
WRITE MSG-RECORD FROM WS-MQ-MESSAGE
* Simulated MQ GET (consumer side)
READ MSG-FILE INTO WS-MQ-MESSAGE
AT END SET WS-QUEUE-EMPTY TO TRUE
END-READ
This approach loses the asynchronous, guaranteed-delivery properties of MQ, but it preserves the fundamental pattern: one program generates messages, another program consumes them, and the message format is JSON.
Approach 2: VSAM Queue Simulation.
Use a VSAM ESDS (Entry-Sequenced Data Set) as a message queue. The producer adds records to the end of the ESDS; the consumer reads from the beginning, tracking its position with a "cursor" record in a separate VSAM KSDS.
This approach more closely simulates MQ behavior because VSAM ESDS supports concurrent access from multiple programs — the producer can write while the consumer reads. It does not provide guaranteed delivery or dead-letter queue functionality, but it is a useful approximation.
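A minimal consumer-side sketch of the cursor idea, with file and field names invented for the lab (dynamic-access VSAM assumed; WS-MSGS-PROCESSED is the count stored in the cursor record):

```cobol
      *    On restart, skip the messages a previous run already
      *    handled, then drain the rest of the queue.
           READ CURSOR-FILE INTO WS-CURSOR-REC
           PERFORM WS-MSGS-PROCESSED TIMES
               READ QUEUE-FILE NEXT RECORD
                   AT END SET WS-QUEUE-EMPTY TO TRUE
               END-READ
           END-PERFORM
           PERFORM UNTIL WS-QUEUE-EMPTY
               READ QUEUE-FILE NEXT RECORD
                   AT END SET WS-QUEUE-EMPTY TO TRUE
                   NOT AT END
                       PERFORM 2000-PROCESS-MESSAGE
                       ADD 1 TO WS-MSGS-PROCESSED
               END-READ
           END-PERFORM
      *    Persist the cursor so the next run resumes correctly.
           REWRITE CURSOR-REC FROM WS-CURSOR-REC
```

Rewriting the cursor once per cycle (rather than once per message) trades a small replay window for simplicity, which mirrors the idempotency discussion earlier in the chapter: replayed messages are harmless if the consumer checks for duplicates.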
GnuCOBOL Adaptations
For students using GnuCOBOL on their local machines:
- JSON GENERATE/PARSE: JSON PARSE is an IBM Enterprise COBOL v6+ feature that GnuCOBOL does not provide, and JSON GENERATE is available only in recent GnuCOBOL releases built with a JSON library. To stay portable, build the JSON string manually using STRING:
STRING '{"messageId":"' DELIMITED BY SIZE
WS-MSG-ID DELIMITED BY SPACES
'","claimId":"' DELIMITED BY SIZE
WS-CLAIM-ID DELIMITED BY SPACES
'","amount":' DELIMITED BY SIZE
WS-AMOUNT-DISPLAY DELIMITED BY SPACES
'}' DELIMITED BY SIZE
INTO WS-JSON-OUTPUT
END-STRING
- DB2 Queries: Replace EXEC SQL with file I/O against indexed files. The duplicate check becomes a VSAM KSDS READ by message ID; the account lookup becomes a VSAM KSDS READ by member ID.
- CICS Commands: Replace EXEC CICS with standard batch processing. The trigger mechanism becomes a polling loop that checks for new records in the simulated queue file.
- Monitoring: Replace the CICS interval control with a simple batch program that runs as a scheduled job (cron job on Linux, Task Scheduler on Windows). The program reads the simulated queue files and reports on message counts and processing statistics.
The key learning objective is not the specific APIs (MQ, DB2, CICS) but the patterns: event-driven design, idempotent processing, optimistic locking, reconciliation, and monitoring. These patterns apply regardless of the underlying technology.
Architectural Alternatives: What Else Could They Have Built?
The MQ-based event-driven architecture is not the only way to migrate from batch to real-time. Priya evaluated three alternatives before recommending MQ. Understanding why she chose MQ — and why the alternatives were rejected — provides valuable architectural perspective.
Alternative 1: Shared Database
Instead of message queuing, both organizations could share a single DB2 database. MedClaim writes adjudicated claims to a shared table; GlobalBank reads from the same table and processes HSA debits.
Advantages: Simple. No message infrastructure. No format conversion.
Disadvantages: Tight coupling — both organizations depend on the same database. A schema change by one organization breaks the other. Security is complex (both organizations need access to the same DB2 subsystem). Performance suffers because both organizations compete for the same database resources.
Priya rejected this approach because it creates a single point of failure that spans organizational boundaries. "If that database goes down, both organizations stop processing. With MQ, each organization can continue independent processing — messages accumulate and are processed when connectivity is restored."
Alternative 2: REST API with Polling
MedClaim exposes a REST API that returns adjudicated HSA-eligible claims. GlobalBank polls this API every few seconds, retrieves new claims, and processes them.
Advantages: Uses standard HTTP. Easy to implement with CICS web services (as demonstrated in Chapter 44). No message infrastructure required.
Disadvantages: Polling is wasteful — most requests return no new data. Latency is limited by the polling interval (if you poll every 30 seconds, average latency is 15 seconds). Error handling is complex — if GlobalBank's polling process fails, it must remember where it left off when it restarts. No guaranteed delivery — if a claim is adjudicated between polls and GlobalBank misses a poll cycle, the claim could be missed.
This approach would work for lower-volume, less critical integrations. For financial transactions where every claim must be processed exactly once, the lack of guaranteed delivery is a dealbreaker.
Alternative 3: Direct CICS-to-CICS Communication
CICS supports distributed program linking (DPL) — a CICS program on one LPAR can call a CICS program on another LPAR as if it were a local CALL. MedClaim's CLM-ADJUD could directly invoke GlobalBank's HSA-EVENTS using DPL.
Advantages: Synchronous — the adjudication waits for the HSA debit to complete, ensuring real-time confirmation. Simple — no message infrastructure, no reconciliation needed.
Disadvantages: Tight coupling — CLM-ADJUD cannot complete until HSA-EVENTS responds. If GlobalBank's CICS region is down, MedClaim's adjudication stops. Performance — the synchronous call adds latency to every adjudication, even for claims that are not HSA-eligible. Scalability — each concurrent adjudication ties up a CICS task on both LPARs.
Priya rejected this approach for its tight coupling. "If GlobalBank has a CICS outage, MedClaim stops adjudicating claims. That's 500,000 claims per month that would be delayed. The business cannot accept that risk."
Why MQ Won
MQ provides the best combination of loose coupling (each organization operates independently), guaranteed delivery (no message loss), and scalability (messages can be processed at the consumer's pace). The cost is complexity — MQ infrastructure, message format design, idempotent processing, reconciliation — but this complexity is manageable and well-understood on the mainframe platform.
"Every architecture is a set of tradeoffs," Priya tells the steering committee. "MQ trades simplicity for resilience. For financial transactions between two organizations, resilience wins."
Even after the cutover, batch processing continues to play a role. The reconciliation program is a batch job. The monitoring program's historical reports are batch jobs. The cleanup and archival of old transaction data are batch jobs. Real-time does not eliminate batch — it reduces the dependency on batch for time-sensitive operations.
🔴 Theme: Legacy != Obsolete. The batch programs that were "replaced" by real-time processing are still in the load library. They still work. If the real-time system experienced a catastrophic failure (MQ down, DB2 down, network outage), the batch path could be reactivated within minutes. The legacy batch system is not obsolete — it is a safety net. It earned that role through 18 years of reliable operation, and it will keep that role for as long as the real-time system needs a fallback.
✅ Theme: The Modernization Spectrum. This project moved the HSA payment system from one end of the modernization spectrum (batch, flat files, SFTP) to the other (real-time, MQ, DB2, JSON). But it did so incrementally: first the message infrastructure, then the producer, then the consumer, then the parallel run, then the monitoring, then the cutover. At every phase, the system was fully operational. At no point did the migration require downtime or data loss.
🔗 Theme: The Human Factor. The migration succeeded because two organizations trusted each other's teams. Maria Chen trusted James Okafor's MedClaim changes. James trusted Derek's GlobalBank changes. Priya bridged both teams. The technical architecture was well-designed, but the human architecture — trust, communication, shared understanding — was what made the project work.
Operational Runbooks
After cutover, Priya creates operational runbooks — step-by-step procedures for handling common and uncommon situations. These runbooks are essential because the event-driven system operates 24/7, and the on-call engineer may not be someone who built the system.
Runbook 1: DLQ Message Investigation
Trigger: HSA-MONITOR alert for DLQ depth > 0.
Steps:
- Connect to MQ Explorer and browse the DLQ (MEDCLAIM.HSA.PAYMENTS.DLQ).
- Examine the MQ message header — specifically the MQMD.BackoutCount and MQMD.Feedback fields. These indicate why MQ moved the message to the DLQ.
- View the message body. Is it valid JSON? If not, the problem is on the MedClaim producer side. Contact MedClaim operations.
- If the JSON is valid, attempt to identify the claim ID and member ID. Check whether the corresponding HSA account exists in GlobalBank's DB2.
- If the account does not exist, this is a data mismatch — MedClaim has an HSA-eligible member that GlobalBank does not recognize. Contact the HSA account team to investigate.
- If the account exists and the JSON is valid, the failure is likely a transient error that exhausted retries. Verify that the root cause (DB2 availability, CICS region status) has been resolved.
- Manually resubmit the message by moving it from the DLQ back to the main queue using the MQ amqsput sample utility or a custom resubmission program.
- Monitor to confirm the resubmitted message is processed successfully.
- Document the incident in the HSA operations log.
Runbook 2: Queue Depth Growing
Trigger: HSA-MONITOR alert for queue depth > 1,000 or sustained growth.
Steps:
- Check the CICS region hosting HSA-EVENTS. Is the transaction running? Use CEMT I TRAN(HEVT) to verify.
- If the transaction is not running, investigate the CICS system log (CSMT) for ABENDs. Restart the transaction if appropriate.
- If the transaction is running but processing slowly, check DB2 performance. Use the DB2 Performance Monitor to look for lock contention, buffer pool misses, or long-running queries.
- If DB2 is healthy and the transaction is running, check MQ channel status. The channel connecting MedClaim and GlobalBank LPARs may be down.
- If all components are healthy but volume is higher than expected, this may be a legitimate spike (e.g., MedClaim re-adjudication batch). Monitor but do not act unless the queue depth exceeds MAXDEPTH.
- If MAXDEPTH is approaching, increase it temporarily using ALTER QLOCAL. Do not restart the queue manager.
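The temporary depth increase in the last step is a single MQSC command (the queue name is from this chapter; the new limit is illustrative):

```mqsc
* Check current usage first, then raise the ceiling.
DISPLAY QLOCAL(MEDCLAIM.HSA.PAYMENTS) CURDEPTH MAXDEPTH
ALTER QLOCAL(MEDCLAIM.HSA.PAYMENTS) MAXDEPTH(200000)
```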
Runbook 3: Emergency Rollback to Batch
Trigger: Real-time system is completely unavailable and cannot be restored within 4 hours.
Steps:
- Notify both GlobalBank and MedClaim management chains.
- On MedClaim's LPAR, restore the batch flat file JCL from the archive library. This is a JCL change — no program changes are needed.
- Run the evening batch cycle. All claims adjudicated since the real-time system failed will be included in the batch flat file.
- On GlobalBank's LPAR, verify that the batch HSA-PROC program is still in the production load library (it should be, per the decommissioning timeline).
- Run GlobalBank's batch cycle against the flat file.
- After both batches complete, run HSA-RECON to reconcile any transactions that were partially processed by the real-time system before the failure.
- Any duplicates (transactions processed by real-time before the failure AND by batch after the rollback) will appear in the reconciliation report. Process reversals as needed.
- Keep batch running until the real-time system is fully restored and tested.
- When the real-time system is restored, run a special reconciliation to verify it processes correctly before disabling batch again.
These runbooks are stored in the operations knowledge base and reviewed quarterly. Every new team member must walk through each runbook as part of their onboarding.
⚖️ Runbooks as Documentation of Design Intent. A good runbook does more than list steps — it explains why each step matters and what to look for. The DLQ investigation runbook, for example, distinguishes between data mismatches, transient errors, and producer-side problems. This classification helps the on-call engineer understand the system's design, not just its operation. Over time, runbooks become the primary way that system knowledge transfers from the builders to the operators.
Understanding the End-to-End Transaction Flow
To fully appreciate the migration from batch to real-time, it helps to trace a single transaction through the entire system, from the moment a doctor submits a claim to the moment the HSA debit appears on the member's account.
The Journey of Claim CLM000098765
9:15 AM — Claim Submission. A medical office submits a claim for patient Sarah Mitchell through MedClaim's CICS provider portal. The claim is for a routine office visit: diagnosis code J06.9 (upper respiratory infection), procedure code 99213 (established patient office visit), charged amount $175.00.
9:15:02 AM — Claim Receipt. MedClaim's CLM-INTAKE program (modernized in Chapter 44) receives the claim, validates the provider, validates the member, and writes it to the claims DB2 table with status 'RCV' (received).
9:15:05 AM — Adjudication. The CLM-ADJUD program processes the claim. It checks Sarah Mitchell's coverage: she has a MedClaim PPO plan with a $30 copay for office visits. The allowed amount for procedure 99213 under her plan is $150.00. After the $30 copay, MedClaim's payment to the provider is $120.00. The member's responsibility (charged amount minus allowed amount plus copay) is $55.00.
9:15:05 AM — HSA Eligibility Check. CLM-ADJUD checks whether Sarah Mitchell's member record indicates HSA eligibility. It does — she has a high-deductible health plan with a linked GlobalBank HSA. The member's out-of-pocket amount ($55.00) is eligible for HSA payment.
9:15:06 AM — MQ Message Published. CLM-ADJUD builds a JSON message containing the claim ID, member ID, payment amount ($55.00), and other details. It PUTs the message on the MEDCLAIM.HSA.PAYMENTS queue with message ID MSG20240315091506001.
9:15:06 AM — MQ Delivery. The IBM MQ channel between MedClaim and GlobalBank picks up the message. The message is transmitted across the private network connecting the two organizations. MQ acknowledges delivery to the MedClaim queue manager.
9:15:07 AM — Consumer Triggered. On GlobalBank's LPAR, the arrival of the message triggers the HSA-EVENTS CICS transaction. The CICS trigger monitor detects the message and starts the transaction.
9:15:07 AM — Duplicate Check. HSA-EVENTS reads the message and checks DB2: has a transaction with message ID MSG20240315091506001 already been processed? No — this is a new message.
9:15:07 AM — Account Lookup. HSA-EVENTS queries the HSA_ACCOUNTS table using Sarah Mitchell's member ID. It finds her HSA account: HSA0045678, current balance $3,805.00, status Active.
9:15:08 AM — Debit Applied. HSA-EVENTS updates Sarah Mitchell's HSA balance: $3,805.00 - $55.00 = $3,750.00. The UPDATE uses optimistic locking — it includes AND HSA_BALANCE = 3805.00 to ensure no other transaction has modified the balance since the SELECT.
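The optimistic-lock UPDATE can be sketched as follows, with table, column, and paragraph names assumed from this chapter's description:

```cobol
           EXEC SQL
               UPDATE HSA_ACCOUNTS
                  SET HSA_BALANCE = HSA_BALANCE - :WS-DEBIT-AMOUNT
                WHERE ACCOUNT_ID  = :WS-ACCOUNT-ID
                  AND HSA_BALANCE = :WS-BALANCE-AT-SELECT
           END-EXEC
           IF SQLCODE = +100
      *        Zero rows updated: another transaction changed the
      *        balance after our SELECT. Re-read the account and
      *        retry the debit.
               PERFORM 3100-REREAD-AND-RETRY
           END-IF
```

The extra predicate costs nothing when there is no contention, and turns a silent lost update into a detectable retry when there is.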
9:15:08 AM — Transaction Recorded. HSA-EVENTS inserts a row in the HSA_TRANSACTIONS table recording the debit: message ID, claim ID, account ID, amount, new balance, status SUCCESS.
9:15:08 AM — Confirmation Sent. HSA-EVENTS generates a JSON confirmation message and PUTs it on the GLOBALBANK.HSA.CONFIRMS queue.
9:15:08 AM — SYNCPOINT. CICS commits the unit of work. The DB2 update, the DB2 insert, and the MQ PUT are all committed atomically. If any had failed, all would have been rolled back.
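In CICS, that atomicity costs one statement. A sketch of the commit point (the status flag is illustrative):

```cobol
           IF WS-PROCESSING-OK
      *        Commits the DB2 UPDATE, the DB2 INSERT, and the MQ PUT
      *        as a single unit of work.
               EXEC CICS SYNCPOINT END-EXEC
           ELSE
      *        Backs all three out together.
               EXEC CICS SYNCPOINT ROLLBACK END-EXEC
           END-IF
```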
9:15:09 AM — Confirmation Received. On MedClaim's LPAR, the CLM-EVENTS program picks up the confirmation message. It updates the claim record in MedClaim's DB2 to reflect that the HSA payment has been processed.
Total elapsed time: 4 seconds — from claim adjudication to HSA debit confirmed.
Under the old batch system, this would have taken 24-48 hours. Sarah Mitchell would have seen the charge on her HSA account the next day (or the day after). Now she sees it in under 5 seconds — while she is still at the doctor's office.
This trace illustrates every major concept in this capstone: JSON messaging, MQ delivery, idempotent processing, optimistic locking, transactional messaging, and cross-organizational coordination. It also shows how the five programs in the real-time system (CLM-ADJUD, HSA-EVENTS, CLM-EVENTS, HSA-RECON, HSA-MONITOR) work together as a cohesive system — each program handling one responsibility, communicating through well-defined messages.
The Final Architecture
After cutover, the production system looks like this:
MedClaim LPAR GlobalBank LPAR
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ CLM-ADJUD │ ──MQ──→ │ HSA-EVENTS │
│ (adjudicates │ │ (processes │
│ claims, puts │ │ HSA debits) │
│ MQ messages) │ │ │
│                 │ ←──MQ── │  (confirmations │
│  CLM-EVENTS     │         │  from           │
│  (processes     │         │  HSA-EVENTS)    │
│   confirmations)│         │                 │
│ │ │ HSA-MONITOR │
│ │ │ (health checks)│
│ │ │ │
│ │ │ HSA-RECON │
│ │ │ (reconciliation)│
│ │ │ │
└─────────────────┘ └─────────────────┘
Performance metrics after 30 days of production:
| Metric | Target | Actual |
|---|---|---|
| End-to-end latency | < 30 seconds | 12 seconds (avg) |
| Message delivery reliability | 100% | 100% |
| Processing accuracy | 100% | 100% |
| System availability | 99.9% | 99.97% |
| DLQ messages (30 days) | 0 | 0 |
| Reconciliation mismatches | 0 | 0 |
Summary: The Complete Journey
This capstone brought together every topic in this textbook:
- Data definition (Parts I-II): Copybooks, COMP-3 fields, 88-level conditions
- File processing (Parts III-IV): Sequential files, VSAM, DB2, file status handling
- Program design (Parts V-VI): Structure charts, subprograms, modular architecture
- CICS programming (Part VII): Online transactions, BMS maps, pseudo-conversational design
- Modern techniques (Part VIII): JSON, web services, MQ, event-driven architecture
- Testing and deployment (Part VIII): JCL, parallel runs, reconciliation, CI/CD
All five themes converge in this final capstone:
- Legacy != Obsolete: The batch system remains as a safety net; COBOL proves capable of modern real-time processing
- Readability is a Feature: Every program uses consistent naming, 88-levels, and clear structure
- The Modernization Spectrum: The migration was incremental, reversible, and delivered value at every phase
- Defensive Programming: Idempotent processing, duplicate checking, optimistic locking, dead-letter queues, rollback plans
- The Human Factor: Cross-organizational trust, team cross-training, and clear communication made the technical solution possible
The Three Capstones: A Retrospective
Looking back across all three capstones, a clear progression emerges — not just in technical complexity, but in what it means to be a professional COBOL developer.
Capstone 1: Learning to Build
In Capstone 1, Derek Washington built a banking system from scratch. He controlled every decision: the data design, the program structure, the error handling, the JCL. The system had no history, no legacy constraints, no competing stakeholders. This is the simplest kind of engineering — greenfield development with full autonomy.
The key lesson: building a system teaches you how the parts fit together. Before you can maintain, modernize, or migrate a system, you must understand how programs share data through copybooks, how JCL orchestrates job streams, how CICS provides online access, and how VSAM stores data. Capstone 1 taught those fundamentals.
Capstone 2: Learning to Improve
In Capstone 2, James Okafor modernized a legacy insurance system. He did not control the original design — he inherited it. He could not start over — the system was in production, processing half a million claims per month. Every change had to preserve existing behavior while improving maintainability, testability, and accessibility.
The key lesson: improving a system teaches you humility and discipline. The legacy code was not written by bad programmers — it was written by people solving problems with the tools they had at the time. James's job was not to judge the original design but to evolve it. The five-phase modernization (document, refactor, DB2, API, CI/CD) is a template that applies to any legacy system.
Capstone 3: Learning to Integrate
In Capstone 3, Priya Kapoor migrated a batch process to real-time event-driven processing across two organizations. She controlled neither the MedClaim system nor the GlobalBank system — she had to work within the constraints of both. The technical challenges (MQ messaging, idempotent processing, reconciliation) were significant, but the organizational challenges (cross-team trust, coordinated cutover, shared monitoring) were equally demanding.
The key lesson: integrating systems teaches you that technology is the easy part. Getting MQ to deliver messages is straightforward. Getting two organizations to agree on message formats, error handling procedures, cutover timing, and rollback criteria requires diplomacy, patience, and clear communication. Priya's role as the bridge between GlobalBank and MedClaim was as important as her technical design.
The Career Arc
These three capstones mirror a typical mainframe developer's career arc:
Years 1-2: Building and maintaining individual programs. Understanding copybooks, file handling, CICS, and JCL. This is Capstone 1 territory — learning the fundamentals by building things.
Years 3-7: Taking ownership of subsystems. Leading modernization efforts. Designing for testability and maintainability. This is Capstone 2 territory — improving existing systems while keeping them running.
Years 7+: Architecting cross-system integrations. Making technology decisions with multi-year implications. Mentoring junior developers. This is Capstone 3 territory — thinking beyond individual programs to systems of systems.
Derek Washington entered this textbook as a Capstone 1 developer. By participating in Priya's Capstone 3 project, he has glimpsed where his career can go. The path from "I can write a COBOL program" to "I can architect a cross-organizational real-time system" is long, but every step is built on the fundamentals.
Closing Thoughts
You began this textbook as a student who had completed a first COBOL course. You end it as someone who has designed a banking system from scratch, modernized a legacy insurance system, and migrated a batch process to real-time event-driven architecture.
These are not academic exercises. They are the kinds of projects that mainframe COBOL developers work on every day at banks, insurance companies, government agencies, and healthcare organizations around the world. The systems you have learned to build, maintain, and modernize process trillions of dollars, serve billions of people, and form the invisible infrastructure of modern society.
What You Have Learned
Take a moment to reflect on the breadth of knowledge you have acquired:
- Data design: COMP-3 packed decimal for monetary precision, 88-level conditions for self-documenting code, copybooks for shared definitions, FILLER bytes for forward compatibility
- File processing: Sequential files for batch I/O, VSAM KSDS for keyed access, DB2 for relational data, file status checking for defensive programming
- Program design: Structure charts for planning, subprograms for modularity, LINKAGE SECTION for parameter passing, the read-ahead pattern for EOF handling
- Online programming: CICS pseudo-conversational design, BMS maps for screen I/O, COMMAREA for conversation state, EXEC CICS for system services
- Modern integration: JSON GENERATE and JSON PARSE for web interoperability, IBM MQ for guaranteed message delivery, CICS web services for API exposure, event-driven architecture for real-time processing
- Testing and deployment: JCL job streams with conditional execution, GDGs for version management, parallel runs for migration safety, reconciliation for data verification, CI/CD for automated testing
- Professional practice: Code reviews, documentation, error handling, audit trails, capacity planning, operational runbooks, cross-team collaboration
Each of these topics could fill a textbook on its own. Together, they form the toolkit of a professional mainframe COBOL developer — someone who can not only write code but design systems, manage migrations, and make architectural decisions that affect entire organizations.
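Several of the data-design techniques in that list often appear together in a single record layout. The sketch below is purely illustrative — the field names and picture clauses are hypothetical, not taken from the capstone systems — but it shows COMP-3 for monetary precision, 88-level condition names for self-documenting status codes, and a FILLER reserve for forward compatibility in one place:

```cobol
      *> Illustrative record layout (hypothetical fields, not from
      *> the capstone systems). Combines three techniques from the
      *> recap: COMP-3, 88-level conditions, and FILLER padding.
       01  HSA-TRANSACTION-RECORD.
           05  HSA-ACCOUNT-ID        PIC X(10).
      *> Packed decimal: exact cents, no binary floating-point drift.
           05  HSA-TXN-AMOUNT        PIC S9(9)V99 COMP-3.
      *> One status byte, self-documented by condition names.
           05  HSA-TXN-STATUS        PIC X(01).
               88  HSA-TXN-POSTED        VALUE 'P'.
               88  HSA-TXN-REJECTED      VALUE 'R'.
               88  HSA-TXN-PENDING       VALUE 'W'.
      *> Reserved bytes: future fields can be added without
      *> changing the record length for existing programs.
           05  FILLER                PIC X(20).
```

A program testing `IF HSA-TXN-POSTED` reads as business logic rather than as a comparison against a magic character — exactly the self-documentation benefit the list above describes.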
The Demand for COBOL Skills
As of this writing, the demand for COBOL developers exceeds the supply by a significant margin. Major banks, insurance companies, government agencies, and healthcare organizations are actively recruiting COBOL developers — not because they are nostalgic for the past, but because their mission-critical systems run on COBOL and need skilled people to maintain, modernize, and extend them.
The retirement wave among experienced COBOL developers is accelerating. Developers like Maria Chen (15+ years of experience) and James Okafor are approaching the later stages of their careers. The knowledge they carry — not just COBOL syntax, but deep understanding of business processes, system architecture, and operational practices — is at risk of being lost.
You are the solution to this problem. Every concept in this textbook, every program you have written, every design decision you have analyzed brings you closer to being the developer that these organizations need. The path from student to production-ready developer is not easy, but it is well-defined: learn the fundamentals, build complete systems, understand legacy code, and practice modern integration techniques. You have done all of these things.
A Final Word from the Team
COBOL is not a historical curiosity. It is a living, working language that powers the systems you depend on — whether you know it or not. The skills you have learned in this textbook are not just relevant today; they will be relevant for decades to come.
The project is complete. The real-time HSA system is in production. The batch safety net is in place. The monitoring is active. The runbooks are written. Both teams are cross-trained.
Priya Kapoor closes her laptop and looks at the project dashboard one last time. 150,000 transactions processed in the first 30 days. Zero errors. Zero data loss. 12-second average latency against a 30-second target.
She turns to the group on the conference bridge. "I want to thank everyone — James and the MedClaim team, Maria and Derek at GlobalBank. We migrated a financial processing system from batch to real-time across two organizations without a single production incident. That does not happen by accident. It happens because every person on this team did their job with discipline and care."
"And because the batch safety net never had to be used," Derek adds.
Maria Chen smiles. "The best safety net is the one you never need. But the second best safety net is the one that is there when you do."
Her parting words to Derek Washington capture the whole journey: "You came here thinking you'd be working on old technology. Now you know — there's nothing old about building systems that the world depends on."