Case Study 1: Parsing Bank Transaction Descriptions at Meridian National Bank
Background
Meridian National Bank (MNB) processes 3.2 million debit card and ACH transactions daily. Each transaction arrives from the payment network with a free-form description field -- a 60-character string that contains the merchant name, location, reference number, and sometimes additional codes, all concatenated together with no consistent delimiter. These descriptions were designed for human readability on bank statements, not for machine processing.
In 2024, MNB launched an initiative to categorize transactions automatically for their mobile banking app. Customers wanted to see spending breakdowns by category (groceries, restaurants, gas stations, subscriptions), and the categorization engine needed structured fields extracted from the unstructured description text.
The challenge: the description field follows no formal standard. Different payment networks and merchants format descriptions differently:
| Raw Description | Merchant | Location | Reference |
|---|---|---|---|
| `WALMART SUPERCTR 5274 DALLAS TX REF#829471` | WALMART SUPERCTR 5274 | DALLAS TX | 829471 |
| `SHELL OIL 57442 HOUSTON TX 77001` | SHELL OIL 57442 | HOUSTON TX | (embedded in name) |
| `AMZN MKTP US*RT4K92HF0 AMZN.COM/BILLWA` | AMZN MKTP US | (online) | RT4K92HF0 |
| `SQ *BELLA ROSA CAFE CHICAGO IL` | BELLA ROSA CAFE | CHICAGO IL | (none) |
| `PAYPAL *NETFLIX.COM 402-935-7733 CA` | NETFLIX.COM | CA | (none) |
James Chen, a COBOL developer on the batch processing team, was assigned to build a transaction description parser using COBOL's string handling facilities: UNSTRING, INSPECT, STRING, reference modification, and intrinsic functions.
The Problem
James needed to extract three structured fields from each 60-character description:
- Merchant Name (up to 30 characters) -- The primary business name, stripped of location and reference data
- Merchant Location (up to 20 characters) -- City and state, if present
- Reference Number (up to 15 characters) -- Any reference, confirmation, or trace number
Additionally, the parser needed to:
- Handle multiple description formats without failing
- Normalize merchant names to uppercase with no leading/trailing spaces
- Strip common prefixes like "SQ *", "PAYPAL *", and "AMZN MKTP US*"
- Count the number of successfully parsed and unparseable records
- Produce an output file with both the original and parsed fields for audit
The difficulty lies in the lack of consistent delimiters. Some descriptions use double spaces to separate sections, others use single spaces throughout. Some embed reference numbers after "REF#", others embed them after asterisks, and still others have no reference at all.
The Solution
Parsing Strategy
James developed a multi-pass parsing strategy:
- Pass 1 (INSPECT TALLYING): Count delimiter characters ("*", "#", "/") to identify the format type
- Pass 2 (Reference Modification): Compare the start of the description against a table of known prefixes and strip any match
- Pass 3 (Reference Modification): Scan for "REF#" or "CONF#" and extract the reference number that follows
- Pass 4 (Reference Modification with table lookup): Scan right to left for a valid state code and extract the city and state
- Pass 5 (UNSTRING and STRING): Split the remaining text into words and reassemble them as the merchant name
The Complete COBOL Program
IDENTIFICATION DIVISION.
PROGRAM-ID. TXNPARSE.
AUTHOR. JAMES CHEN.
DATE-WRITTEN. 2024-08-20.
*================================================================
* PROGRAM: TXNPARSE
* PURPOSE: Parse free-form transaction descriptions into
* structured merchant name, location, and
* reference number fields. Demonstrates STRING,
* UNSTRING, INSPECT, and reference modification.
*================================================================
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT TRANS-INPUT-FILE
ASSIGN TO "TRANSIN"
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-INPUT-STATUS.
SELECT PARSED-OUTPUT-FILE
ASSIGN TO "PARSOUT"
ORGANIZATION IS SEQUENTIAL
FILE STATUS IS WS-OUTPUT-STATUS.
DATA DIVISION.
FILE SECTION.
FD TRANS-INPUT-FILE
RECORDING MODE IS F
RECORD CONTAINS 89 CHARACTERS.
01 FS-INPUT-RECORD.
05 FS-IN-ACCOUNT PIC 9(10).
05 FS-IN-TRANS-DATE PIC 9(8).
05 FS-IN-AMOUNT PIC S9(9)V99.
05 FS-IN-DESCRIPTION PIC X(60).
FD PARSED-OUTPUT-FILE
RECORDING MODE IS F
RECORD CONTAINS 160 CHARACTERS.
01 FS-OUTPUT-RECORD.
05 FS-OUT-ACCOUNT PIC 9(10).
05 FS-OUT-TRANS-DATE PIC 9(8).
05 FS-OUT-AMOUNT PIC S9(9)V99.
05 FS-OUT-ORIG-DESC PIC X(60).
05 FS-OUT-MERCHANT-NAME PIC X(30).
05 FS-OUT-MERCHANT-LOC PIC X(20).
05 FS-OUT-REFERENCE PIC X(15).
05 FS-OUT-PARSE-STATUS PIC X(1).
05 FILLER PIC X(5).
WORKING-STORAGE SECTION.
*----------------------------------------------------------------
* FILE STATUS FIELDS
*----------------------------------------------------------------
01 WS-INPUT-STATUS PIC X(2).
88 INPUT-OK VALUE "00".
88 INPUT-EOF VALUE "10".
01 WS-OUTPUT-STATUS PIC X(2).
88 OUTPUT-OK VALUE "00".
*----------------------------------------------------------------
* WORKING FIELDS FOR PARSING
*----------------------------------------------------------------
01 WS-WORK-DESC PIC X(60).
01 WS-MERCHANT-NAME PIC X(30).
01 WS-MERCHANT-LOC PIC X(20).
01 WS-REFERENCE PIC X(15).
*----------------------------------------------------------------
* UNSTRING RECEIVING FIELDS
*----------------------------------------------------------------
01 WS-UNSTR-FIELDS.
05 WS-PART-1 PIC X(30).
05 WS-PART-2 PIC X(20).
05 WS-PART-3 PIC X(20).
05 WS-PART-4 PIC X(15).
*----------------------------------------------------------------
* UNSTRING CONTROL FIELDS
*----------------------------------------------------------------
01 WS-UNSTR-PTR PIC 9(3).
01 WS-UNSTR-TALLY PIC 9(3).
01 WS-DELIM-1 PIC X(5).
01 WS-DELIM-2 PIC X(5).
*----------------------------------------------------------------
* INSPECT COUNTERS
*----------------------------------------------------------------
01 WS-ASTERISK-COUNT PIC 9(3) VALUE ZERO.
01 WS-HASH-COUNT PIC 9(3) VALUE ZERO.
01 WS-SLASH-COUNT PIC 9(3) VALUE ZERO.
01 WS-DOUBLE-SPACE-POS PIC 9(3) VALUE ZERO.
*----------------------------------------------------------------
* STATE CODE VALIDATION TABLE
*----------------------------------------------------------------
01 WS-STATE-CODES.
* Each FILLER is sized to its literal so that the 3-byte
* entries of WS-STATE-TABLE line up with the state codes.
05 FILLER PIC X(60) VALUE
"AL AK AZ AR CA CO CT DE FL GA "
& "HI ID IL IN IA KS KY LA ME MD ".
05 FILLER PIC X(60) VALUE
"MA MI MN MS MO MT NE NV NH NJ "
& "NM NY NC ND OH OK OR PA RI SC ".
05 FILLER PIC X(30) VALUE
"SD TN TX UT VT VA WA WV WI WY ".
01 WS-STATE-TABLE REDEFINES WS-STATE-CODES.
05 WS-STATE-ENTRY PIC X(3)
OCCURS 50 TIMES.
01 WS-STATE-IDX PIC 9(3).
01 WS-FOUND-STATE PIC X(1).
88 STATE-FOUND VALUE 'Y'.
88 STATE-NOT-FOUND VALUE 'N'.
*----------------------------------------------------------------
* REFERENCE MODIFICATION WORK FIELDS
*----------------------------------------------------------------
01 WS-SCAN-POS PIC 9(3).
01 WS-SCAN-LEN PIC 9(3).
01 WS-REF-START PIC 9(3).
01 WS-REF-LEN PIC 9(3).
01 WS-CHAR PIC X(1).
*----------------------------------------------------------------
* PREFIX REMOVAL TABLE
*----------------------------------------------------------------
01 WS-PREFIX-TABLE.
05 WS-PREFIX-COUNT PIC 9(2) VALUE 6.
05 WS-PREFIX-DATA.
10 FILLER PIC X(15) VALUE "SQ * ".
10 FILLER PIC X(15) VALUE "PAYPAL * ".
10 FILLER PIC X(15) VALUE "AMZN MKTP US* ".
10 FILLER PIC X(15) VALUE "TST* ".
10 FILLER PIC X(15) VALUE "SP * ".
10 FILLER PIC X(15) VALUE "CKE* ".
05 WS-PREFIX-ENTRIES REDEFINES WS-PREFIX-DATA.
10 WS-PREFIX-ENTRY PIC X(15)
OCCURS 6 TIMES.
01 WS-PREFIX-IDX PIC 9(2).
01 WS-PREFIX-LEN PIC 9(2).
*----------------------------------------------------------------
* COUNTERS AND STATISTICS
*----------------------------------------------------------------
01 WS-COUNTERS.
05 WS-TOTAL-READ PIC S9(7) COMP-3
VALUE ZERO.
05 WS-TOTAL-PARSED PIC S9(7) COMP-3
VALUE ZERO.
05 WS-TOTAL-PARTIAL PIC S9(7) COMP-3
VALUE ZERO.
05 WS-TOTAL-FAILED PIC S9(7) COMP-3
VALUE ZERO.
05 WS-TOTAL-WRITTEN PIC S9(7) COMP-3
VALUE ZERO.
*----------------------------------------------------------------
* DISPLAY FIELDS
*----------------------------------------------------------------
01 WS-DISP-COUNT PIC Z,ZZZ,ZZ9.
PROCEDURE DIVISION.
0000-MAIN-CONTROL.
PERFORM 1000-INITIALIZE
PERFORM 2000-PROCESS-TRANSACTIONS
UNTIL INPUT-EOF
PERFORM 8000-DISPLAY-STATISTICS
PERFORM 9000-FINALIZE
STOP RUN
.
1000-INITIALIZE.
DISPLAY "========================================"
DISPLAY " TRANSACTION DESCRIPTION PARSER"
DISPLAY " MERIDIAN NATIONAL BANK"
DISPLAY "========================================"
OPEN INPUT TRANS-INPUT-FILE
OUTPUT PARSED-OUTPUT-FILE
IF NOT INPUT-OK
DISPLAY "ERROR: Cannot open input file. "
"Status: " WS-INPUT-STATUS
STOP RUN
END-IF
IF NOT OUTPUT-OK
DISPLAY "ERROR: Cannot open output file. "
"Status: " WS-OUTPUT-STATUS
STOP RUN
END-IF
PERFORM 2100-READ-TRANSACTION
.
2000-PROCESS-TRANSACTIONS.
ADD 1 TO WS-TOTAL-READ
INITIALIZE WS-MERCHANT-NAME
INITIALIZE WS-MERCHANT-LOC
INITIALIZE WS-REFERENCE
INITIALIZE WS-UNSTR-FIELDS
MOVE FS-IN-DESCRIPTION TO WS-WORK-DESC
* --- Pass 1: Analyze the description format ---
PERFORM 3000-ANALYZE-FORMAT
* --- Pass 2: Remove known prefixes ---
PERFORM 3100-REMOVE-PREFIXES
* --- Pass 3: Extract reference number ---
PERFORM 3200-EXTRACT-REFERENCE
* --- Pass 4: Extract location ---
PERFORM 3300-EXTRACT-LOCATION
* --- Pass 5: Extract merchant name ---
PERFORM 3400-EXTRACT-MERCHANT-NAME
* --- Build output record ---
PERFORM 4000-BUILD-OUTPUT
PERFORM 4100-WRITE-OUTPUT
PERFORM 2100-READ-TRANSACTION
.
2100-READ-TRANSACTION.
READ TRANS-INPUT-FILE
AT END
SET INPUT-EOF TO TRUE
END-READ
.
3000-ANALYZE-FORMAT.
* -------------------------------------------------------
* Use INSPECT to count delimiter characters in the
* description. This determines the parsing strategy.
* -------------------------------------------------------
MOVE ZERO TO WS-ASTERISK-COUNT
MOVE ZERO TO WS-HASH-COUNT
MOVE ZERO TO WS-SLASH-COUNT
INSPECT WS-WORK-DESC
TALLYING WS-ASTERISK-COUNT FOR ALL "*"
WS-HASH-COUNT FOR ALL "#"
WS-SLASH-COUNT FOR ALL "/"
.
3100-REMOVE-PREFIXES.
* -------------------------------------------------------
* Check if the description starts with a known prefix
* (SQ *, PAYPAL *, etc.) and remove it using reference
* modification.
* -------------------------------------------------------
PERFORM VARYING WS-PREFIX-IDX FROM 1 BY 1
UNTIL WS-PREFIX-IDX > WS-PREFIX-COUNT
* Determine the actual length of this prefix
* (excluding trailing spaces in the table entry).
* The literal below is TWO spaces, because prefixes
* such as "SQ *" contain single embedded spaces.
MOVE ZERO TO WS-PREFIX-LEN
INSPECT WS-PREFIX-ENTRY(WS-PREFIX-IDX)
TALLYING WS-PREFIX-LEN
FOR CHARACTERS BEFORE INITIAL "  "
* If prefix length is valid, check for a match
IF WS-PREFIX-LEN > 0 AND WS-PREFIX-LEN < 15
IF WS-WORK-DESC(1:WS-PREFIX-LEN) =
WS-PREFIX-ENTRY(WS-PREFIX-IDX)
(1:WS-PREFIX-LEN)
* Shift the description left, removing
* the prefix
MOVE SPACES TO WS-MERCHANT-NAME
COMPUTE WS-SCAN-LEN =
60 - WS-PREFIX-LEN
MOVE WS-WORK-DESC
(WS-PREFIX-LEN + 1:WS-SCAN-LEN)
TO WS-WORK-DESC
END-IF
END-IF
END-PERFORM
.
3200-EXTRACT-REFERENCE.
* -------------------------------------------------------
* Look for reference number patterns:
* 1. "REF#" followed by digits
* 2. "CONF#" followed by digits
* 3. Asterisk-delimited reference in certain formats
* Uses reference modification to scan and extract.
* -------------------------------------------------------
MOVE SPACES TO WS-REFERENCE
* Pattern 1: Look for "REF#"
IF WS-HASH-COUNT > 0
PERFORM VARYING WS-SCAN-POS FROM 1 BY 1
UNTIL WS-SCAN-POS > 55
IF WS-WORK-DESC(WS-SCAN-POS:4) = "REF#"
COMPUTE WS-REF-START =
WS-SCAN-POS + 4
COMPUTE WS-REF-LEN =
60 - WS-REF-START + 1
IF WS-REF-LEN > 15
MOVE 15 TO WS-REF-LEN
END-IF
MOVE WS-WORK-DESC
(WS-REF-START:WS-REF-LEN)
TO WS-REFERENCE
* Remove the REF# and number from the
* working description
MOVE SPACES TO
WS-WORK-DESC(WS-SCAN-POS:
60 - WS-SCAN-POS + 1)
END-IF
END-PERFORM
END-IF
* Pattern 2: Look for "CONF#"
IF WS-REFERENCE = SPACES AND WS-HASH-COUNT > 0
PERFORM VARYING WS-SCAN-POS FROM 1 BY 1
UNTIL WS-SCAN-POS > 54
IF WS-WORK-DESC(WS-SCAN-POS:5) = "CONF#"
COMPUTE WS-REF-START =
WS-SCAN-POS + 5
COMPUTE WS-REF-LEN =
60 - WS-REF-START + 1
IF WS-REF-LEN > 15
MOVE 15 TO WS-REF-LEN
END-IF
MOVE WS-WORK-DESC
(WS-REF-START:WS-REF-LEN)
TO WS-REFERENCE
MOVE SPACES TO
WS-WORK-DESC(WS-SCAN-POS:
60 - WS-SCAN-POS + 1)
END-IF
END-PERFORM
END-IF
.
3300-EXTRACT-LOCATION.
* -------------------------------------------------------
* Scan the description from right to left looking for
* a two-letter US state code preceded by a city name.
* Uses the state code validation table.
* -------------------------------------------------------
MOVE SPACES TO WS-MERCHANT-LOC
SET STATE-NOT-FOUND TO TRUE
* Scan from position 58 backward (state code is 2 chars)
PERFORM VARYING WS-SCAN-POS FROM 58 BY -1
UNTIL WS-SCAN-POS < 10 OR STATE-FOUND
* Check if this position has a potential state code
* (preceded by a space)
IF WS-SCAN-POS > 1
IF WS-WORK-DESC(WS-SCAN-POS - 1:1) = SPACE
PERFORM VARYING WS-STATE-IDX
FROM 1 BY 1
UNTIL WS-STATE-IDX > 50
OR STATE-FOUND
IF WS-WORK-DESC
(WS-SCAN-POS:2) =
WS-STATE-ENTRY(WS-STATE-IDX)
(1:2)
SET STATE-FOUND TO TRUE
* Extract city + state
* Scan backward from state to
* find start of city
PERFORM
3310-EXTRACT-CITY-STATE
END-IF
END-PERFORM
END-IF
END-IF
END-PERFORM
.
3310-EXTRACT-CITY-STATE.
* WS-SCAN-POS points to the state code.
* Walk backward to find the start of the city name,
* starting on the last character of the city (two
* columns to the left of the state code).
COMPUTE WS-REF-START = WS-SCAN-POS - 2
PERFORM VARYING WS-REF-START
FROM WS-REF-START BY -1
UNTIL WS-REF-START < 2
* A single space may sit inside a multi-word city
* name, so only a double space marks where the city
* starts. Single-spaced descriptions fall through
* with WS-REF-START = 1.
IF WS-WORK-DESC(WS-REF-START:1) = SPACE
AND WS-WORK-DESC(WS-REF-START - 1:1) = SPACE
ADD 1 TO WS-REF-START
EXIT PERFORM
END-IF
END-PERFORM
* Extract the location substring
COMPUTE WS-REF-LEN =
WS-SCAN-POS + 2 - WS-REF-START
IF WS-REF-LEN > 0 AND WS-REF-LEN <= 20
MOVE WS-WORK-DESC(WS-REF-START:WS-REF-LEN)
TO WS-MERCHANT-LOC
* Blank out the location from working description
MOVE SPACES TO
WS-WORK-DESC(WS-REF-START:WS-REF-LEN)
END-IF
.
3400-EXTRACT-MERCHANT-NAME.
* -------------------------------------------------------
* Whatever remains in the working description after
* removing the reference and location is the merchant
* name. Use INSPECT to clean it up.
* -------------------------------------------------------
* Convert to uppercase
MOVE FUNCTION UPPER-CASE(WS-WORK-DESC)
TO WS-WORK-DESC
* INSPECT REPLACING cannot collapse runs of spaces,
* because the replacement must be the same length as
* the text it replaces. The UNSTRING/STRING pair below
* does that work instead.
* Split the remaining text into up to three words
MOVE SPACES TO WS-MERCHANT-NAME
MOVE 1 TO WS-UNSTR-PTR
MOVE ZERO TO WS-UNSTR-TALLY
UNSTRING WS-WORK-DESC
DELIMITED BY ALL SPACES
INTO WS-PART-1
WS-PART-2
WS-PART-3
WITH POINTER WS-UNSTR-PTR
TALLYING IN WS-UNSTR-TALLY
END-UNSTRING
* Reassemble with single spaces between words
MOVE SPACES TO WS-MERCHANT-NAME
MOVE 1 TO WS-UNSTR-PTR
STRING WS-PART-1 DELIMITED BY "  "
" " DELIMITED BY SIZE
WS-PART-2 DELIMITED BY "  "
" " DELIMITED BY SIZE
WS-PART-3 DELIMITED BY "  "
INTO WS-MERCHANT-NAME
WITH POINTER WS-UNSTR-PTR
ON OVERFLOW
CONTINUE
END-STRING
.
4000-BUILD-OUTPUT.
* -------------------------------------------------------
* Assemble the output record from parsed fields.
* Determine parse quality status.
* -------------------------------------------------------
MOVE FS-IN-ACCOUNT TO FS-OUT-ACCOUNT
MOVE FS-IN-TRANS-DATE TO FS-OUT-TRANS-DATE
MOVE FS-IN-AMOUNT TO FS-OUT-AMOUNT
MOVE FS-IN-DESCRIPTION TO FS-OUT-ORIG-DESC
MOVE WS-MERCHANT-NAME TO FS-OUT-MERCHANT-NAME
MOVE WS-MERCHANT-LOC TO FS-OUT-MERCHANT-LOC
MOVE WS-REFERENCE TO FS-OUT-REFERENCE
* Determine parse status
IF WS-MERCHANT-NAME NOT = SPACES
IF WS-MERCHANT-LOC NOT = SPACES
MOVE "F" TO FS-OUT-PARSE-STATUS
ADD 1 TO WS-TOTAL-PARSED
ELSE
MOVE "P" TO FS-OUT-PARSE-STATUS
ADD 1 TO WS-TOTAL-PARTIAL
END-IF
ELSE
MOVE "X" TO FS-OUT-PARSE-STATUS
ADD 1 TO WS-TOTAL-FAILED
END-IF
.
4100-WRITE-OUTPUT.
WRITE FS-OUTPUT-RECORD
IF OUTPUT-OK
ADD 1 TO WS-TOTAL-WRITTEN
ELSE
DISPLAY "ERROR: Write failed. Status: "
WS-OUTPUT-STATUS
END-IF
.
8000-DISPLAY-STATISTICS.
DISPLAY " "
DISPLAY "========================================"
DISPLAY " PARSING STATISTICS"
DISPLAY "========================================"
MOVE WS-TOTAL-READ TO WS-DISP-COUNT
DISPLAY " Records read: " WS-DISP-COUNT
MOVE WS-TOTAL-PARSED TO WS-DISP-COUNT
DISPLAY " Fully parsed: " WS-DISP-COUNT
MOVE WS-TOTAL-PARTIAL TO WS-DISP-COUNT
DISPLAY " Partially parsed: " WS-DISP-COUNT
MOVE WS-TOTAL-FAILED TO WS-DISP-COUNT
DISPLAY " Failed to parse: " WS-DISP-COUNT
MOVE WS-TOTAL-WRITTEN TO WS-DISP-COUNT
DISPLAY " Records written: " WS-DISP-COUNT
DISPLAY "========================================"
.
9000-FINALIZE.
CLOSE TRANS-INPUT-FILE
PARSED-OUTPUT-FILE
.
Solution Walkthrough
Pass 1: Format Analysis with INSPECT TALLYING
The first step uses INSPECT TALLYING to count delimiter characters without modifying the source string. This non-destructive analysis determines which parsing patterns to apply:
INSPECT WS-WORK-DESC
TALLYING WS-ASTERISK-COUNT FOR ALL "*"
WS-HASH-COUNT FOR ALL "#"
WS-SLASH-COUNT FOR ALL "/"
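A single INSPECT statement counts three different characters in one pass over the field. If WS-HASH-COUNT > 0, the description likely contains a "REF#" or "CONF#" pattern. If WS-ASTERISK-COUNT > 0, it may be a Square, PayPal, or Amazon-format description with prefix delimiters. In TXNPARSE as written, only WS-HASH-COUNT gates a later pass (reference extraction); the asterisk and slash counts are collected but not yet tested.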
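The counting step can be tried in isolation. The following is a minimal fixed-format sketch, not part of TXNPARSE: the program name TALLYDEMO is hypothetical, and the sample description is taken from the table in the Background section.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. TALLYDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Sample description from the Background table.
       01  WS-DESC            PIC X(60) VALUE
           "AMZN MKTP US*RT4K92HF0 AMZN.COM/BILLWA".
       01  WS-ASTERISK-COUNT  PIC 9(3) VALUE ZERO.
       01  WS-HASH-COUNT      PIC 9(3) VALUE ZERO.
       01  WS-SLASH-COUNT     PIC 9(3) VALUE ZERO.
       PROCEDURE DIVISION.
      * TALLYING adds to the counters, so reset them first.
           MOVE ZERO TO WS-ASTERISK-COUNT
                        WS-HASH-COUNT
                        WS-SLASH-COUNT
           INSPECT WS-DESC
               TALLYING WS-ASTERISK-COUNT FOR ALL "*"
                        WS-HASH-COUNT     FOR ALL "#"
                        WS-SLASH-COUNT    FOR ALL "/"
           DISPLAY "ASTERISKS=" WS-ASTERISK-COUNT
           DISPLAY "HASHES="    WS-HASH-COUNT
           DISPLAY "SLASHES="   WS-SLASH-COUNT
           STOP RUN.
```

Because TALLYING increments rather than assigns, the reset before the INSPECT matters as soon as the statement runs once per record, which is why 3000-ANALYZE-FORMAT zeroes all three counters on every call.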
Pass 2: Prefix Removal with Reference Modification
Known prefixes like "SQ *" and "PAYPAL *" are stored in a table rather than hard-coded in IF statements. The removal logic uses reference modification to compare and shift:
IF WS-WORK-DESC(1:WS-PREFIX-LEN) =
WS-PREFIX-ENTRY(WS-PREFIX-IDX)(1:WS-PREFIX-LEN)
MOVE WS-WORK-DESC(WS-PREFIX-LEN + 1:WS-SCAN-LEN)
TO WS-WORK-DESC
END-IF
The expression WS-WORK-DESC(1:WS-PREFIX-LEN) extracts exactly WS-PREFIX-LEN characters from position 1 -- a substring comparison without using UNSTRING. When a match is found, the description is shifted left by moving the portion after the prefix to the beginning of the field.
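A minimal sketch of the compare-and-shift step for a single prefix follows. It makes several assumptions that are not in TXNPARSE: the program name PFXDEMO, the scratch field WS-SHIFT-BUF, the field WS-REST-LEN, and the double space in the sample data. The scratch field is used because the COBOL standard leaves the result of a MOVE undefined when the sending and receiving fields overlap, so copying out and back is the cautious variant of the shift.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PFXDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Sample data; the double space before CHICAGO is assumed.
       01  WS-WORK-DESC   PIC X(60) VALUE
           "SQ *BELLA ROSA CAFE  CHICAGO IL".
      * One table entry, padded with spaces to 15 characters.
       01  WS-PREFIX      PIC X(15) VALUE "SQ *".
       01  WS-PREFIX-LEN  PIC 9(2)  VALUE ZERO.
       01  WS-REST-LEN    PIC 9(2)  VALUE ZERO.
      * Hypothetical scratch field; not part of TXNPARSE.
       01  WS-SHIFT-BUF   PIC X(60) VALUE SPACES.
       PROCEDURE DIVISION.
      * Length = characters before the first double space.
           INSPECT WS-PREFIX TALLYING WS-PREFIX-LEN
               FOR CHARACTERS BEFORE INITIAL "  "
           IF WS-PREFIX-LEN > 0 AND WS-PREFIX-LEN < 15
               IF WS-WORK-DESC(1:WS-PREFIX-LEN) =
                  WS-PREFIX(1:WS-PREFIX-LEN)
                   COMPUTE WS-REST-LEN = 60 - WS-PREFIX-LEN
      * Copy the tail out, then back, so the sending and
      * receiving fields never overlap.
                   MOVE WS-WORK-DESC
                        (WS-PREFIX-LEN + 1:WS-REST-LEN)
                     TO WS-SHIFT-BUF
                   MOVE WS-SHIFT-BUF TO WS-WORK-DESC
               END-IF
           END-IF
           DISPLAY "[" WS-WORK-DESC(1:27) "]"
           STOP RUN.
```

The second MOVE pads the shortened text with trailing spaces automatically, because an alphanumeric MOVE fills the rest of the receiving field with blanks.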
Pass 3: Reference Number Extraction
The reference extraction scans the description using a sliding window implemented with reference modification:
IF WS-WORK-DESC(WS-SCAN-POS:4) = "REF#"
This tests four characters starting at WS-SCAN-POS. When "REF#" is found, everything after it (up to 15 characters) is captured as the reference number, and that portion of the working description is blanked out to prevent it from contaminating the merchant name extraction.
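The sliding window can be sketched as a small standalone program. The program name REFDEMO and the double spaces in the sample data are assumptions. Unlike TXNPARSE, the sketch stops scanning at column 56 so that at least one character always follows "REF#", and it leaves the loop as soon as a reference has been captured.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. REFDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Sample data; the double spaces are an assumption.
       01  WS-WORK-DESC   PIC X(60) VALUE
           "WALMART SUPERCTR 5274  DALLAS TX  REF#829471".
       01  WS-REFERENCE   PIC X(15) VALUE SPACES.
       01  WS-SCAN-POS    PIC 9(3).
       01  WS-REF-START   PIC 9(3).
       01  WS-REF-LEN     PIC 9(3).
       PROCEDURE DIVISION.
      * Stop at 56 so at least one character follows "REF#".
           PERFORM VARYING WS-SCAN-POS FROM 1 BY 1
                   UNTIL WS-SCAN-POS > 56
                      OR WS-REFERENCE NOT = SPACES
               IF WS-WORK-DESC(WS-SCAN-POS:4) = "REF#"
                   COMPUTE WS-REF-START = WS-SCAN-POS + 4
                   COMPUTE WS-REF-LEN = 60 - WS-REF-START + 1
                   IF WS-REF-LEN > 15
                       MOVE 15 TO WS-REF-LEN
                   END-IF
                   MOVE WS-WORK-DESC(WS-REF-START:WS-REF-LEN)
                     TO WS-REFERENCE
      * Blank from "REF#" to the end of the field.
                   MOVE SPACES TO WS-WORK-DESC(WS-SCAN-POS:)
               END-IF
           END-PERFORM
           DISPLAY "REF=[" WS-REFERENCE(1:6) "]"
           DISPLAY "DESC=[" WS-WORK-DESC(1:32) "]"
           STOP RUN.
```

Omitting the length in WS-WORK-DESC(WS-SCAN-POS:) means "from this position to the end of the field", which expresses the same blanking as the explicit length computation in the full program.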
Pass 4: Location Extraction with State Code Validation
The location extraction demonstrates a right-to-left scan -- a technique that is harder to implement with UNSTRING (which always scans left to right) but natural with reference modification. The program scans backward through the description looking for a valid two-character US state code preceded by a space. When found, it walks further backward to find the city name.
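The right-to-left scan is sketched below under stated assumptions: the program name STDEMO, the field WS-STATE-POS, and the four-entry state table are hypothetical, and the sample data assumes a double space before the city. The sketch also adds one check that TXNPARSE does not make: it requires a blank after the candidate as well as before it, so that the first two letters of a longer word are not mistaken for a state code.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. STDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Sample data; the double space before HOUSTON is assumed.
       01  WS-WORK-DESC    PIC X(60) VALUE
           "SHELL OIL 57442  HOUSTON TX 77001".
      * Abbreviated table for the sketch; TXNPARSE lists all 50.
       01  WS-STATE-CODES  PIC X(12) VALUE "CA IL NY TX ".
       01  WS-STATE-TABLE REDEFINES WS-STATE-CODES.
           05  WS-STATE-ENTRY  PIC X(3) OCCURS 4 TIMES.
       01  WS-SCAN-POS     PIC 9(3).
       01  WS-STATE-IDX    PIC 9(3).
      * Hypothetical field that remembers where the match was.
       01  WS-STATE-POS    PIC 9(3) VALUE ZERO.
       01  WS-FOUND-FLAG   PIC X    VALUE "N".
           88  STATE-FOUND          VALUE "Y".
       PROCEDURE DIVISION.
      * Starting at 58 keeps WS-SCAN-POS + 2 inside the field.
           PERFORM VARYING WS-SCAN-POS FROM 58 BY -1
                   UNTIL WS-SCAN-POS < 2 OR STATE-FOUND
      * Candidate must have a blank on both sides.
               IF WS-WORK-DESC(WS-SCAN-POS - 1:1) = SPACE
                  AND WS-WORK-DESC(WS-SCAN-POS + 2:1) = SPACE
                   PERFORM VARYING WS-STATE-IDX FROM 1 BY 1
                           UNTIL WS-STATE-IDX > 4 OR STATE-FOUND
                       IF WS-WORK-DESC(WS-SCAN-POS:2) =
                          WS-STATE-ENTRY(WS-STATE-IDX)(1:2)
                           SET STATE-FOUND TO TRUE
                           MOVE WS-SCAN-POS TO WS-STATE-POS
                       END-IF
                   END-PERFORM
               END-IF
           END-PERFORM
           IF STATE-FOUND
               DISPLAY "STATE=" WS-WORK-DESC(WS-STATE-POS:2)
                       " AT " WS-STATE-POS
           ELSE
               DISPLAY "NO STATE CODE FOUND"
           END-IF
           STOP RUN.
```

The match position is saved in WS-STATE-POS because PERFORM VARYING decrements WS-SCAN-POS once more before it re-tests the UNTIL condition. TXNPARSE sidesteps this by performing 3310-EXTRACT-CITY-STATE from inside the loop, while WS-SCAN-POS still points at the state code.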
Pass 5: Name Assembly with STRING
After removing the reference and location, the remaining text is the merchant name. The program uses UNSTRING to split it into words (delimited by spaces), then STRING to reassemble the words with exactly one space between them. This eliminates the multiple consecutive spaces that often appear in transaction descriptions. Only the first three words are kept, because the UNSTRING names three receiving fields.
STRING WS-PART-1 DELIMITED BY "  "
" " DELIMITED BY SIZE
WS-PART-2 DELIMITED BY "  "
" " DELIMITED BY SIZE
WS-PART-3 DELIMITED BY "  "
INTO WS-MERCHANT-NAME
The DELIMITED BY "  " phrase (a two-space literal) stops each part at its first double space, effectively trimming trailing blanks. The one-space literal " " with DELIMITED BY SIZE inserts exactly one space between parts.
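The split-and-reassemble idea can be seen end to end in a short sketch. The program name NAMEDEMO, the field WS-PTR, and the irregularly spaced sample name are assumptions. Because each UNSTRING part here holds a single word, the sketch delimits by the figurative constant SPACE, which gives the same result as the double-space literal for this input.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NAMEDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Sample name with irregular spacing (assumed data).
       01  WS-WORK-DESC      PIC X(60) VALUE
           "BELLA   ROSA  CAFE".
       01  WS-PART-1         PIC X(30) VALUE SPACES.
       01  WS-PART-2         PIC X(20) VALUE SPACES.
       01  WS-PART-3         PIC X(20) VALUE SPACES.
       01  WS-MERCHANT-NAME  PIC X(30) VALUE SPACES.
       01  WS-PTR            PIC 9(3)  VALUE 1.
       PROCEDURE DIVISION.
      * ALL SPACES treats each run of blanks as one delimiter.
           UNSTRING WS-WORK-DESC DELIMITED BY ALL SPACES
               INTO WS-PART-1 WS-PART-2 WS-PART-3
           END-UNSTRING
      * Each part stops at its first blank; one literal space
      * is inserted between parts.
           MOVE SPACES TO WS-MERCHANT-NAME
           MOVE 1 TO WS-PTR
           STRING WS-PART-1 DELIMITED BY SPACE
                  " "       DELIMITED BY SIZE
                  WS-PART-2 DELIMITED BY SPACE
                  " "       DELIMITED BY SIZE
                  WS-PART-3 DELIMITED BY SPACE
               INTO WS-MERCHANT-NAME
               WITH POINTER WS-PTR
           END-STRING
      * WS-PTR now points one past the last character written.
           DISPLAY "[" WS-MERCHANT-NAME(1:WS-PTR - 1) "]"
           STOP RUN.
```

After the STRING completes, the pointer doubles as a length: WS-PTR minus one is the number of characters placed in WS-MERCHANT-NAME.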
Lessons Learned
1. INSPECT Is the Swiss Army Knife of COBOL String Analysis
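INSPECT TALLYING counts characters without modifying the source, making it safe for preliminary analysis. INSPECT REPLACING modifies characters in place and suits same-length substitutions; because the replacement must be exactly as long as the text it replaces, it cannot collapse runs of spaces or otherwise shorten the data. Using both together -- first to analyze, then to transform -- is a powerful pattern.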
2. Reference Modification Fills Gaps That UNSTRING Cannot
UNSTRING scans left to right and requires a known delimiter. When you need to scan right to left, test for a pattern at a specific position, or extract a substring of computed length, reference modification is the right tool. The state-code scan in this program could not have been written with UNSTRING alone.
3. Multi-Pass Parsing Is More Maintainable Than Single-Pass
By separating prefix removal, reference extraction, location extraction, and name assembly into distinct passes, each pass can be understood, tested, and modified independently. A single-pass approach that tried to extract all fields simultaneously would be far more complex and fragile.
4. Table-Driven Prefix Removal Scales Better Than Hard-Coded IF Chains
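Storing the known prefixes in a table means adding a new prefix requires only a data change: a new table entry plus updated WS-PREFIX-COUNT and OCCURS values. Because the table lives in WORKING-STORAGE, the program still has to be recompiled, but the procedural logic stays untouched, so the change is small and easy to review. With hard-coded IF statements, every new prefix format requires new procedural code and a correspondingly larger retest.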
5. STRING's ON OVERFLOW Is Essential for Variable-Length Assembly
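When assembling a merchant name from multiple parts, the total length may exceed the 30-character target field. STRING never writes past the end of the receiving field; it stops transferring once the field is full and the program keeps running. The ON OVERFLOW clause is how the program finds out that this happened. TXNPARSE codes CONTINUE there and keeps whatever portion of the name fits, but the same clause could set a truncation flag or increment a counter for the audit trail.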
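A minimal sketch of detecting truncation follows. The program name OVFDEMO, the flag WS-TRUNCATED, and the deliberately short 10-character target are assumptions chosen so that the overflow is easy to see; TXNPARSE itself takes no action in its ON OVERFLOW branch.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. OVFDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-PART-1     PIC X(10) VALUE "WALMART".
       01  WS-PART-2     PIC X(10) VALUE "SUPERCTR".
      * Deliberately short target so the name cannot fit.
       01  WS-TARGET     PIC X(10) VALUE SPACES.
       01  WS-PTR        PIC 9(3)  VALUE 1.
      * Hypothetical flag; TXNPARSE does not record truncation.
       01  WS-TRUNCATED  PIC X     VALUE "N".
       PROCEDURE DIVISION.
           MOVE SPACES TO WS-TARGET
           MOVE 1 TO WS-PTR
           STRING WS-PART-1 DELIMITED BY SPACE
                  " "       DELIMITED BY SIZE
                  WS-PART-2 DELIMITED BY SPACE
               INTO WS-TARGET
               WITH POINTER WS-PTR
      * Runs only when some sending data did not fit.
               ON OVERFLOW
                   MOVE "Y" TO WS-TRUNCATED
           END-STRING
           DISPLAY "[" WS-TARGET "] TRUNCATED=" WS-TRUNCATED
           STOP RUN.
```

The target keeps the characters that fit, and the flag gives downstream steps a way to tell a truncated name from a complete one.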
Discussion Questions
- The program uses PERFORM VARYING to scan character by character through the description. COBOL does not have a built-in "indexOf" function. How does reference modification compensate for this? What would the performance impact be on 3.2 million daily transactions?
- The state code validation uses a table of 50 entries searched sequentially. How would you optimize this for performance? Could you use SEARCH ALL, and if so, what changes to the table definition would be required?
- The parser handles "REF#" and "CONF#" patterns but not all possible reference number formats. How would you extend it to detect reference numbers that follow no keyword pattern (such as alphanumeric codes embedded after asterisks)? What is the risk of false positives?
- The UNSTRING statement uses `DELIMITED BY ALL SPACES` to split the merchant name into words. What is the difference between `DELIMITED BY SPACE` and `DELIMITED BY ALL SPACES`? What would happen to the output if the wrong form were used?
- The program blanks out extracted sections of the working description to prevent them from appearing in the merchant name. What would happen if the extractions were done in a different order (for example, merchant name first, then location)? Why does order matter?
- The output record includes both the original description and the parsed fields. Why is this important for a production system? How would this support a reconciliation or audit process?
- The STRING statement's POINTER phrase is initialized to 1 before assembly. What would happen if the programmer forgot to initialize it? How does the WITH POINTER clause interact with multiple STRING statements in sequence?