In This Chapter
- 41.1 The Reality of Legacy COBOL
- 41.2 A Systematic Approach to Code Reading
- 41.3 Data Flow Analysis
- 41.4 Impact Analysis
- 41.5 Reverse Engineering Business Rules
- 41.6 Documentation Recovery
- 41.7 Tools for Code Archaeology
- 41.8 Common Legacy Patterns and Anti-Patterns
- 41.9 Working with Tribal Knowledge
- 41.10 GlobalBank: Archaeology of a 1987 Module
- 41.11 MedClaim: Reverse Engineering Adjudication Rules
- 41.12 Try It Yourself: Analyzing an Unknown Program
- 41.13 Call Graph Construction Methods
- 41.14 Data Dictionary Recovery
- 41.15 COBOL Cross-Reference Analysis
- 41.16 Working with Tribal Knowledge: Advanced Techniques
- 41.17 Try It Yourself: Building a Cross-Reference Report
- 41.18 Documentation Templates for Legacy Systems
- 41.19 MedClaim: The Full Archaeology Report
- 41.20 Dead Code Detection and Removal
- 41.21 GlobalBank: The Retirement Knowledge Transfer
- 41.22 Chapter Summary
Chapter 41: Legacy Code Archaeology
"The code is the documentation. Unfortunately, the code was written in 1987 by someone who thought comments were a waste of punch cards." — Maria Chen, opening a 4,200-line program with three comment lines
Every COBOL developer, at some point in their career, will face this moment: a production problem emerges in a program that nobody currently on the team wrote, nobody fully understands, and nobody has touched in years. The original developer retired. The documentation, if it ever existed, is either missing or so outdated that it describes a version of the program that no longer exists. The program works — it has been working for decades — but now something needs to change, and someone needs to understand what this code actually does.
That someone is you.
Legacy code archaeology is the systematic process of understanding undocumented code. It is not random reading — it is structured investigation. You do not start at line 1 and read to line 5,000. You start with questions: What does this program do? What files does it touch? What are the inputs and outputs? Where are the business rules? Then you use specific techniques to answer those questions efficiently.
This chapter teaches you those techniques. By the end, you will be able to pick up any COBOL program — no matter how old, how long, or how uncommented — and systematically extract an understanding of its purpose, its logic, and its behavior.
41.1 The Reality of Legacy COBOL
Let us be honest about what you will encounter. The average piece of production COBOL code is somewhere between 25 and 40 years old. Much of it was written before structured programming was widely adopted in COBOL shops, and much of it has been modified dozens of times by dozens of developers, each with their own style and their own understanding (or misunderstanding) of the original design.
What You Will Find
💡 Common Characteristics of Legacy COBOL:
- Paragraph names like 2000-PROCESS and 3000-PROCESSING (what is the difference?)
- Variables named WS-WRK-FLD-1 through WS-WRK-FLD-47
- GO TO statements creating spaghetti control flow
- PERFORM THRU paragraphs with fall-through logic
- Nested IF statements 10 levels deep with no scope terminators (pre-COBOL-85)
- Commented-out code that may or may not be relevant
- Multiple REDEFINES on the same data area for different record types
- COPY members that have been modified in place (different versions in different libraries)
What You Will Not Find
- Inline comments explaining business rules
- A design document that matches the current code
- Unit tests
- A developer who remembers why paragraph 4700-SPECIAL-CALC exists
⚠️ Critical Mindset: Do not judge the original developers. They wrote this code under constraints you may not understand — tight deadlines, limited disk space, compiler limitations, performance requirements that forced certain design choices. The goal is not to criticize the code but to understand it.
41.2 A Systematic Approach to Code Reading
Random reading is the enemy of understanding. When you open a 5,000-line COBOL program, reading from top to bottom is like reading a novel by starting at page 200 — you will understand individual sentences but miss the story. Instead, follow this systematic approach.
Step 1: Identify the Program's Purpose
Start with the external evidence — everything except the code itself:
- JCL: What datasets do the DD statements allocate? What DD names are used? What is the job name?
- Program name: Does the name suggest a function? (CLMADJ = claim adjudication, RPTMTLY = monthly report)
- File names: Input and output dataset names often reveal purpose
- Scheduling: When does this job run? What runs before and after it?
//CLMADJ EXEC PGM=CLMADJ01
//STEPLIB DD DSN=MEDCL.PROD.LOADLIB,DISP=SHR
//CLMIN DD DSN=MEDCL.CLAIMS.PENDING,DISP=SHR
//CLMOUT DD DSN=MEDCL.CLAIMS.ADJUDICATED,...
//PAYTBL DD DSN=MEDCL.PAYMENT.SCHEDULE,DISP=SHR
//ERRRPT DD SYSOUT=*
//CTLTOT DD DSN=MEDCL.CTL.CLMADJ,...
From this JCL alone, you can deduce: this program reads pending claims (CLMIN), produces adjudicated claims (CLMOUT), uses a payment schedule table (PAYTBL), writes error reports (ERRRPT), and produces control totals (CTLTOT). You know what it does before reading a single line of COBOL.
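You can partly automate this first pass. The sketch below (the member name, path, and contents are invented for illustration) pulls every DD name and its dataset out of a job so you can see the program's inputs and outputs at a glance:

```shell
# Create a small sample JCL member (hypothetical content, for illustration).
cat > /tmp/clmadj.jcl <<'EOF'
//CLMADJ   EXEC PGM=CLMADJ01
//STEPLIB  DD DSN=MEDCL.PROD.LOADLIB,DISP=SHR
//CLMIN    DD DSN=MEDCL.CLAIMS.PENDING,DISP=SHR
//CLMOUT   DD DSN=MEDCL.CLAIMS.ADJUDICATED,DISP=(NEW,CATLG)
//ERRRPT   DD SYSOUT=*
EOF

# List each DD name alongside its dataset (SYSOUT DDs have no DSN).
grep '^//' /tmp/clmadj.jcl | awk '/ DD /{
    dd = substr($1, 3)          # strip the leading //
    dsn = ""
    for (i = 2; i <= NF; i++)
        if ($i ~ /^DSN=/) { dsn = $i; sub(/^DSN=/, "", dsn); sub(/,.*/, "", dsn) }
    printf "%-8s %s\n", dd, dsn
}'
```

The same pipeline works against an entire JCL library, which gives you a crude but fast dataset-to-program map.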
Step 2: Map the Data Division
The DATA DIVISION tells you what the program works with. Focus on:
File Section: Identify all files — their record layouts tell you what data flows in and out.
Working-Storage: Identify the major groups:
- Control flags and switches (88-level items)
- Counters and accumulators
- Work areas and intermediate fields
- Tables and arrays (OCCURS)
- Constants and parameters
*  Look for meaningful group names:
01  WS-CLAIM-WORK-AREA.     *> claim processing fields
01  WS-PAYMENT-CALC.        *> payment calculation
01  WS-ERROR-HANDLING.      *> error management
01  WS-CONTROL-TOTALS.      *> control totals
01  WS-FLAGS.               *> processing flags
01  WS-TABLE-AREAS.         *> lookup tables
📊 Data Division Survey Checklist:
- [ ] How many files? Input vs. output?
- [ ] How many record types (REDEFINES on FD records)?
- [ ] What are the key fields (account numbers, claim numbers)?
- [ ] What flags control processing flow (88-level items)?
- [ ] What tables are loaded (OCCURS DEPENDING ON)?
- [ ] Where are the financial totals (COMP-3 accumulators)?
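Several of the checklist questions can be answered with quick counts before you read a single paragraph of logic. A minimal sketch, using an invented scratch file to stand in for the real program:

```shell
# Build a tiny stand-in COBOL fragment to survey (hypothetical content).
cat > /tmp/survey.cbl <<'EOF'
       SELECT CLAIM-FILE ASSIGN TO CLMIN.
       SELECT ADJUD-FILE ASSIGN TO CLMOUT.
       01  WS-FLAGS.
           05  WS-EOF-FLAG        PIC X.
               88  WS-EOF         VALUE 'Y'.
           05  WS-TOTAL-PAID      PIC S9(9)V99 COMP-3.
           05  WS-RATE-TABLE OCCURS 50 TIMES PIC S9(5)V99.
EOF

# One count per checklist question.
echo "Files (SELECT):         $(grep -c ' SELECT ' /tmp/survey.cbl)"
echo "Condition names (88):   $(grep -c ' 88 '     /tmp/survey.cbl)"
echo "Tables (OCCURS):        $(grep -c 'OCCURS'   /tmp/survey.cbl)"
echo "Packed fields (COMP-3): $(grep -c 'COMP-3'   /tmp/survey.cbl)"
```

The counts are approximate (a `SELECT` inside a comment would be counted too), but they give you the shape of the program in seconds.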
Step 3: Trace the Main Control Flow
Look at the PROCEDURE DIVISION's main paragraph:
0000-MAIN-CONTROL.
PERFORM 1000-INITIALIZE
PERFORM 2000-PROCESS UNTIL WS-EOF
PERFORM 9000-TERMINATE
STOP RUN.
This gives you the skeleton. Now trace each PERFORM one level deep:
2000-PROCESS.
PERFORM 2100-READ-INPUT
IF WS-VALID-RECORD
PERFORM 2200-VALIDATE
IF WS-PASSES-VALIDATION
PERFORM 2300-ADJUDICATE
PERFORM 2400-CALCULATE-PAYMENT
PERFORM 2500-WRITE-OUTPUT
ELSE
PERFORM 2600-WRITE-ERROR
END-IF
END-IF.
Now you have the program's story: read a claim, validate it, adjudicate it, calculate payment, write the result. The details are in the sub-paragraphs, but you understand the narrative.
Step 4: Build a Call Graph
A call graph shows which paragraphs call which other paragraphs. For a large program, this is essential:
0000-MAIN-CONTROL
├── 1000-INITIALIZE
│ ├── 1100-OPEN-FILES
│ ├── 1200-LOAD-TABLES
│ └── 1300-READ-PARMS
├── 2000-PROCESS
│ ├── 2100-READ-INPUT
│ ├── 2200-VALIDATE
│ │ ├── 2210-CHECK-MEMBER
│ │ ├── 2220-CHECK-PROVIDER
│ │ └── 2230-CHECK-ELIGIBILITY
│ ├── 2300-ADJUDICATE
│ │ ├── 2310-DETERMINE-COVERAGE
│ │ ├── 2320-APPLY-DEDUCTIBLE
│ │ ├── 2330-APPLY-COPAY
│ │ └── 2340-APPLY-LIMITS
│ ├── 2400-CALCULATE-PAYMENT
│ └── 2500-WRITE-OUTPUT
└── 9000-TERMINATE
├── 9100-VERIFY-TOTALS
├── 9200-WRITE-REPORT
└── 9300-CLOSE-FILES
💡 Key Insight: You can build a call graph quickly by searching for PERFORM statements. In a well-structured program, the call graph reveals the entire business process at a glance. In a poorly structured program, it reveals exactly where the complexity hides.
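One quick way to start that search from the command line is to tally how often each paragraph is PERFORMed; the most-called paragraphs are usually the shared utilities. The file below is an invented stand-in:

```shell
# Minimal stand-in program (hypothetical) with a few PERFORMs.
cat > /tmp/flow.cbl <<'EOF'
       0000-MAIN.
           PERFORM 1000-INIT
           PERFORM 2000-PROCESS UNTIL WS-EOF
           PERFORM 9000-TERMINATE.
       2000-PROCESS.
           PERFORM 2100-READ
           PERFORM 2200-VALIDATE.
EOF

# Tally distinct PERFORM targets; heavily-called paragraphs surface first.
grep -o 'PERFORM  *[0-9][0-9A-Z-]*' /tmp/flow.cbl |
    awk '{print $2}' | sort | uniq -c | sort -rn
```

Section 41.13 builds this idea out into a full call-graph script.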
Step 5: Focus on the Business Logic
The business rules live in the detail paragraphs — the ones that do the actual calculations, validations, and decisions. These are the paragraphs you need to understand most carefully:
2310-DETERMINE-COVERAGE.
* This is where the money is.
* This EVALUATE determines how much the plan pays.
EVALUATE CLM-PLAN-TYPE ALSO CLM-SVC-CATEGORY
WHEN 'HMO' ALSO 'PREV' ...
WHEN 'PPO' ALSO 'SPEC' ...
WHEN OTHER ...
END-EVALUATE.
41.3 Data Flow Analysis
Understanding where data comes from and where it goes is often more important than understanding the procedural logic. Data flow analysis answers the question: for a given field, what value does it hold, and how did it get there?
Forward Tracing
Start with an input field and trace it forward through the program:
CLM-CHARGED-AMOUNT (input record)
→ MOVE to WS-CHARGED-AMT (working storage)
→ Used in COMPUTE WS-ALLOWED-AMT (2320-APPLY-DEDUCTIBLE)
→ WS-ALLOWED-AMT used in COMPUTE WS-PLAN-PAYS (2340-APPLY-LIMITS)
→ WS-PLAN-PAYS MOVE to OUT-PAYMENT-AMOUNT (output record)
This trace shows you exactly how the input charge becomes the output payment.
Backward Tracing
Start with an output field and trace it backward to find its origin:
OUT-PAYMENT-AMOUNT (what we need to understand)
← MOVE from WS-PLAN-PAYS
← COMPUTED in 2340-APPLY-LIMITS
← Uses WS-ALLOWED-AMT and WS-PLAN-PCT
← WS-ALLOWED-AMT computed in 2320-APPLY-DEDUCTIBLE
← Uses WS-CHARGED-AMT minus WS-DEDUCTIBLE-AMT
← WS-CHARGED-AMT from CLM-CHARGED-AMOUNT (input)
Using grep for Data Flow
On a Unix/Linux system or with GnuCOBOL, you can use grep to quickly trace a field:
# Find every reference to a field
grep -n "WS-ALLOWED-AMT" CLMADJ01.cbl
# Find where a field is set (MOVE, COMPUTE, ADD, etc.)
grep -n "TO WS-ALLOWED-AMT\|WS-ALLOWED-AMT =" CLMADJ01.cbl
grep -n "COMPUTE WS-ALLOWED-AMT" CLMADJ01.cbl
# Find where a field is used (but not set)
grep -n "WS-ALLOWED-AMT" CLMADJ01.cbl | grep -v "TO WS-ALLOWED-AMT"
On the mainframe, ISPF's FIND command (or SDSF) serves the same purpose:
Command ===> FIND WS-ALLOWED-AMT ALL
📊 Data Flow Symbols
When documenting data flow, use these conventions:
- → Direct MOVE
- ⇒ Computed from (COMPUTE, ADD, SUBTRACT)
- ? Conditional assignment (IF/EVALUATE)
- ⟲ Loop accumulation (ADD within PERFORM loop)
- ⊗ Unchanged (pass-through from input to output)
41.4 Impact Analysis
Impact analysis answers the question: if I change this field/paragraph/copybook, what else is affected?
Field Impact
To assess the impact of changing a field:
- Find all references in the current program (grep or FIND)
- Find all COPY members that define the field
- Find all programs that COPY the same copybook
- Find all JCL that uses the same file
# Find all programs that use a copybook
grep -l "COPY CLAIMCPY" /path/to/source/*.cbl
# Find all JCL that references a dataset
grep -l "MEDCL.CLAIMS.PENDING" /path/to/jcl/*.jcl
Paragraph Impact
Changing a paragraph affects:
- Every paragraph that PERFORMs it
- Every field it modifies (downstream effects)
- Any paragraph it PERFORMs (if you change how it calls them)
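The "who PERFORMs it" question is a one-line search. A sketch against an invented stand-in file (the paragraph and field names below are hypothetical):

```shell
# Stand-in program (hypothetical) to search for callers.
cat > /tmp/impact.cbl <<'EOF'
       2000-PROCESS.
           PERFORM 2320-APPLY-DEDUCTIBLE.
       2300-ADJUDICATE.
           PERFORM 2320-APPLY-DEDUCTIBLE.
       2320-APPLY-DEDUCTIBLE.
           SUBTRACT WS-DEDUCTIBLE-AMT FROM WS-CHARGED-AMT
               GIVING WS-ALLOWED-AMT.
EOF

# Callers: every line that PERFORMs the paragraph, with line numbers.
grep -n 'PERFORM 2320-APPLY-DEDUCTIBLE' /tmp/impact.cbl
```

Each match is a paragraph you must re-examine after the change; the fields the paragraph sets (here WS-ALLOWED-AMT) seed the downstream trace.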
Ripple Effect Analysis
Change: Modify WS-DEDUCTIBLE calculation in 2320-APPLY-DEDUCTIBLE
Direct impact:
- WS-DEDUCTIBLE-AMT changes
- WS-ALLOWED-AMT changes (depends on WS-DEDUCTIBLE-AMT)
Downstream impact:
- WS-PLAN-PAYS changes (depends on WS-ALLOWED-AMT)
- OUT-PAYMENT-AMOUNT changes (depends on WS-PLAN-PAYS)
- WS-TOTAL-PAYMENTS accumulator changes
- Control total report changes
- Payment file downstream systems affected
Indirect impact:
- GL reconciliation may fail (different total)
- Provider payment amounts change
- EOB (Explanation of Benefits) amounts change
- Regulatory reports affected
⚠️ The Iceberg Principle: The visible change (modifying one paragraph) is the tip. The downstream, indirect, and cross-program impacts are the iceberg beneath the surface. Impact analysis maps the entire iceberg before you start changing code.
41.5 Reverse Engineering Business Rules
The most valuable output of code archaeology is a business rule catalog — a plain-language description of every decision the program makes.
Extracting Rules from EVALUATE
EVALUATE statements are the richest source of business rules:
EVALUATE CLM-PLAN-TYPE
ALSO CLM-SVC-CATEGORY
ALSO CLM-NETWORK-STATUS
WHEN 'HMO' ALSO 'PREV' ALSO 'IN '
MOVE 100 TO WS-COVERAGE-PCT
MOVE 0 TO WS-COPAY
WHEN 'HMO' ALSO 'PREV' ALSO 'OUT'
MOVE 0 TO WS-COVERAGE-PCT
MOVE 0 TO WS-COPAY
Extracted business rules:

| # | Rule | Plan | Service | Network | Coverage | Copay |
|---|------|------|---------|---------|----------|-------|
| BR-001 | HMO preventive in-network | HMO | Preventive | In | 100% | $0 |
| BR-002 | HMO preventive out-of-network | HMO | Preventive | Out | 0% | $0 |
Extracting Rules from IF Statements
Nested IF statements encode business rules too, but they are harder to extract:
IF CLM-AMOUNT > 50000
IF CLM-PRE-AUTH = 'Y'
IF CLM-AUTH-DAYS-REMAINING > 0
PERFORM 2500-PROCESS-LARGE-CLAIM
ELSE
MOVE 'AUTH-EXP' TO WS-DENY-REASON
PERFORM 2600-DENY-CLAIM
END-IF
ELSE
MOVE 'NO-AUTH' TO WS-DENY-REASON
PERFORM 2600-DENY-CLAIM
END-IF
ELSE
PERFORM 2500-PROCESS-NORMAL-CLAIM
END-IF
Extracted business rules:
- BR-010: Claims over $50,000 require pre-authorization
- BR-011: Pre-authorization must not be expired (days remaining > 0)
- BR-012: Claims of $50,000 or less do not require pre-authorization
- BR-013: Denial reasons: AUTH-EXP (expired authorization), NO-AUTH (no authorization on file)
Building a Business Rule Catalog
Document each extracted rule in a standard format:
Rule ID: BR-010
Source: CLMADJ01.cbl, paragraph 2400-CHECK-AUTH, line 847
Description: Claims with charged amount exceeding $50,000
require valid pre-authorization
Condition: CLM-AMOUNT > 50000
Action: If no pre-auth, deny with reason NO-AUTH
If pre-auth expired, deny with reason AUTH-EXP
If pre-auth valid, process as large claim
Dependencies: CLM-AMOUNT (input), CLM-PRE-AUTH (input),
CLM-AUTH-DAYS-REMAINING (calculated in 2350)
Last Modified: Unknown (no change history)
Verified By: [analyst initials and date]
✅ Best Practice: Have a business analyst review your extracted rules. Code tells you what the program does, but only a domain expert can confirm that what it does is correct. You may discover that the code implements a rule that is no longer valid, or implements it differently than the business intends.
41.6 Documentation Recovery
Documentation recovery is the process of creating documentation for a system that has none. It is different from documentation writing — you are not designing something new; you are describing something that already exists.
The Documentation Stack
Build documentation from the bottom up:
- Data Dictionary: Every field, its type, its purpose, its valid values
- Program Inventory: Every program, its purpose, its inputs and outputs
- Call Graph / Program Flow: How programs call each other
- Business Rule Catalog: Every decision the system makes
- Job Stream Map: How jobs relate and depend on each other
- System Overview: High-level architecture and data flow
Data Dictionary Template
Field Name: CLM-CHARGED-AMOUNT
Program: CLMADJ01
Copybook: CLAIMCPY
PIC: S9(07)V99 COMP-3
Description: Total amount charged by provider for the
service. Used as the starting point for
payment calculation.
Valid Range: 0.01 to 9,999,999.99
Source: Input file CLMIN (MEDCL.CLAIMS.PENDING)
Derived From: Provider submission (EDI 837 or paper)
Used In: 2300-ADJUDICATE (calculate allowed amount)
2400-CALCULATE-PAYMENT (determine plan pays)
9100-VERIFY-TOTALS (accumulate for reconciliation)
Related Fields: WS-ALLOWED-AMT, WS-PLAN-PAYS,
OUT-PAYMENT-AMOUNT
Program Inventory Template
Program ID: CLMADJ01
Description: Claims adjudication - applies benefit
rules to pending claims and calculates
payment amounts
Language: COBOL (Enterprise COBOL 5.2)
Lines of Code: 4,247
Input Files: CLMIN (claims pending),
PAYTBL (payment schedule)
Output Files: CLMOUT (adjudicated claims),
ERRRPT (error report),
CTLTOT (control totals)
Called By: JCL CLMADJ step in MEDCLAIM nightly batch
Calls: DATECALC (date calculation subprogram)
Copybooks: CLAIMCPY, PAYCPY, MEMBCPY, ERRCPY
DB2 Tables: None (file-based)
Key Business Rules: BR-001 through BR-047
Last Modified: 2019-04-17 (per Endevor history)
Modified By: J. Ramirez (no longer with company)
41.7 Tools for Code Archaeology
IBM Application Discovery and Delivery Intelligence (ADDI)
IBM ADDI (formerly IBM Application Discovery) automatically analyzes COBOL programs and produces:
- Program call graphs
- Data flow diagrams
- Cross-reference reports
- Dead code identification
- Complexity metrics
COBOL Analyzers
Several vendor tools provide static analysis:
- Micro Focus (OpenText) Enterprise Analyzer: Cross-reference, flow analysis, impact analysis
- Compuware (BMC) Topaz: Program visualization, data lineage
- Sonar COBOL Plugin: Code quality metrics, complexity measurement
Command-Line Techniques
When you do not have access to commercial tools, command-line utilities are surprisingly effective:
# Count paragraphs
grep -c "^ [0-9A-Z].*\.$" program.cbl
# List all paragraph names
grep "^ [0-9A-Z].*\.$" program.cbl
# Find all PERFORM statements
grep "PERFORM " program.cbl
# Find all file I/O operations
grep -E "READ |WRITE |REWRITE |DELETE |START " program.cbl
# Find all MOVE statements for a specific field
grep "MOVE.*TO WS-AMOUNT\|MOVE WS-AMOUNT" program.cbl
# Find GO TO statements (potential spaghetti code)
grep "GO TO" program.cbl
# Count lines of code (excluding comments and blanks)
grep -v "^......\*\|^$" program.cbl | wc -l
# Find all COPY statements
grep "COPY " program.cbl
# Find EVALUATE statements (business rule locations)
grep -c "EVALUATE" program.cbl
# Find all 88-level items (flags and conditions)
grep "88 " program.cbl
Building a Cross-Reference
A cross-reference lists every variable and every paragraph, showing where each is defined and referenced:
#!/bin/bash
# Simple cross-reference generator
PROGRAM=$1
echo "=== Cross-Reference for $PROGRAM ==="
# Extract all Working-Storage variables
echo "--- Variables ---"
grep "05 \|10 \|15 " "$PROGRAM" | \
awk '{print $2}' | sort -u | while read VAR; do
COUNT=$(grep -c "$VAR" "$PROGRAM")
echo " $VAR ($COUNT references)"
done
# Extract all paragraphs
echo "--- Paragraphs ---"
grep "^ [0-9A-Z].*\.$" "$PROGRAM" | while read PARA; do
NAME=$(echo "$PARA" | awk '{print $1}' | tr -d '.')
CALLS=$(grep -c "PERFORM $NAME" "$PROGRAM")
echo " $NAME (called $CALLS times)"
done
41.8 Common Legacy Patterns and Anti-Patterns
Knowing what to look for accelerates your understanding. Here are patterns you will encounter repeatedly:
Pattern: The Status Code
Legacy programs often use numeric status codes instead of named conditions:
MOVE 1 TO WS-STATUS
* What does status 1 mean? You must find where
* WS-STATUS is tested to understand.
...
IF WS-STATUS = 1
PERFORM 3000-NORMAL
ELSE IF WS-STATUS = 2
PERFORM 4000-ERROR
ELSE IF WS-STATUS = 3
PERFORM 5000-SKIP
Archaeology technique: Search for every reference to WS-STATUS. Map each value to its meaning. Create a legend.
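A hedged starting point for that search (the file and its contents are invented stand-ins): list every line that sets the status, then every line that tests it, and pair the two lists into a legend.

```shell
# Stand-in fragment (hypothetical) that uses a numeric status code.
cat > /tmp/status.cbl <<'EOF'
           MOVE 1 TO WS-STATUS.
           MOVE 2 TO WS-STATUS.
           IF WS-STATUS = 1
               PERFORM 3000-NORMAL.
           IF WS-STATUS = 2
               PERFORM 4000-ERROR.
EOF

# Where each value is assigned...
grep -n 'MOVE .* TO WS-STATUS' /tmp/status.cbl
# ...and where each value is tested.
grep -n 'WS-STATUS = ' /tmp/status.cbl
```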
Anti-Pattern: PERFORM THRU with Fall-Through
PERFORM 2000-START THRU 2999-END.
2000-START.
... some logic ...
2100-MIDDLE.
... more logic ...
2999-END.
EXIT.
The PERFORM THRU executes every paragraph from 2000-START through 2999-END sequentially. Any paragraph in that range is part of the PERFORM, even if it looks independent. This makes it dangerous to insert new paragraphs in the range.
Archaeology technique: Identify all PERFORM THRU ranges. Mark the start and end paragraphs. Be extremely careful not to add or remove paragraphs within the range.
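Finding every THRU range is itself a one-line search. A sketch, using an invented scratch file:

```shell
# Stand-in fragment (hypothetical) with PERFORM THRU ranges.
cat > /tmp/thru.cbl <<'EOF'
           PERFORM 2000-START THRU 2999-END.
           PERFORM 5000-CALC THRU 5099-CALC-EXIT.
EOF

# Each match marks a range whose interior paragraphs must not be disturbed.
grep -n 'PERFORM .* THRU ' /tmp/thru.cbl
```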
Anti-Pattern: GO TO Spaghetti
1000-PROCESS.
IF WS-TYPE = 'A'
GO TO 3000-TYPE-A
END-IF
IF WS-TYPE = 'B'
GO TO 4000-TYPE-B
END-IF
GO TO 5000-DEFAULT.
3000-TYPE-A.
...
GO TO 6000-CONTINUE.
4000-TYPE-B.
...
IF WS-SPECIAL = 'Y'
GO TO 3000-TYPE-A
END-IF
GO TO 6000-CONTINUE.
Archaeology technique: Draw a GO TO flow diagram. Identify all entry points and exit points for each paragraph. This is the only way to understand the actual control flow.
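The raw edge list for that diagram can be extracted mechanically; the drawing still has to be done by hand or fed into a graphing tool. A sketch against an invented stand-in file:

```shell
# Stand-in fragment (hypothetical) with GO TO branches.
cat > /tmp/goto.cbl <<'EOF'
       1000-PROCESS.
           GO TO 3000-TYPE-A.
       3000-TYPE-A.
           GO TO 6000-CONTINUE.
       4000-TYPE-B.
           GO TO 3000-TYPE-A.
EOF

# One line per GO TO edge: source line number -> target paragraph.
grep -n 'GO TO ' /tmp/goto.cbl |
    awk -F: '{n = split($2, w, " "); sub(/\.$/, "", w[n]);
              print "line " $1 " -> " w[n]}'
```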
Pattern: The Working Storage Calculator
01 WS-WORK-1 PIC S9(11)V99 COMP-3.
01 WS-WORK-2 PIC S9(11)V99 COMP-3.
01 WS-WORK-3 PIC S9(11)V99 COMP-3.
01 WS-WORK-4 PIC S9(11)V99 COMP-3.
These generic work fields are reused throughout the program for different purposes. WS-WORK-1 might hold a balance in one paragraph and a payment amount in another.
Archaeology technique: Trace each work field through every paragraph that uses it. Document what it holds at each point. Flag places where it is repurposed.
Pattern: The Implicit Record Type
01 INPUT-RECORD.
05 INP-REC-TYPE PIC X(02).
05 INP-DATA PIC X(198).
01 INP-HEADER REDEFINES INPUT-RECORD.
05 FILLER PIC X(02).
05 HDR-DATE PIC 9(08).
05 HDR-SOURCE PIC X(10).
01 INP-DETAIL REDEFINES INPUT-RECORD.
05 FILLER PIC X(02).
05 DTL-ACCOUNT PIC X(10).
05 DTL-AMOUNT PIC S9(09)V99 COMP-3.
01 INP-TRAILER REDEFINES INPUT-RECORD.
05 FILLER PIC X(02).
05 TRL-RECORD-COUNT PIC 9(08).
05 TRL-TOTAL-AMOUNT PIC S9(13)V99 COMP-3.
The first two bytes determine which REDEFINES to use. This is a common and legitimate pattern, but it can be confusing if you do not recognize it.
Archaeology technique: Search for all REDEFINES on the FD record. Map each record type code to its corresponding REDEFINES. This reveals the file format.
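That search is a single grep over the source. The sketch below uses a stand-in file built from the layout above:

```shell
# Stand-in record layout (hypothetical file) with multiple REDEFINES.
cat > /tmp/redef.cbl <<'EOF'
       01  INPUT-RECORD.
           05  INP-REC-TYPE   PIC X(02).
       01  INP-HEADER  REDEFINES INPUT-RECORD.
       01  INP-DETAIL  REDEFINES INPUT-RECORD.
       01  INP-TRAILER REDEFINES INPUT-RECORD.
EOF

# Every alternate layout of the record, with its line number.
grep -n 'REDEFINES' /tmp/redef.cbl
```

Combine the output with the record-type values (here the first two bytes, INP-REC-TYPE) and you have documented the file format.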
41.9 Working with Tribal Knowledge
In many organizations, the most important documentation is not written down — it exists only in the heads of experienced developers, operators, and business users. This is tribal knowledge, and capturing it before people retire or leave is one of the most urgent challenges in mainframe shops.
Interviewing Techniques
When you have access to someone who knows the system, use structured interview techniques:
Start broad:
- "What does this system do in business terms?"
- "What would happen if this program stopped running?"
- "What are the most common problems with this program?"
Then narrow:
- "Why does paragraph 4700 exist? What special case does it handle?"
- "This EVALUATE has 47 WHEN clauses. Are all of them still valid?"
- "I see the deductible is calculated differently for plan type 'X'. Why?"
Record everything:
- Take detailed notes
- Record the conversation (with permission)
- Follow up with a written summary for the interviewee to review
The Knowledge Transfer Checklist
When a senior developer announces retirement, prioritize knowledge transfer for:
- Programs that only they maintain (single points of knowledge)
- Programs with no documentation (no alternative information source)
- Programs with unusual behavior ("Oh, that program has a special mode for leap years that is not documented anywhere")
- Production incident history ("In 2019, we had to add paragraph 4700 because...")
- Business rules that are not obvious from the code ("The regulatory requirement changed but we did not update the comments")
🔴 The Retirement Risk: The U.S. Government Accountability Office (GAO) has repeatedly warned about the risk of mainframe knowledge loss. The average COBOL developer is over 55 years old. When they retire, they take decades of context with them. Knowledge transfer is not optional — it is a business continuity issue.
41.10 GlobalBank: Archaeology of a 1987 Module
Maria Chen received a request to modify a program called GLRECON — the general ledger reconciliation module. The program had been written in 1987, modified sporadically through the 1990s, and untouched since 2003. It was 3,800 lines of COBOL with 12 comment lines (all in the IDENTIFICATION DIVISION).
Maria's Approach
Day 1: External Evidence
Maria started with the JCL:
- Input: GBANK.GL.DAILY (general ledger transactions), GBANK.ACCT.MASTER (account master)
- Output: GBANK.GL.RECONCILED, GBANK.GL.EXCEPTIONS, a SYSOUT report
- Run position: Step 7 in the nightly batch, after interest calculation, before statement generation
This told her: GLRECON compares account balances against the general ledger, finds discrepancies, and writes exceptions for investigation.
Day 2: Data Division Survey
She cataloged the data structures:
- 4 input/output files
- 47 working-storage variables (many with unhelpful names like WS-A1, WS-A2, WS-B1)
- 3 tables loaded from a parameter file
- 23 flags (88-level items)
Day 3: Call Graph
She built the call graph and found the program was mostly well-structured, with one exception: paragraphs 4700 through 4799 were a PERFORM THRU block that handled "special reconciliation adjustments" for a set of account types that had different GL mappings.
Day 4: Business Rule Extraction
The most complex paragraph was 3200-COMPARE-BALANCES. It contained an EVALUATE with 31 WHEN clauses, each mapping an account type to a GL category. Some of the account types had comments from the original 1987 development; others had been added later with no comments at all.
Maria created a business rule matrix:
| Account Type | GL Category | Reconciliation Rule | Added |
|---|---|---|---|
| CHK | 1000 | Balance must match within $0.01 | 1987 |
| SAV | 1010 | Balance must match within $0.01 | 1987 |
| CD | 1020 | Balance must match exactly | 1987 |
| MMA | 1030 | Balance must match within $0.01 | 1993 |
| IRA | 1040 | Special accrual adjustment | 1997 |
| HSA | 1050 | Separate reconciliation path | 2003 |
Day 5: The Mystery of Paragraph 4700
Paragraph 4700-SPECIAL-RECON contained logic that nobody on the current team understood. It read a parameter file with a list of account numbers and applied a fixed adjustment to their GL balances before comparison. Maria found 15 account numbers in the parameter file, some belonging to accounts that had been closed for years.
She tracked down Harold Mercer, a retired developer who had worked at GlobalBank in the 1990s. Through a phone call (arranged by the HR department's alumni network), she learned: "Those are accounts that were involved in a system conversion in 1995. The conversion left a rounding difference on each account, and rather than fix the underlying data, we added the adjustment to the reconciliation program. We always meant to go back and fix it properly."
That conversation took 20 minutes and saved Maria weeks of confusion.
The Outcome
Maria produced:
1. A 24-page documentation package for GLRECON
2. A business rule catalog with 31 reconciliation rules
3. A recommendation to remove the 4700-SPECIAL-RECON logic (the 15 accounts no longer existed)
4. Her original modification (adding a new GL category for a new account type), completed confidently because she now understood the system
⚖️ Theme — The Human Factor: Maria's archaeology of GLRECON illustrates the human factor theme perfectly. The code was the easy part — she could read COBOL. The hard part was understanding why the code was written that way. That answer was in Harold Mercer's memory, not in the source code.
41.11 MedClaim: Reverse Engineering Adjudication Rules
James Okafor faced a different challenge. MedClaim's adjudication program, CLM-ADJUD, had accumulated business rules over 20 years. Different developers had added rules at different times, and there was no single document listing all the adjudication rules currently in effect. The compliance team needed a complete catalog for a regulatory audit.
The Scale of the Problem
CLM-ADJUD was 7,200 lines of COBOL. It contained:
- 14 EVALUATE statements
- 47 nested IF blocks
- 23 PERFORM THRU ranges
- 112 distinct business rules (as it turned out)
Sarah Kim's Contribution
Sarah Kim, the business analyst, worked alongside James. For each rule James extracted from the code, Sarah verified it against the current benefit plan documentation:
- 82 rules matched the current documentation exactly
- 19 rules were implemented correctly but undocumented
- 7 rules implemented a requirement that had since been superseded
- 4 rules were incorrect (implemented a misunderstood requirement)
The 7 superseded rules were legacy artifacts — they had been correct when written but the regulations had changed without the code being updated. Because the rules were lenient (they approved claims that should have been denied under new rules), they had not caused visible problems, but they represented a compliance risk.
The 4 incorrect rules were a discovery that James described as "finding a ticking time bomb." One of them incorrectly calculated the out-of-pocket maximum for a specific plan type, potentially exposing MedClaim to regulatory penalties.
The Archaeology Process
James used a structured approach:
- Pass 1 — EVALUATE extraction: Documented every EVALUATE statement and its WHEN clauses
- Pass 2 — IF extraction: Documented every nested IF block (focusing on those containing financial calculations)
- Pass 3 — Cross-reference: Linked each rule to its input fields and output effects
- Pass 4 — Business verification: Sarah verified each rule against plan documentation
- Pass 5 — Gap analysis: Identified rules in the plan documentation that had no corresponding code
The complete catalog took three weeks to produce. James estimated it would have taken three months without Sarah's domain knowledge.
🔵 MedClaim Lesson: Code archaeology is most effective when a technical person (who can read the code) works alongside a domain expert (who knows the business). Neither one alone can produce a complete and accurate understanding.
41.12 Try It Yourself: Analyzing an Unknown Program
Student Lab Exercise
The code directory for this chapter contains a COBOL program called MYSTERY.cbl. It is 800 lines of intentionally undocumented code with unhelpful variable names and no comments (beyond the PROGRAM-ID).
Your assignment:
- Do NOT read the program from top to bottom. Follow the systematic approach from this chapter.
- Start with the PROCEDURE DIVISION's main control paragraph. What is the program's structure?
- Build a call graph showing all PERFORM relationships.
- Survey the DATA DIVISION. How many files? What are the record layouts?
- Trace the data flow for the primary input field through to the primary output field.
- Extract at least 5 business rules from EVALUATE and IF statements.
- Write a one-page summary describing what the program does, as if you were explaining it to a new team member.
Time limit: 2 hours. This simulates a real-world scenario where you need to quickly understand an unfamiliar program.
41.13 Call Graph Construction Methods
Building a call graph is one of the first steps in code archaeology (section 41.2, Step 4). For small programs, you can trace PERFORM statements manually. For large systems — thousands of programs across hundreds of copybooks — you need systematic methods.
Method 1: grep-Based Call Graph
The simplest automated approach uses grep to extract PERFORM statements and build a tree:
#!/bin/bash
# call-graph.sh — Build a call graph from a COBOL program
# Usage: ./call-graph.sh PROGRAM.cbl
PROGRAM=$1
echo "=== Call Graph for $PROGRAM ==="
echo ""
# Extract all paragraph names (Area A, starting in column 8, ending with a period)
echo "--- Paragraph Definitions ---"
grep -n "^ [0-9A-Z][0-9A-Z-]*\." "$PROGRAM" | \
sed 's/\.$//' | \
awk -F: '{printf " Line %4d: %s\n", $1, $2}'
echo ""
echo "--- PERFORM Relationships ---"
# For each paragraph, find what it PERFORMs
grep -n "^ [0-9A-Z][0-9A-Z-]*\." "$PROGRAM" | \
sed 's/\.$//' | \
awk -F: '{print $2}' | \
while read PARA; do
PARA_NAME=$(echo "$PARA" | awk '{print $1}')
# Find the line range of this paragraph (to next paragraph)
START=$(grep -n "^ ${PARA_NAME}\." "$PROGRAM" | \
head -1 | cut -d: -f1)
# Find next paragraph start
END=$(awk -v start="$START" \
'NR > start && /^ [0-9A-Z][0-9A-Z-]*\./ \
{print NR; exit}' "$PROGRAM")
[ -z "$END" ] && END=$(wc -l < "$PROGRAM")
# Extract PERFORMs within this paragraph
PERFORMS=$(sed -n "${START},${END}p" "$PROGRAM" | \
grep "PERFORM " | \
grep -oP "PERFORM \K[0-9A-Z][0-9A-Z-]*" | \
sort -u)
if [ -n "$PERFORMS" ]; then
echo " $PARA_NAME"
echo "$PERFORMS" | while read P; do
echo " └── $P"
done
fi
done
This script produces output like:
--- PERFORM Relationships ---
0000-MAIN-CONTROL
└── 1000-INITIALIZATION
└── 2000-PROCESS
└── 9000-TERMINATION
2000-PROCESS
└── 2100-READ-INPUT
└── 2200-VALIDATE
└── 2300-ADJUDICATE
└── 2400-CALCULATE-PAYMENT
└── 2500-WRITE-OUTPUT
2200-VALIDATE
└── 2210-CHECK-MEMBER
└── 2220-CHECK-PROVIDER
└── 2230-CHECK-ELIGIBILITY
Method 2: The Compiler Cross-Reference
Enterprise COBOL produces a cross-reference listing when compiled with the XREF option:
//COMPILE EXEC PGM=IGYCRCTL,
// PARM='XREF(FULL),MAP,LIST,OFFSET'
The XREF listing shows, for every data name and procedure name, every line where it is referenced and whether the reference is a definition, modification, or use. This is the most authoritative call graph source because it comes directly from the compiler.
Cross-Reference of Procedures
Paragraph Defined References
0000-MAIN-CONTROL 47
1000-INITIALIZATION 53 P47
1100-OPEN-FILES 68 P55
1200-LOAD-TABLES 89 P56
2000-PROCESS 112 P48
2100-READ-INPUT 118 P113
2200-VALIDATE 125 P114
2210-CHECK-MEMBER 148 P126
...
The "P47" means "PERFORMed at line 47." This tells you that 1000-INITIALIZATION is called from line 47, which is within 0000-MAIN-CONTROL.
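This listing format is regular enough to process mechanically. Below is a minimal Python sketch (the function name and the bisect-based caller lookup are our own) that turns procedure XREF lines into caller-to-callee edges by resolving each Pnn site to the paragraph whose definition line most recently precedes it:

```python
import bisect

def parse_xref(lines):
    """Turn procedure XREF lines ('NAME DEFLINE Pnn ...') into a
    caller -> [callees] dictionary."""
    defs = []           # (definition line, paragraph name)
    perform_sites = []  # (line of the PERFORM, callee name)
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue    # skip headers and blank lines
        name = parts[0]
        defs.append((int(parts[1]), name))
        for ref in parts[2:]:
            if ref.startswith("P") and ref[1:].isdigit():
                perform_sites.append((int(ref[1:]), name))
    defs.sort()
    def_lines = [line_no for line_no, _ in defs]
    edges = {}
    for site, callee in perform_sites:
        # Caller = the paragraph whose definition line most
        # recently precedes the PERFORM site
        i = bisect.bisect_right(def_lines, site) - 1
        edges.setdefault(defs[i][1], []).append(callee)
    return edges

listing = [
    "0000-MAIN-CONTROL     47",
    "1000-INITIALIZATION   53  P47",
    "1100-OPEN-FILES       68  P55",
    "2000-PROCESS         112  P48",
]
print(parse_xref(listing))
```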
Method 3: IBM ADDI (Application Discovery)
For large-scale archaeology across thousands of programs, IBM ADDI automatically:
- Scans all COBOL programs in a library
- Builds inter-program call graphs (CALL statements across programs)
- Maps copybook usage across all programs
- Identifies dead code (paragraphs never PERFORMed)
- Produces visual dependency diagrams
The output is a web-based interactive graph where you can click on a program to see everything it calls and everything that calls it. For a system with 2,000 COBOL programs, this is the only practical approach to understanding the full architecture.
📊 Call Graph Complexity Metrics
| Metric | Description | Concern Level |
|---|---|---|
| Max depth | Deepest nesting of PERFORMs | > 8 levels = complex |
| Fan-out | Most paragraphs called by any single paragraph | > 10 = possibly doing too much |
| Fan-in | Most callers for any single paragraph | High fan-in = widely reused utility |
| Orphan paragraphs | Paragraphs never PERFORMed | Possible dead code |
| Circular references | A PERFORMs B PERFORMs A | Recursive logic (rare in COBOL) |
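These metrics can be computed directly from a caller-to-callees map such as the one produced by the grep script in Method 1. A Python sketch, assuming a single entry paragraph (the function name is ours):

```python
def call_graph_metrics(edges):
    """Compute call-graph complexity metrics from a
    caller -> [callees] dictionary."""
    all_paras = set(edges) | {c for cs in edges.values() for c in cs}
    callers = {p: [] for p in all_paras}
    for caller, callees in edges.items():
        for c in callees:
            callers[c].append(caller)

    def depth(p, seen):
        if p in seen:                       # circular PERFORM chain
            return float("inf")
        return 1 + max((depth(c, seen | {p}) for c in edges.get(p, [])),
                       default=0)

    callees_all = {c for cs in edges.values() for c in cs}
    root = next(iter(all_paras - callees_all))  # assumes one entry paragraph
    return {
        "max_depth": depth(root, set()),
        "fan_out": max(len(cs) for cs in edges.values()),
        "fan_in": max(len(cs) for cs in callers.values()),
        "orphans": sorted(p for p in all_paras
                          if not callers[p] and p != root),
    }

edges = {
    "0000-MAIN": ["1000-INIT", "2000-PROCESS", "9000-TERM"],
    "2000-PROCESS": ["2100-READ", "2200-VALIDATE"],
}
print(call_graph_metrics(edges))
```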
41.14 Data Dictionary Recovery
When no data dictionary exists, you must build one from the source code. This is tedious but invaluable — once complete, it becomes the authoritative reference for everyone working on the system.
Automated Data Dictionary Extraction
#!/bin/bash
# extract-data-dict.sh — Extract data items from COBOL source
# Produces a CSV file of all data items
# Note: requires GNU awk (gawk) for the three-argument match()
PROGRAM=$1
OUTPUT="${PROGRAM%.cbl}-data-dict.csv"
echo "Field Name,Level,PIC,Usage,Defined In,Line" > "$OUTPUT"
# Extract all data items with their PIC clauses
gawk '
/^ +[0-9][0-9] +[0-9A-Z]/ {
    level = $1
    name  = $2
    pic   = ""
    usage = "DISPLAY"    # default
    # Look for a PIC or PICTURE clause on this line
    if (match($0, /PIC(TURE)?[ \t]+([^ .]+)/, arr)) {
        pic = arr[2]
    }
    if ($0 ~ /COMP-3/)      usage = "COMP-3"
    else if ($0 ~ /COMP/)   usage = "COMP"
    else if ($0 ~ /BINARY/) usage = "BINARY"
    if (name != "FILLER") {
        printf "%s,%s,%s,%s,%s,%d\n", name, level, pic, usage,
               FILENAME, FNR
    }
}' "$PROGRAM" >> "$OUTPUT"
echo "Data dictionary written to $OUTPUT"
echo "$(($(wc -l < "$OUTPUT") - 1)) fields extracted"
The Data Dictionary Template
For each significant field, create a detailed entry:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
DATA DICTIONARY ENTRY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Field: CLM-OUT-OF-POCKET-ACCUM
Level: 05
Parent: CLM-MEMBER-BENEFITS
PIC: S9(07)V99 COMP-3
Bytes: 5
Description: Year-to-date out-of-pocket accumulator for
the member. Includes deductible payments,
copays, and coinsurance. Excludes premium
payments and non-covered services.
Valid Range: 0.00 to 99,999.99 (per plan year)
Reset: Set to 0.00 on January 1 (or plan anniversary)
by BNFRESET batch program.
Set By: CLM-ADJ paragraph 3400-UPDATE-ACCUMULATORS
CLM-VOID paragraph 2200-REVERSE-ACCUMULATORS
Used By: CLM-ADJ paragraph 2310-CHECK-OOP-MAX
ELIGCHK paragraph 4000-CALC-PATIENT-COST
BNFSTMT paragraph 3100-MEMBER-SUMMARY
Related: CLM-OOP-MAXIMUM (annual limit)
CLM-DEDUCTIBLE-ACCUM (subset of OOP)
Source: CLMCPY copybook
Programs: CLM-ADJ, CLM-VOID, ELIGCHK, BNFSTMT, BNFRESET
Notes: When OOP accumulator reaches OOP maximum,
plan pays 100% (no further patient cost).
This is a federal ACA requirement.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
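The Bytes value in an entry like this can be derived from the PIC and USAGE clauses. A rough Python sketch covering only the common forms (9, X, S, V, and (n) repeat factors); `field_bytes` is our own name:

```python
import re

def field_bytes(pic, usage="DISPLAY"):
    """Storage bytes for an elementary item. A sketch that covers
    only the common PIC forms: 9, X, S, V and (n) repeat factors."""
    # Expand repeat factors: S9(07)V99 -> S9999999V99
    expanded = re.sub(r"([9X])\((\d+)\)",
                      lambda m: m.group(1) * int(m.group(2)), pic)
    if "X" in expanded:                 # alphanumeric: one byte per X
        return expanded.count("X")
    digits = expanded.count("9")        # S and V occupy no storage here
    if usage == "COMP-3":               # packed: 2 digits/byte + sign nibble
        return (digits + 2) // 2
    if usage in ("COMP", "BINARY"):     # halfword / fullword / doubleword
        return 2 if digits <= 4 else 4 if digits <= 9 else 8
    return digits                       # DISPLAY with embedded sign

print(field_bytes("S9(07)V99", "COMP-3"))   # 5, matching the entry above
```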
💡 Key Insight: The most valuable part of a recovered data dictionary is the "Notes" field — the business context that is not captured in the PIC clause. At MedClaim, Sarah Kim contributed the notes for every field in the claims processing data dictionary, adding business context that no amount of code reading could reveal.
41.15 COBOL Cross-Reference Analysis
A cross-reference report is one of the most powerful tools for understanding legacy code. It tells you, for every name in the program, exactly where and how it is used.
Generating Cross-References
Enterprise COBOL (z/OS):
//COMPILE EXEC PGM=IGYCRCTL,
// PARM='XREF(FULL),VBREF,MAP,OFFSET'
- XREF(FULL) produces a cross-reference of data names and procedure names
- VBREF adds a verb cross-reference (every COBOL verb and where it is used)
- MAP produces a data map showing the offset and length of every field
GnuCOBOL:
cobc -x -t PROGRAM.lst PROGRAM.cbl
The -t option writes a compile listing; the cross-reference sections it includes are controlled by additional listing options that vary by GnuCOBOL release — check cobc --help for your version.
Reading the Cross-Reference
The data name cross-reference tells you which paragraphs modify each field:
Data Name Defn References
WS-PAYMENT-AMOUNT 047 M125 M237 R089 R142 R318 C415
^ ^ ^ ^ ^ ^
| | | | | |
| | | | | Compared at 415
| | | | Referenced at 318
| | | Referenced at 142
| | Referenced at 89
| Modified at 237
Modified at 125
Legend: M = Modified R = Referenced C = Compared
This immediately tells you that WS-PAYMENT-AMOUNT is set at lines 125 and 237, and used (read) at lines 89, 142, 318, and 415. If you need to understand how the payment amount is calculated, you examine lines 125 and 237. If you need to understand who depends on it, you examine lines 89, 142, 318, and 415.
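Splitting such a line into its M/R/C buckets is straightforward to script. A small Python sketch (the helper name is our own):

```python
def parse_data_xref(line):
    """Split a data-name XREF line into modified / referenced /
    compared line numbers, per the M/R/C legend above."""
    parts = line.split()
    out = {"name": parts[0], "defined": int(parts[1]),
           "modified": [], "referenced": [], "compared": []}
    kinds = {"M": "modified", "R": "referenced", "C": "compared"}
    for ref in parts[2:]:
        out[kinds[ref[0]]].append(int(ref[1:]))
    return out

x = parse_data_xref(
    "WS-PAYMENT-AMOUNT  047  M125 M237 R089 R142 R318 C415")
print(x["modified"], x["referenced"], x["compared"])
```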
Using Cross-References for Impact Analysis
When asked "what happens if I change the size of CLM-CHARGED-AMOUNT from PIC S9(07)V99 to PIC S9(09)V99?", the cross-reference tells you every program and every line that references this field:
# Find all programs that reference CLM-CHARGED-AMOUNT
# (search across all source in the library)
grep -rn "CLM-CHARGED-AMOUNT" /path/to/source/*.cbl
# Find the copybook where it is defined
grep -rn "CLM-CHARGED-AMOUNT" /path/to/copybooks/*.cpy
From the cross-reference, you build an impact matrix:
| Program | Lines Affected | Type of Reference | Risk |
|---|---|---|---|
| CLM-ADJ | 125, 237, 318, 415 | M, M, R, C | HIGH — calculates payment from this field |
| CLM-RPT | 089, 142 | R, R | LOW — display only |
| CLM-VOID | 201, 245, 267 | M, R, C | HIGH — reversal logic |
| ELIGCHK | none | none | NONE — does not reference this field |
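A first cut at the "Type of Reference" column can be scripted by classifying each referencing line by the verb around it: MOVE/ADD ... TO and COMPUTE targets count as modifications, IF/WHEN/UNTIL lines as comparisons, everything else as a plain reference. A rough Python sketch (a heuristic only; the compiler XREF remains authoritative):

```python
import re

def classify(line, field):
    """Rough M/R/C classification of one line that names the field."""
    if re.search(rf"\bTO +{field}(?![0-9A-Z-])", line):
        return "M"                      # MOVE/ADD ... TO field
    if re.search(rf"\bCOMPUTE +{field}(?![0-9A-Z-])", line):
        return "M"                      # COMPUTE field = ...
    if re.search(rf"\b(IF|WHEN|UNTIL)\b.*{field}", line):
        return "C"
    return "R"

def impact_matrix(sources, field):
    """sources: {program name: source text}. Returns, per program,
    the referencing lines and their M/R/C classification."""
    matrix = {}
    for program, text in sources.items():
        refs = [(n, classify(line, field))
                for n, line in enumerate(text.splitlines(), 1)
                if field in line]
        if refs:
            matrix[program] = refs
    return matrix

sources = {"CLM-RPT": (
    "           MOVE CLM-CHARGED-AMOUNT TO RPT-AMT\n"
    "           IF CLM-CHARGED-AMOUNT > 50000\n"
    "           MOVE WS-AMT TO CLM-CHARGED-AMOUNT\n")}
print(impact_matrix(sources, "CLM-CHARGED-AMOUNT"))
```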
⚠️ The Hidden Impact: Changing a field's PIC size in a copybook affects every program that copies it — even programs that do not directly reference the field. The copybook change shifts the offsets of every field defined after it, which can corrupt data in programs that use REDEFINES or reference modification on the record.
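The shift is easy to demonstrate on a toy layout. The record below is hypothetical; the byte counts follow packed-decimal sizing (S9(07)V99 COMP-3 occupies 5 bytes, S9(09)V99 COMP-3 occupies 6):

```python
def offsets(fields):
    """Map each (name, bytes) field to its byte offset in the record."""
    pos, table = 0, {}
    for name, size in fields:
        table[name] = pos
        pos += size
    return table

before = [("CLM-ID", 10), ("CLM-CHARGED-AMOUNT", 5),   # S9(07)V99 COMP-3
          ("CLM-PAID-AMOUNT", 5), ("CLM-STATUS", 1)]
after  = [("CLM-ID", 10), ("CLM-CHARGED-AMOUNT", 6),   # S9(09)V99 COMP-3
          ("CLM-PAID-AMOUNT", 5), ("CLM-STATUS", 1)]

# Every field after the change moves by one byte, breaking any
# REDEFINES or reference modification that assumed the old offsets.
for field in ("CLM-PAID-AMOUNT", "CLM-STATUS"):
    print(field, offsets(before)[field], "->", offsets(after)[field])
```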
41.16 Working with Tribal Knowledge: Advanced Techniques
Section 41.9 introduced the concept of tribal knowledge and basic interviewing techniques. Here we explore additional methods for capturing and preserving institutional knowledge.
The Annotated Code Review
Sit with the subject matter expert (SME) and walk through the code on screen, recording their running commentary:
* [SME annotation session — Harold Mercer, 2026-02-15]
*
* Harold: "This paragraph is the heart of the program.
* It was originally just the first three IF blocks.
* The rest was added over the years."
*
4700-SPECIAL-RECON.
* Harold: "These 15 accounts were left over from the
* 1995 conversion. Each has a small rounding error.
* The adjustment amounts are in the PARM file."
PERFORM VARYING WS-ADJ-IDX
FROM 1 BY 1
UNTIL WS-ADJ-IDX > WS-ADJ-COUNT
IF ACCT-NUMBER =
WS-ADJ-ACCT(WS-ADJ-IDX)
* Harold: "The adjustment is always a credit
* because the conversion shorted each account
* by a few cents."
ADD WS-ADJ-AMOUNT(WS-ADJ-IDX)
TO WS-GL-BALANCE
END-IF
END-PERFORM.
*
* Harold: "If the adjustment accounts are ever closed,
* you can remove this entire paragraph and the
* parameter file. Nobody will miss it."
This annotated code becomes permanent documentation. The SME's words, attached to the specific lines they explain, are worth more than any amount of after-the-fact documentation.
The Scenario Walkthrough
Instead of asking general questions about code, walk through specific scenarios:
Scenario: "A claim comes in for $75,000 with a plan type of PPO and no pre-authorization. Walk me through what happens."
The SME traces the execution path while you document each decision point, field assignment, and branch. This produces a concrete trace that validates your understanding of the business rules.
Documentation Templates for Knowledge Transfer
Program Summary Template:
PROGRAM KNOWLEDGE TRANSFER DOCUMENT
====================================
Program: GLRECON
SME: Harold Mercer (retired, phone available)
Date: 2026-02-15
Scribe: Maria Chen
PURPOSE:
[1-2 sentence business description]
WHEN IT RUNS:
[Schedule, dependencies, batch position]
WHAT CAN GO WRONG:
[Top 3 failure modes and recovery procedures]
BUSINESS RULES NOT OBVIOUS FROM CODE:
1. [Rule description, which paragraph, why it exists]
2. [...]
3. [...]
HISTORY / WHY IT IS THE WAY IT IS:
[Key historical context — conversions, mergers,
regulatory changes that shaped the code]
IF YOU NEED TO CHANGE THIS PROGRAM:
[Specific warnings, gotchas, areas of fragility]
CONTACT:
[Who to call if this program fails at 3 AM]
Impact Analysis Tools
Beyond manual grep and cross-reference analysis, several tools specialize in COBOL impact analysis:
IBM ADDI Dependency Analysis:
- Traces data flow across programs (field A in program X feeds field B in program Y via a shared file)
- Identifies all programs affected by a copybook change
- Maps DB2 table usage across all programs

Micro Focus (OpenText) Enterprise Analyzer:
- Provides visual impact analysis diagrams
- Supports "what-if" analysis — change a field and see the full ripple effect
- Tracks data lineage from source to destination across the entire system

BMC Compuware Topaz for Total Test:
- Records actual execution paths during testing
- Identifies code paths that are never exercised (dead code candidates)
- Generates test data from production patterns
📊 Impact Analysis Effort by System Size
| System Size | Programs | Manual Analysis Time | Tool-Assisted Time |
|---|---|---|---|
| Small | 10-50 | 1-3 days | 2-4 hours |
| Medium | 50-500 | 2-4 weeks | 1-3 days |
| Large | 500-5,000 | 3-6 months | 1-3 weeks |
| Enterprise | 5,000+ | Not feasible | 1-3 months |
For systems above 500 programs, manual impact analysis is effectively impossible — the number of cross-program data flows exceeds what a human can track. Tool-assisted analysis is not optional; it is essential.
41.17 Try It Yourself: Building a Cross-Reference Report
Student Lab Exercise
Write a shell script (or COBOL program, if you prefer) that accepts a COBOL source file and produces:
- Paragraph inventory: Every paragraph name, its line number, and how many times it is PERFORMed
- Dead paragraph detection: Paragraphs that are defined but never PERFORMed (excluding the main paragraph)
- Variable inventory: Every WORKING-STORAGE variable with its PIC clause and reference count
- Unused variable detection: Variables defined but never referenced outside their definition
- PERFORM depth analysis: For the main control paragraph, calculate the maximum nesting depth of PERFORM calls
Test your tool against the MYSTERY.cbl program from section 41.12. Compare your tool's output with your manual analysis to verify accuracy.
🧪 Extension Challenge: Enhance your tool to detect PERFORM THRU ranges and flag them with a warning. PERFORM THRU is the single most common source of confusion in legacy COBOL, and automatically identifying these ranges accelerates code understanding significantly.
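As a starting point for the challenge, the range detection itself needs only a regular expression. A minimal Python sketch (the function name is ours; it accepts both THRU and THROUGH):

```python
import re

THRU = re.compile(r"PERFORM\s+([0-9A-Z][0-9A-Z-]*)\s+"
                  r"(?:THRU|THROUGH)\s+([0-9A-Z][0-9A-Z-]*)")

def find_thru_ranges(source):
    """Return (start-paragraph, end-paragraph, line-no) for every
    PERFORM THRU in the source text."""
    return [(m.group(1), m.group(2), n)
            for n, line in enumerate(source.splitlines(), 1)
            if (m := THRU.search(line))]

src = """\
           PERFORM 2000-PROCESS THRU 2000-EXIT
           PERFORM 2100-READ-INPUT
           PERFORM 3000-CALC THROUGH 3000-EXIT
"""
for start, end, line_no in find_thru_ranges(src):
    print(f"WARNING line {line_no}: PERFORM {start} THRU {end}")
```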
41.18 Documentation Templates for Legacy Systems
When you complete a code archaeology project, the documentation you produce becomes the authoritative reference for everyone who touches the system. Having standard templates ensures consistency and completeness.
System Overview Template
SYSTEM OVERVIEW: [System Name]
================================
Date: [YYYY-MM-DD]
Author: [Name]
Reviewed By: [Name, Date]
1. BUSINESS PURPOSE
[2-3 sentences describing what this system does
in business terms, not technical terms]
2. SYSTEM BOUNDARIES
Inputs:
- [Source system] → [File/Queue/API] → [Program]
- ...
Outputs:
- [Program] → [File/Queue/API] → [Target system]
- ...
3. PROGRAM INVENTORY
[Program ID] [Lines] [Purpose] [Criticality]
CLMADJ01 4,247 Claims adjudication HIGH
CLMVAL01 2,891 Claims validation HIGH
CLMRPT01 1,456 Claims reporting MEDIUM
...
4. BATCH SCHEDULE
[Job Name] [Schedule] [Duration] [Dependencies]
CLMBATCH1 Daily 23:00 45 min None
CLMBATCH2 Daily 23:45 90 min CLMBATCH1
...
5. DATA STORES
[Dataset/Table] [Type] [Records] [Owner]
MEDCL.CLAIMS.PENDING VSAM ~500K CLM-INT
MEDCL.CLAIMS.ADJUDICATED SEQ ~18K/day CLM-ADJ
MEMBER_COVERAGE DB2 ~2M ELIGCHK
...
6. KNOWN ISSUES
- [Issue description, workaround, risk level]
- ...
7. CHANGE HISTORY (from source control or Endevor)
[Date] [Program] [Developer] [Description]
2019-04-17 CLMADJ01 J.Ramirez Added HSA plan type
...
8. CONTACTS
Primary: [Name, phone, email]
Backup: [Name, phone, email]
Business: [Name, phone, email]
Job Stream Map Template
JOB STREAM MAP: [Stream Name]
==============================
Date: [YYYY-MM-DD]
TRIGGER: [Time/Event/Predecessor]
STEP 1: [Job Name]
Program: [Program ID]
Input: [File(s)]
Output: [File(s)]
Control: [Control total file]
Recovery: [RERUN/RESTART/STOP]
Max RC: [0/4/8]
Notes: [Special considerations]
↓ (RC ≤ 4)
STEP 2: [Job Name]
Program: [Program ID]
Input: [File(s) — output from Step 1]
Control: [Verify against Step 1 control totals]
...
↓ (RC ≤ 4) ↓ (RC = 4, warning)
STEP 3a: [Job] STEP 3b: [Alert Job]
... Notify operations team
Business Rule Catalog Template
BUSINESS RULE CATALOG: [System Name]
=====================================
Version: [N.N]
Date: [YYYY-MM-DD]
Verified By: [Business Analyst Name]
CATEGORY: [Category Name, e.g., "Eligibility Determination"]
BR-[NNN]: [Rule Name]
Source: [Program, Paragraph, Line]
Description: [Plain language description of the rule]
Condition: [Technical condition from code]
Action: [What happens when condition is true]
Action: [What happens when condition is false]
Regulatory: [Regulatory citation, if applicable]
Effective: [Date rule became effective]
Verified: [Y/N] [Date] [Initials]
Notes: [Any additional context]
BR-001: Minimum Eligibility Age
Source: ELIGCHK, 1000-CHECK-MEMBER, line 142
Description: Member must be at least 18 years old for
individual coverage, or be a dependent
under 26 for family coverage
Condition: MEMBER-AGE < 18 AND MEMBER-TYPE ≠ 'DEP'
Action TRUE: Set COMM-NOT-ELIGIBLE, msg 'Under 18'
Action FALSE: Continue eligibility check
Regulatory: ACA Section 2714 (dependents to age 26)
Effective: 2010-09-23
Verified: Y 2026-02-15 SK
Notes: Age 26 cutoff applies to end of birth month
✅ Best Practice: Business rule catalogs should be living documents maintained in version control alongside the source code. When a developer changes a business rule in the code, the corresponding catalog entry should be updated in the same commit. At MedClaim, Sarah Kim reviews every pull request that modifies an EVALUATE or complex IF statement to ensure the business rule catalog stays current.
41.19 MedClaim: The Full Archaeology Report
To illustrate what a complete code archaeology output looks like, here is the table of contents from James Okafor's documentation of the CLM-ADJUD program — the 7,200-line adjudication engine described in section 41.11.
The Deliverables
CLM-ADJUD SYSTEM DOCUMENTATION
Version 1.0 — 2026-03-01
Prepared by: James Okafor, Sarah Kim
Table of Contents:
1. Executive Summary (2 pages)
- What CLM-ADJUD does in business terms
- Why this documentation was created
- Key findings and recommendations
2. System Overview (5 pages)
- Architecture diagram
- Input/output file descriptions
- Batch schedule and dependencies
- Program inventory
3. Data Dictionary (28 pages)
- 147 fields documented
- Each field: name, PIC, description, valid values,
source, destination, business meaning
4. Call Graph (4 pages)
- Visual hierarchy of all 89 paragraphs
- PERFORM THRU ranges highlighted
- Dead code paragraphs identified (3 found)
5. Business Rule Catalog (35 pages)
- 112 rules documented
- Each rule: ID, source location, description,
conditions, actions, regulatory citations
- Verification status (82 confirmed, 19 undocumented,
7 superseded, 4 incorrect)
6. Data Flow Diagrams (8 pages)
- Input-to-output flow for charge amount
- Input-to-output flow for payment amount
- Accumulator flows (deductible, OOP, totals)
7. Known Issues and Recommendations (3 pages)
- 7 superseded rules to remove
- 4 incorrect rules to fix (PRIORITY: HIGH)
- 3 dead code paragraphs to remove
- 23 PERFORM THRU ranges to refactor (long-term)
8. Appendices (12 pages)
- Cross-reference listing
- Test case inventory
- Revision history
This documentation package took three weeks to produce — one week of James reading code and building the call graph, one week of rule extraction with Sarah Kim verifying each rule, and one week of writing and reviewing. It has since been referenced hundreds of times by developers, business analysts, and compliance auditors.
"Three weeks of work that saves hundreds of hours over its lifetime," James said. "The ROI is infinite."
🔵 MedClaim Lesson: Documentation is not a luxury or an afterthought — it is a deliverable. At MedClaim, every major code archaeology project now produces a documentation package following this template. The packages are stored in the same Git repository as the source code, reviewed through pull requests, and updated when the code changes.
41.20 Dead Code Detection and Removal
Legacy programs accumulate dead code over decades — paragraphs that were once active but are no longer PERFORMed, variables that were once used but are now unused, and commented-out code blocks that clutter the program.
Types of Dead Code in COBOL
Unreachable paragraphs: Paragraphs that are never PERFORMed or GO TO'd from any other paragraph (CALL targets whole programs, not paragraphs). They may have been active in a previous version but were orphaned when the calling logic was changed.
Unused variables: WORKING-STORAGE fields that are defined but never referenced in the PROCEDURE DIVISION. They may have been used by a paragraph that was removed, or they may be remnants of a planned feature that was never implemented.
Commented-out code: Large blocks of commented code are a common legacy pattern. Developers commented out code instead of deleting it, "just in case." After 20 years, nobody remembers what the commented code was for or whether it is still relevant.
Conditional dead code: Code that is technically reachable but can never execute because the condition that triggers it can never be true. For example, an EVALUATE WHEN clause for an account type that no longer exists in the system.
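Conditional dead code cannot be found by call-graph analysis alone, but a heuristic helps: compare the literals in WHEN clauses against the values that still occur in current reference data. A Python sketch with hypothetical plan-type codes:

```python
import re

LIVE_PLAN_TYPES = {"HMO", "PPO", "POS", "HDHP"}   # from current reference data

def stale_when_clauses(source, live_values):
    """Flag EVALUATE WHEN literals that no longer occur in the data,
    i.e. candidates for conditional dead code."""
    stale = []
    for n, line in enumerate(source.splitlines(), 1):
        m = re.search(r"WHEN\s+'([^']*)'", line)
        if m and m.group(1) not in live_values:
            stale.append((n, m.group(1)))
    return stale

src = """\
           EVALUATE CLM-PLAN-TYPE
               WHEN 'PPO'  PERFORM 3100-PPO-PRICING
               WHEN 'HMO'  PERFORM 3200-HMO-PRICING
               WHEN 'IND'  PERFORM 3300-INDEMNITY-PRICING
           END-EVALUATE
"""
print(stale_when_clauses(src, LIVE_PLAN_TYPES))
```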
Detecting Dead Code
#!/bin/bash
# dead-code-detector.sh — Find unreachable paragraphs
# Assumes fixed-format source: paragraph names start in Area A (column 8)
PROGRAM=$1
echo "=== Dead Code Report for $PROGRAM ==="
# Get all paragraph names
PARAGRAPHS=$(grep "^       [0-9A-Z][0-9A-Z-]*\." "$PROGRAM" | \
  awk '{print $1}' | tr -d '.')
# Get the main paragraph (first paragraph)
MAIN_PARA=$(echo "$PARAGRAPHS" | head -1)
echo ""
echo "Unreachable Paragraphs:"
for PARA in $PARAGRAPHS; do
  if [ "$PARA" = "$MAIN_PARA" ]; then
    continue # Skip the main paragraph
  fi
  # Check if this paragraph is PERFORMed or GO TO'd anywhere.
  # The trailing pattern keeps 2200-VALIDATE from matching inside
  # a longer name like 2200-VALIDATE-EXIT.
  REFS=$(grep -cE "PERFORM .*${PARA}([^0-9A-Z-]|$)" "$PROGRAM")
  GOTO_REFS=$(grep -cE "GO TO .*${PARA}([^0-9A-Z-]|$)" "$PROGRAM")
  TOTAL=$((REFS + GOTO_REFS))
  if [ "$TOTAL" -eq 0 ]; then
    LINE=$(grep -n "^       ${PARA}\." "$PROGRAM" | \
      head -1 | cut -d: -f1)
    echo " WARNING: $PARA (line $LINE) — never called"
  fi
done
echo ""
echo "Unused Variables:"
# Check 05-level variables in WORKING-STORAGE
grep -E "^ +05 +[A-Z][A-Z0-9-]*" "$PROGRAM" | \
  awk '{print $2}' | while read -r VAR; do
    # Count whole-name references (the definition counts as one)
    REFS=$(grep -cE "(^|[^0-9A-Z-])${VAR}([^0-9A-Z-]|$)" "$PROGRAM")
    if [ "$REFS" -le 1 ]; then
      LINE=$(grep -nE "^ +05 +${VAR}([^0-9A-Z-]|$)" "$PROGRAM" | \
        head -1 | cut -d: -f1)
      echo " WARNING: $VAR (line $LINE) — defined but unused"
    fi
  done
Safe Dead Code Removal
Removing dead code requires caution. Before deleting:
1. Verify with the compiler cross-reference — do not rely solely on grep. The compiler's XREF listing is authoritative.
2. Check for PERFORM THRU ranges — a paragraph that appears unreachable may be within a PERFORM THRU range and executed implicitly.
3. Check for indirect references — the paragraph might be reached through a GO TO ... DEPENDING ON or an ALTERed GO TO (rare, but possible).
4. Test thoroughly — compile and test after every removal. Remove one paragraph or variable at a time.
5. Keep a record — document what you removed and why, in the commit message and in the change log.
* DEAD CODE REMOVED — 2026-03-10 — Maria Chen
* Paragraph 4700-SPECIAL-RECON removed.
* Per Harold Mercer: adjustment accounts from 1995
* conversion have all been closed. Parameter file
* GBANK.GLRECON.ADJPARM also deleted.
* Approval: Change Request CR-2026-0147
⚠️ The "Just In Case" Trap: Developers are often reluctant to remove dead code, reasoning "we might need it later." This is almost always wrong. Dead code is not a safety net — it is a liability. It confuses future developers, clutters search results, and inflates complexity metrics. If the code is in source control, it can always be retrieved from history if needed. Remove dead code aggressively, with proper documentation and testing.
41.21 GlobalBank: The Retirement Knowledge Transfer
When Robert announced his retirement with 18 months' notice, Maria Chen initiated a structured knowledge transfer program. Robert maintained 23 COBOL programs, including several that only he fully understood.
The Knowledge Transfer Schedule
MONTH 1-3: INVENTORY AND PRIORITY
- Robert documented all 23 programs he maintained
- Ranked by risk: HIGH (no one else understands),
MEDIUM (partial understanding), LOW (well-documented)
- 7 programs ranked HIGH, 9 MEDIUM, 7 LOW
MONTH 4-9: HIGH-RISK PROGRAMS
- Robert paired with Derek for 2 HIGH programs
- Robert paired with Jasmine for 2 HIGH programs
- Robert paired with Ananya for 3 HIGH programs
- Each pair: 2 weeks of annotated code review,
1 week of shadowed maintenance, 1 week of solo
maintenance with Robert available for questions
MONTH 10-14: MEDIUM-RISK PROGRAMS
- Same pairing approach, accelerated schedule
- 1 week annotated review, 1 week shadowed, then solo
MONTH 15-17: VALIDATION
- Each transferee independently handles a maintenance
request on their assigned programs
- Robert reviews the change without doing the work
- Any knowledge gaps identified and addressed
MONTH 18: ROBERT'S LAST MONTH
- Final documentation review
- "Robert's Rules" document: a collection of tips,
gotchas, and institutional knowledge that did not
fit anywhere else
- Farewell presentation to the team
Robert's Rules (excerpts)
ROBERT'S RULES — Things I Wish Someone Had Told Me
═══════════════════════════════════════════════════
1. GBGLREC runs after GBPOST but before GBSTMT.
If you ever need to rerun GBGLREC, you MUST also
rerun GBSTMT afterward, because GBSTMT reads the
reconciled balances, not the pre-recon balances.
2. The month-end close job (GBCLOSE) must run BEFORE
midnight on the last business day, not after.
The reason is that it uses FUNCTION CURRENT-DATE
to determine the month, and if it runs at 12:01 AM,
it thinks it is the next month.
3. Never change the SORT parameters in GBSORT01
without checking with the VALIDATE team. The sort
output format is not documented anywhere except
in the GBVALID copybook (TXNSRTCPY).
4. Paragraph 3800 in GBINTCALC has a rounding adjustment
   that adds 0.005 before truncating to 2 decimal places.
   This is intentional — it implements round-half-up
   rounding of interest amounts, per the bank's
   regulatory guidance.
5. If GBARCHIVE fails, do NOT rerun it without first
checking whether the GDG generation was created.
A partial GDG generation can cause the next night's
archive to roll off a good generation.
These five rules took Robert 30 seconds each to write. They encode decades of experience that would have taken his successors months to discover independently — if they discovered them at all.
💡 Key Insight: The most valuable knowledge in any legacy system is the "why," not the "what." The code tells you what it does. Only the people who built and maintained it can tell you why it does it that way. Capturing this knowledge before people leave is the single highest-ROI activity in legacy system management.
41.22 Chapter Summary
Legacy code archaeology is not glamorous, but it is one of the most valuable skills a COBOL developer can possess. The 220 billion lines of COBOL in production today were not all written yesterday — most of them have been running for decades, accumulating business rules, workarounds, and institutional knowledge that exists nowhere except in the code itself.
The systematic approach — external evidence first, then data division survey, then control flow tracing, then data flow analysis, then business rule extraction — turns a daunting wall of uncommented code into a manageable, understandable system. Tools help (grep, ISPF FIND, IBM ADDI), but the most important tool is a disciplined mind that asks the right questions in the right order.
And sometimes, the most important step is picking up the phone and calling Harold Mercer.
Maria Chen's advice to Derek Washington, who was staring at GLRECON for the first time: "Do not try to understand every line. Understand the story. What goes in, what comes out, and what decisions happen in between. The details will fill in as you need them."
Derek's response, after completing his first code archaeology assignment: "I feel like an anthropologist who just deciphered an ancient language. Except the language is still in production and moves three trillion dollars a day."
🔗 Looking Ahead: Chapter 42, the final chapter, looks to the future. What role will COBOL play in 2030 and beyond? How will AI change COBOL development? What does the career path look like? All five themes of this textbook converge in the final chapter.