

Chapter 35: AI-Assisted COBOL — Using LLMs for Code Understanding, Documentation Generation, and Assisted Refactoring

"The first time I fed a 4,000-line payroll program into an LLM and it produced a paragraph-by-paragraph summary that was 90% accurate, I thought two things: 'This changes everything' and 'That remaining 10% could bankrupt us.'" — Sandra Okonkwo, Federal Benefits Administration

Spaced Review: Before we begin, recall two concepts that will intersect deeply with AI-assisted development. From Chapter 3, remember how the Language Environment manages runtime services — AI tools frequently misunderstand LE conventions, producing code that compiles but fails at runtime. From Chapter 32, recall our modernization strategy framework — AI tools are accelerators within a strategy, never the strategy itself.


35.1 AI Meets the Mainframe: Why Now, What's Different

For thirty years, the mainframe modernization conversation was dominated by two camps: the "rip and replace" advocates who wanted to rewrite everything in Java, and the "if it ain't broke" defenders who resisted any change. Both camps missed something fundamental. Starting around 2023, a third option emerged that neither side anticipated — using artificial intelligence to make the existing COBOL systems more understandable, better documented, and incrementally improvable without wholesale replacement.

The Convergence of Three Forces

Three forces converged to make AI-assisted COBOL practical rather than theoretical.

Force One: The Knowledge Crisis. The retirement wave we've discussed throughout this book is not a future problem — it is a present emergency. At CNB, Kwame Williams estimates that 40% of the institutional knowledge about their core banking system exists only in the heads of developers who will retire within five years. Rob Chen's deep understanding of the batch settlement logic, Lisa Park's encyclopedic knowledge of the copybook hierarchies — this knowledge is walking out the door. The question is no longer whether to use AI tools but how to use them before it's too late.

Force Two: LLM Capabilities with Structured Languages. Large language models trained on code have a somewhat paradoxical relationship with COBOL. On one hand, COBOL is underrepresented in training data compared to Python or JavaScript. On the other hand, COBOL's verbose, English-like syntax and rigid structure make it more amenable to AI comprehension than many modern languages. A well-structured COBOL program reads almost like pseudocode. The paragraph names, the PERFORM structures, the explicit data definitions in the DATA DIVISION — these give AI models more context to work with than a terse Rust program with implicit types and complex lifetime annotations.

Force Three: Enterprise AI Investment. IBM, Broadcom, and the major mainframe vendors have all invested heavily in AI-assisted development tools. IBM's watsonx Code Assistant for Z, Broadcom's AI capabilities in their DevOps toolchain, and open-source tools built on GPT-4, Claude, and other foundation models have created an ecosystem that didn't exist even two years ago. These aren't research prototypes — they're production tools being used at scale.

What AI Can and Cannot Do with COBOL

Let me be direct about the current state of affairs, because the vendor marketing materials will not be:

What AI does well with COBOL:

  • Summarizing what a program or paragraph does in plain English
  • Generating documentation from code structure
  • Identifying patterns in data flow and control flow
  • Suggesting test cases based on code logic
  • Finding dead code and unreachable paragraphs
  • Translating COBOL idioms for developers who know modern languages
  • Generating boilerplate code for standard patterns

What AI does poorly with COBOL:

  • Understanding system-level context (JCL dependencies, CICS transaction flow, DB2 plan binds)
  • Handling implicit behaviors (COBOL's many default behaviors that aren't in the source)
  • Recognizing business rules encoded in data values rather than program logic
  • Understanding the implications of COMP-3 vs. COMP vs. DISPLAY formatting
  • Correctly handling the interactions between programs in a batch chain
  • Understanding site-specific naming conventions and their semantic meaning
  • Anything involving timing, resource contention, or concurrency on z/OS

Yuki Tanaka at SecureFirst puts it well: "The AI is like a brilliant new hire who's read every COBOL textbook ever written but has never touched a production system. Technically excellent, contextually naive."

The Tool Landscape

Understanding the current tool landscape helps you make informed choices. The tools fall into three categories:

Vendor-specific tools are built by mainframe platform vendors and deeply integrated with their ecosystems. IBM's watsonx Code Assistant for Z is the most prominent, offering COBOL code explanation, test generation, and Java transformation assistance integrated with IBM's IDz IDE and z/OS development environment. Broadcom's AI-powered features in their DevOps toolchain provide code analysis within Endevor-based workflows. These tools have the advantage of deep platform integration but the disadvantage of vendor lock-in and often limited model flexibility.

General-purpose LLMs — GPT-4, Claude, Gemini, and their successors — are not COBOL-specific but are remarkably capable with COBOL due to their broad training. Their advantage is flexibility and rapid improvement; their disadvantage is lack of mainframe-specific knowledge (JCL, CICS, VSAM nuances). They work best for code comprehension and documentation, less well for platform-specific code generation.

Open-source and community tools are emerging from the Open Mainframe Project and community efforts. These include prompt libraries optimized for COBOL analysis, evaluation harnesses for measuring AI accuracy on COBOL tasks, and integration frameworks that connect general-purpose LLMs to mainframe development workflows. They're less polished than commercial tools but offer transparency and customizability.

At CNB, Kwame's team evaluated all three categories and settled on a hybrid approach: IBM watsonx for build-integrated code analysis (where deep z/OS integration matters), Claude for documentation generation and code explanation (where language quality and reasoning depth matter), and community prompt templates as the starting point for their own prompt library. This hybrid approach avoids vendor lock-in while leveraging each tool's strengths.

The HA Banking System Context

Throughout this chapter, we'll apply AI-assisted techniques to our progressive project — the High-Availability Banking Transaction Processing System. This is exactly the kind of system where AI tools shine: large, complex, business-critical code that needs better documentation and careful modernization. We will not, however, use AI as a shortcut. Every AI-generated artifact will go through the same rigorous review process we'd apply to any production change.

A word of warning before we proceed: this chapter will give you powerful tools. Used well, they will make you and your team dramatically more productive. Used carelessly, they will introduce subtle defects into production systems that process real money for real people. The difference between "used well" and "used carelessly" is the rigor of your review process. Never skip the review. Never assume the AI is right. Never deploy AI-generated code or documentation without human verification. If you take away only one thing from this chapter, let it be this: AI-assisted does not mean AI-autonomous.


35.2 Code Understanding: LLMs Reading COBOL

The most immediately valuable application of AI to COBOL systems is code comprehension. When a developer inherits a 10,000-line program with no documentation and the original author retired in 2008, an AI that can explain what the code does — even imperfectly — provides enormous value.

Paragraph-Level Summarization

The fundamental unit of AI-assisted code understanding is the paragraph summary. You feed the AI a COBOL paragraph and ask it to explain what the paragraph does. Here's a real-world example from CNB's deposit processing system:

       2100-VALIDATE-ACCT-STATUS.
           MOVE 'N' TO WS-VALID-FLAG
           EVALUATE WS-ACCT-STATUS-CD
               WHEN 'A'
                   IF WS-ACCT-HOLD-CD = SPACES
                       MOVE 'Y' TO WS-VALID-FLAG
                   ELSE
                       IF WS-HOLD-OVERRIDE-AUTH = 'Y'
                           MOVE 'Y' TO WS-VALID-FLAG
                           ADD 1 TO WS-HOLD-OVERRIDE-CTR
                       ELSE
                           MOVE 'ACCT ON HOLD' TO WS-ERROR-MSG
                           ADD 1 TO WS-HOLD-REJECT-CTR
                       END-IF
                   END-IF
               WHEN 'D'
                   MOVE 'ACCT DORMANT' TO WS-ERROR-MSG
                   ADD 1 TO WS-DORMANT-CTR
               WHEN 'C'
                   MOVE 'ACCT CLOSED' TO WS-ERROR-MSG
                   ADD 1 TO WS-CLOSED-CTR
               WHEN OTHER
                   MOVE 'INVALID STATUS' TO WS-ERROR-MSG
                   ADD 1 TO WS-ERROR-CTR
           END-EVALUATE.

A well-prompted AI will produce something like:

"Paragraph 2100-VALIDATE-ACCT-STATUS validates whether a bank account is eligible for processing. It checks the account status code: Active accounts ('A') pass validation unless they have a hold, in which case an override authorization is required. Dormant ('D'), Closed ('C'), and invalid status codes are rejected with appropriate error messages. The paragraph maintains separate counters for each rejection reason and hold overrides, likely for batch reporting."

That summary is accurate and useful. A developer unfamiliar with this codebase can read it and understand the paragraph's purpose in seconds rather than minutes.

Effective Prompting for COBOL Comprehension

The quality of AI output depends heavily on the quality of the prompt. Through extensive experimentation at CNB, Pinnacle, Federal Benefits, and SecureFirst, we've identified the prompt patterns that produce the best results for COBOL comprehension.

Pattern 1: Context-First Prompting

Always provide context before the code. The AI needs to know what kind of system it's looking at:

System Context: This is a COBOL batch program running on z/OS that
processes daily deposit transactions for a commercial bank. It reads
from a VSAM KSDS keyed by account number and writes to a DB2 table.

Paragraph Context: This paragraph is called from the main processing
loop after the input record has been validated for format but before
the account balance is updated.

[COBOL code here]

Task: Explain what this paragraph does, identify any business rules
it implements, and note any potential issues or edge cases.
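The context-first pattern is easy to standardize in tooling so that every analyst supplies the same fields in the same order. A minimal Python sketch (the function and argument names are illustrative, not part of any vendor product):

```python
def build_context_first_prompt(system_context: str,
                               paragraph_context: str,
                               cobol_source: str,
                               task: str) -> str:
    """Assemble a context-first prompt: context before code, task last."""
    return (
        f"System Context: {system_context}\n\n"
        f"Paragraph Context: {paragraph_context}\n\n"
        f"{cobol_source}\n\n"
        f"Task: {task}"
    )

prompt = build_context_first_prompt(
    system_context="COBOL batch program on z/OS processing daily "
                   "deposit transactions for a commercial bank.",
    paragraph_context="Called from the main loop after format "
                      "validation, before the balance update.",
    cobol_source="       2100-VALIDATE-ACCT-STATUS. ...",
    task="Explain what this paragraph does, identify business rules, "
         "and note edge cases.",
)
```

The helper enforces exactly the ordering the pattern calls for: context first, then the code, with the task last.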

Pattern 2: Copybook-Inclusive Prompting

COBOL programs are meaningless without their copybooks. Always include the relevant data definitions:

The following copybook defines the account master record:
[COPY ACCTMSTR code]

The following working storage defines the processing flags:
[relevant WS fields]

Given these definitions, explain what the following paragraph does:
[COBOL code]

Without the copybook, the AI is guessing what WS-ACCT-STATUS-CD means. With it, the AI knows the field is PIC X(1) and can reason about its valid values.

Pattern 3: Chain-of-Paragraphs Analysis

For understanding control flow, feed the AI a sequence of related paragraphs and ask it to trace the execution path:

The following paragraphs are called in sequence from paragraph
2000-PROCESS-TRANSACTION via PERFORM statements. Trace the complete
processing flow and identify the business rules implemented across
all paragraphs:

[Multiple paragraphs]

This produces better results than analyzing paragraphs in isolation because the AI can see how data flows between them.

Pattern 4: Role-Based Prompting

Assigning the AI a role can significantly improve output quality for COBOL analysis:

You are a senior COBOL systems analyst with 25 years of experience
in z/OS banking systems. You are reviewing code written by another
developer and producing documentation for the maintenance team.
Your documentation should be precise, technically accurate, and
assume the reader understands COBOL but not this specific system.

This pattern produces more focused, technically rigorous output than generic prompting because the AI adopts the communication style and knowledge assumptions appropriate for the role.

Pattern 5: Iterative Refinement

Don't expect perfect output on the first attempt. The most effective workflow is iterative:

  1. First pass: Broad summary with the context-first pattern
  2. Review the output and identify gaps or errors
  3. Second pass: Targeted questions about the gaps ("You mentioned the paragraph validates the account. What happens if the account type field is spaces?")
  4. Third pass: Ask the AI to consolidate its analysis into final documentation

At Pinnacle, Ahmad found that three-pass analysis consistently produced better documentation than a single detailed prompt. "The first pass gives me the big picture. The second pass catches what it missed. The third pass pulls it together into something I'd actually put in our documentation system."
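The three-pass workflow can be scripted around whatever model your shop uses. A hedged Python sketch, where `call_llm` stands in for any function that sends a prompt string and returns the model's text (no specific vendor API is assumed):

```python
def three_pass_analysis(code: str, context: str, call_llm) -> str:
    """Three-pass iterative refinement. `call_llm` is a placeholder
    for any function that takes a prompt and returns model output."""
    # Pass 1: broad summary using context-first prompting.
    summary = call_llm(
        f"{context}\n\n{code}\n\n"
        "Task: Summarize this code and list the business rules it "
        "implements.")
    # Pass 2: targeted gap-finding against the model's own output.
    gaps = call_llm(
        f"Earlier summary:\n{summary}\n\nRe-read the code:\n{code}\n\n"
        "Task: Identify anything the summary missed or got wrong, "
        "especially edge cases and COBOL default behaviors.")
    # Pass 3: consolidate into documentation-quality output.
    return call_llm(
        f"Summary:\n{summary}\n\nCorrections:\n{gaps}\n\n"
        "Task: Consolidate into final documentation for the "
        "maintenance team.")
```

Because `call_llm` is injected, the same driver works against watsonx, Claude, or a local evaluation stub, and the human reviewer still vets the final pass.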

Whole-Program Analysis

Beyond individual paragraphs, AI tools can analyze entire programs to produce structural summaries. For a 5,000-line COBOL program, you can ask the AI to produce:

  • A program structure diagram showing the PERFORM hierarchy
  • A list of all external interfaces (files, databases, called programs)
  • A summary of the main processing logic in numbered steps
  • A catalog of all business rules with the paragraph where each is implemented

The key to whole-program analysis is managing the AI's context window. A 5,000-line COBOL program plus its copybooks may exceed the context window of some models. In that case, you need to chunk the analysis — feed the DATA DIVISION first to establish field definitions, then feed the PROCEDURE DIVISION in logical sections (the main loop, the validation paragraphs, the update paragraphs, etc.).

Lisa Park at CNB developed a chunking strategy that she calls "top-down analysis":

  1. Feed the AI only the PERFORM statements from the main paragraph (the "spine" of the program)
  2. Ask for a high-level processing flow based on paragraph names and PERFORM sequence
  3. Then drill into each major section with the full paragraph source
  4. Finally, ask the AI to reconcile its section-level analysis into a coherent whole-program summary

This top-down approach produces better results than feeding the entire program at once because it mirrors how a human analyst would read the code — starting with the high-level flow and drilling down into details.
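Step 1 of Lisa's approach, extracting the spine, is mechanical enough to script before any AI is involved. A small Python sketch (a regex heuristic over free-form source, not a real COBOL parser):

```python
import re

def extract_spine(main_paragraph: str) -> list:
    """Return the PERFORM targets of the main paragraph, in order.
    Heuristic: captures the first name after each PERFORM keyword."""
    return re.findall(r"\bPERFORM\s+([0-9A-Z][0-9A-Z-]*)", main_paragraph)

main = """\
       0000-MAINLINE.
           PERFORM 1000-INITIALIZE
           PERFORM 2000-PROCESS-TRANSACTION
               UNTIL WS-EOF-FLAG = 'Y'
           PERFORM 9000-WRAP-UP
           GOBACK.
"""
print(extract_spine(main))
# → ['1000-INITIALIZE', '2000-PROCESS-TRANSACTION', '9000-WRAP-UP']
```

Feeding only this spine to the model, with the paragraph names intact, gives it the high-level flow before you drill into any section's full source.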

Data Flow Analysis

One of the most powerful applications is tracing how a data item moves through a program. Consider asking an AI to trace every reference to a specific field:

In the following COBOL program, trace every reference to the field
WS-NET-AMOUNT. For each reference, indicate whether it's a read,
write, or both, what paragraph it occurs in, and what the business
significance of that operation is.

For the HA banking system, this technique is invaluable. Ahmad at Pinnacle used it to trace the flow of the transaction amount field through their settlement program and discovered three redundant validation checks that had been added by different developers over the years — each apparently unaware the previous check existed. The three checks were not identical — they had slightly different thresholds and error messages — which raised the question of which was the authoritative validation. Ahmad's analysis led to a consolidation that eliminated 47 lines of dead logic and made the validation behavior consistent and predictable.

Data flow analysis is also invaluable for security auditing. By tracing sensitive fields (Social Security numbers, account numbers, passwords) through a program, you can identify every location where the data is stored, transmitted, or displayed — and verify that appropriate protection (masking, encryption, access control) is applied at each point. At Federal Benefits, Sandra's team used AI data flow analysis to audit their beneficiary PII handling and found two programs that wrote unmasked Social Security numbers to a print file. The finding led to an immediate remediation that resolved a potential FISMA compliance violation.
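A simple mechanical pre-pass can locate every candidate reference before the AI explains its business significance, and doubles as a checklist for verifying the AI's trace. A heuristic Python sketch (not a parser; note that ADD ... TO is really read-modify-write, which this heuristic classifies as a write):

```python
import re

def trace_field(source: str, field: str) -> list:
    """For each line referencing `field`, return (line number,
    read/write classification, text). Heuristic: MOVE/ADD/SUBTRACT
    ... TO field and COMPUTE field count as writes; everything else
    is treated as a read. Lookarounds keep WS-NET-AMOUNT from
    matching inside WS-NET-AMOUNT-SAVE."""
    name = rf"(?<![0-9A-Z-]){re.escape(field)}(?![0-9A-Z-])"
    refs = []
    for num, line in enumerate(source.splitlines(), 1):
        if not re.search(name, line):
            continue
        is_write = (re.search(rf"\b(?:MOVE|ADD|SUBTRACT)\b.*\bTO\s+{name}", line)
                    or re.search(rf"\bCOMPUTE\s+{name}", line))
        refs.append((num, "write" if is_write else "read", line.strip()))
    return refs
```

Comparing this mechanical reference list against the AI's narrative trace quickly exposes references the model skipped.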

Cross-Program Analysis

The real power emerges when you analyze how programs interact. In a batch chain where Program A writes a file that Program B reads, understanding the contract between them requires understanding both programs. Feed the AI both programs' FILE SECTION and PROCEDURE DIVISION, along with the JCL that chains them, and ask it to describe the interface contract.

At Federal Benefits, Sandra's team used this approach to document the interfaces between seventeen programs in their eligibility determination chain. What had been three months of manual analysis was compressed into two weeks of AI-assisted documentation followed by two weeks of expert review and correction.

The 10% Problem

Remember Sandra's quote at the opening of this chapter. The 90% accuracy sounds impressive until you realize the 10% error rate in a financial system is catastrophic. Common AI errors in COBOL comprehension include:

Misunderstanding REDEFINES: The AI may not recognize that two differently-named fields occupy the same storage, leading to incorrect data flow analysis.

Missing implicit COMPUTE truncation: When the AI summarizes arithmetic, it may not account for the fact that COBOL truncates rather than rounds by default, or that the target field size determines the precision.

Ignoring COPY REPLACING: When copybooks are included with REPLACING clauses, the AI sometimes analyzes the original copybook text rather than the replaced version.

Overlooking condition names (88-levels): The AI may describe a check as "IF WS-ACCT-TYPE = '3'" rather than recognizing that value '3' corresponds to 88-level ACCT-IS-SAVINGS, which gives the check its business meaning.

PERFORM THRU misinterpretation: The AI may not realize that PERFORM para-A THRU para-A-EXIT executes everything between those two paragraph labels, including any paragraphs defined in between.

Every AI-generated comprehension artifact must be reviewed by someone who understands both COBOL and the business domain. This is not optional.

Practical Accuracy Benchmarks

Based on experience across all four anchor organizations, here are the accuracy benchmarks you should expect for different types of AI comprehension tasks:

Task                                    Expected Accuracy   Common Error Types
Paragraph summary (with copybooks)      85-92%              Business context, implicit behavior
Paragraph summary (without copybooks)   55-70%              Field type assumptions, wrong data semantics
Data flow trace (single program)        78-85%              Missing REDEFINES, group MOVEs, CORRESPONDING
Data flow trace (cross-program)         60-75%              Missing dynamic CALLs, file-based interfaces
Control flow analysis                   85-90%              PERFORM THRU ranges, SECTION-level flow
Business rule identification            70-80%              Data-value encoded rules, historical context

These benchmarks are based on production usage, not vendor claims. Your results may vary depending on the AI model, the quality of your prompts, the complexity of your code, and the amount of context you provide. Track your own accuracy metrics and adjust your process accordingly.

The accuracy difference between "with copybooks" and "without copybooks" deserves special emphasis. Including the relevant data definitions in your prompt can improve accuracy by 20-30 percentage points. This is the single highest-leverage improvement you can make to your AI-assisted comprehension workflow.
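Tracking your own metrics can be as simple as having reviewers score each AI output as claims made versus claims confirmed, then rolling the scores up per task type. A minimal Python sketch (the record fields are illustrative):

```python
from collections import defaultdict

def accuracy_by_task(reviews: list) -> dict:
    """Roll expert review scores up into percent-correct per task.
    Each record: {"task": task type, "claims": statements the AI
    made, "correct": statements the reviewer confirmed}."""
    totals = defaultdict(lambda: [0, 0])
    for r in reviews:
        totals[r["task"]][0] += r["correct"]
        totals[r["task"]][1] += r["claims"]
    return {task: round(100.0 * ok / n, 1)
            for task, (ok, n) in totals.items()}

reviews = [
    {"task": "paragraph-summary", "claims": 20, "correct": 18},
    {"task": "paragraph-summary", "claims": 10, "correct": 9},
    {"task": "data-flow-trace",   "claims": 25, "correct": 20},
]
print(accuracy_by_task(reviews))
# → {'paragraph-summary': 90.0, 'data-flow-trace': 80.0}
```

Comparing your own numbers against the benchmark table tells you whether your prompts, copybook inclusion, and chunking are pulling their weight.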


35.3 Documentation Generation: Automating the Undocumented

If code comprehension is the most immediately valuable AI application, documentation generation is the most immediately scalable. Most mainframe shops have thousands of programs with minimal or outdated documentation. AI can generate draft documentation at a pace no human team can match.

Program-Level Documentation

A well-structured prompt can generate comprehensive program documentation:

Generate documentation for this COBOL program in the following format:

1. PROGRAM OVERVIEW: One paragraph describing the program's purpose
2. INPUT/OUTPUT: List all files, databases, and queues accessed
3. PROCESSING LOGIC: Numbered steps describing the main flow
4. BUSINESS RULES: List all business rules implemented
5. ERROR HANDLING: How errors are detected and reported
6. DEPENDENCIES: Other programs, copybooks, and system resources
7. DATA STORES: Tables, files, and queues with access patterns

[Full COBOL source]

The output will be a draft — not a finished document. But it's a draft that captures 80-90% of what a reader needs to know, and editing a draft is dramatically faster than writing from scratch.
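One cheap quality gate before human review is a structural check that the draft actually contains all seven sections the prompt asked for. A minimal Python sketch:

```python
# The seven headings from the documentation prompt above.
REQUIRED_SECTIONS = [
    "PROGRAM OVERVIEW", "INPUT/OUTPUT", "PROCESSING LOGIC",
    "BUSINESS RULES", "ERROR HANDLING", "DEPENDENCIES", "DATA STORES",
]

def missing_sections(draft: str) -> list:
    """Return required headings absent from a generated draft, so
    structural gaps surface before anyone reads for accuracy."""
    upper = draft.upper()
    return [s for s in REQUIRED_SECTIONS if s not in upper]
```

A draft that fails this check goes straight back for regeneration; only structurally complete drafts consume expert review time.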

Copybook Annotation

Copybooks are where business meaning lives in COBOL systems, and they're almost never documented adequately. AI excels at annotating copybooks because the field names, sizes, and relationships provide rich context:

      * AI-GENERATED DOCUMENTATION — REVIEW REQUIRED
      * Copybook: ACCTMSTR - Account Master Record
      * Used by: DPST1000, WDRL2000, STMT3000, ACCT4000
      * DB2 Table: BANKDB.ACCOUNT_MASTER
      *
       01  ACCT-MASTER-REC.
      *    Primary key - 10-digit account number
           05  ACCT-NUMBER           PIC 9(10).
      *    Account classification
      *    Values: C=Checking, S=Savings, M=Money Market,
      *            T=Certificate, L=Loan
           05  ACCT-TYPE-CD          PIC X(1).
               88  ACCT-IS-CHECKING      VALUE 'C'.
               88  ACCT-IS-SAVINGS       VALUE 'S'.
               88  ACCT-IS-MONEY-MKT     VALUE 'M'.
               88  ACCT-IS-CERT          VALUE 'T'.
               88  ACCT-IS-LOAN          VALUE 'L'.
      *    Current account status
      *    Values: A=Active, D=Dormant, C=Closed, F=Frozen
           05  ACCT-STATUS-CD        PIC X(1).
               88  ACCT-IS-ACTIVE        VALUE 'A'.
               88  ACCT-IS-DORMANT       VALUE 'D'.
               88  ACCT-IS-CLOSED        VALUE 'C'.
               88  ACCT-IS-FROZEN        VALUE 'F'.
      *    Balance fields - all COMP-3 for efficient storage
      *    Current available balance (may differ from ledger
      *    due to pending transactions and holds)
           05  ACCT-AVAIL-BAL        PIC S9(13)V99 COMP-3.
      *    Ledger balance - official balance of record
           05  ACCT-LEDGER-BAL       PIC S9(13)V99 COMP-3.

This kind of annotation is enormously valuable for developers who didn't grow up with the system. The AI infers the meaning from naming conventions, 88-levels, and field characteristics — then a reviewer confirms or corrects.

Batch Job Documentation

JCL is notoriously opaque, and AI can help translate it into understandable documentation:

Given the following JCL, produce a plain-English description of what
this batch job does, what datasets it uses, what programs it calls in
what order, and what the restart/recovery strategy is.

At CNB, Rob Chen used this approach to document their end-of-day batch cycle — 47 JCL procedures with complex dependencies. The AI-generated documentation became the foundation for their disaster recovery runbook, after Rob spent a week correcting the roughly 15% of details the AI got wrong (mostly around GDG generation management and conditional step execution).

The Documentation Pipeline

For large-scale documentation projects, you need a systematic pipeline, not ad-hoc prompting. Here's the pipeline that Sandra's team at Federal Benefits developed:

Stage 1: Inventory. Catalog all programs, copybooks, and JCL. Identify which programs have existing documentation and which don't. Prioritize by business criticality and knowledge risk (how close is the subject matter expert to retirement?).

Stage 2: Context Gathering. For each program, collect the source code, all referenced copybooks, the JCL that executes it, and any existing documentation however fragmentary.

Stage 3: AI Generation. Run each program through the documentation prompt templates (see code/example-01). Generate program summaries, copybook annotations, data flow diagrams (as text), and interface documentation.

Stage 4: Expert Review. Route each generated document to the subject matter expert for that program. The expert's job is to correct errors, add business context the AI missed, and flag anything dangerous. Track corrections for feedback into prompt refinement.

Stage 5: Publication. Integrate the reviewed documentation into the team's documentation system. Link programs to their documentation. Establish a process for keeping documentation updated as code changes.

Sandra reports that this pipeline produces reviewed, published documentation at approximately five times the rate of purely manual documentation. The AI doesn't replace the expert — it gives the expert a solid draft to work from instead of a blank page.

Cross-Reference Documentation

One of the most valuable but underutilized applications of AI documentation is cross-reference generation. In a system with hundreds of programs and copybooks, knowing "which programs use this copybook" or "which copybooks does this program include" is essential for impact analysis.

AI can generate and maintain these cross-references by analyzing COPY statements, CALL statements, and file access patterns across the entire codebase:

Analyze all COBOL source programs in the following list and produce:
1. A copybook usage matrix: for each copybook, list every program
   that includes it (directly or via nested COPY)
2. A program call graph: for each program, list every program it
   CALLs (static and dynamic where determinable)
3. A file usage inventory: for each file (identified by DD name or
   SELECT clause), list every program that reads or writes it
4. A DB2 object reference map: for each DB2 table or view, list
   every program that accesses it and the type of access (SELECT,
   INSERT, UPDATE, DELETE)

This cross-reference documentation is the foundation for impact analysis. When someone proposes changing a copybook field, the cross-reference tells you immediately which programs need recompilation and testing. When a DB2 table needs restructuring, you know exactly which programs are affected.

At CNB, Kwame's team generated these cross-references for their entire 2,147-program portfolio. The cross-reference revealed surprising dependencies — programs that appeared unrelated actually shared copybooks through deeply nested COPY chains. This discovery alone justified the documentation effort, because it identified coupling that had been invisible for decades.
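The copybook half of this cross-reference can be bootstrapped mechanically, including the nested COPY chains that surprised CNB. A Python sketch (a regex heuristic; it ignores COPY ... REPLACING text substitution and library qualification):

```python
import re

def copybook_usage(sources: dict) -> dict:
    """Copybook usage matrix: copybook name -> set of members that
    include it, directly or through nested COPY chains. `sources`
    maps member name -> source text for programs and copybooks."""
    direct = {name: set(re.findall(r"\bCOPY\s+([0-9A-Z][0-9A-Z-]*)", text))
              for name, text in sources.items()}

    def expand(name, seen):
        # Follow nested COPY chains, guarding against cycles.
        found = set()
        for cb in direct.get(name, ()):
            if cb not in seen:
                found.add(cb)
                found |= expand(cb, seen | {cb})
        return found

    usage = {}
    for member in sources:
        for cb in expand(member, {member}):
            usage.setdefault(cb, set()).add(member)
    return usage
```

The transitive expansion is what reveals the hidden coupling: a program two COPY levels removed from a copybook still appears in that copybook's usage set.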

Documentation Maintenance: The Continuous Challenge

The hardest part of documentation is not generating it — it's keeping it current. Documentation that was accurate six months ago but doesn't reflect recent changes is worse than no documentation, because it creates false confidence.

AI-assisted documentation has a natural advantage here: it can be regenerated cheaply. When a program changes, the documentation pipeline can re-run the AI analysis on the changed program, produce an updated draft, and flag the sections that differ from the previous version for targeted review.

The workflow for documentation maintenance is:

  1. Trigger: A program is modified and committed to the source repository
  2. Detect: The CI/CD pipeline identifies which programs changed (Chapter 36)
  3. Regenerate: The documentation pipeline runs the AI analysis on changed programs
  4. Diff: The new documentation is compared to the existing documentation
  5. Flag: Changed sections are highlighted for review
  6. Review: The developer who made the code change reviews the documentation changes (they're the most qualified reviewer because they know what changed and why)
  7. Publish: The reviewed documentation replaces the previous version

This turns documentation from a project (big effort, done once, decays immediately) into a process (small effort per change, always current). It's the only sustainable approach for large COBOL systems.
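Steps 4 and 5 of that workflow reduce to an ordinary text diff. A minimal Python sketch using the standard library:

```python
import difflib

def flag_changed_sections(published: str, regenerated: str) -> list:
    """Diff the regenerated draft against the published documentation
    and return unified-diff lines for the reviewer, so only changed
    material needs attention. Empty list means nothing changed."""
    return list(difflib.unified_diff(
        published.splitlines(), regenerated.splitlines(),
        fromfile="published", tofile="regenerated", lineterm=""))
```

Wiring this into the pipeline means an unchanged program produces an empty diff and no review work at all, which is what makes per-change documentation maintenance cheap enough to sustain.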


35.4 Assisted Refactoring: AI as a Cautious Partner

Refactoring production COBOL is the most dangerous application of AI assistance, and the one that requires the most rigorous human oversight. The difference between AI-assisted comprehension (read-only) and AI-assisted refactoring (write) is the difference between reading a map and performing surgery. Both require skill, but only one can kill the patient.

Dead Code Detection

The safest refactoring application is dead code detection — identifying code that can never execute. AI is quite good at this because it involves control flow analysis rather than semantic understanding:

Analyze the following COBOL program and identify:
1. Paragraphs that are never PERFORMed or called
2. Conditional branches that can never be true given the data definitions
3. Variables that are defined but never referenced
4. Code after unconditional STOP RUN or GOBACK statements

At Pinnacle, Diane's team ran dead code detection on their 200 largest programs and found an average of 12% dead code per program. The highest was a 15,000-line claims processing program where 34% of the code was dead — remnants of Y2K remediation, abandoned features, and workarounds for bugs that had been fixed differently. Removing that dead code didn't change a single byte of the executable but made the remaining code vastly easier to understand.

Critical caution: Always verify dead code detection results against the complete calling chain. A paragraph might appear unused in one program but be dynamically CALLed from another via a computed entry point. The AI cannot see these cross-program relationships unless you provide them.
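Within a single program, the "never PERFORMed" check is scriptable and makes a useful cross-check on the AI's findings. A Python sketch (assumes fixed-format source with labels starting in Area A; the output is candidates only, for exactly the cross-program reasons just described):

```python
import re

def unperformed_paragraphs(source: str) -> list:
    """Paragraphs defined but never named in a PERFORM, GO TO, or
    THRU within this program. Assumes labels start in Area A
    (columns 8-11) and statements sit in Area B. Candidates only:
    dynamic CALLs from other programs, ALTER, and paragraphs reached
    by falling inside a PERFORM ... THRU range are invisible here."""
    labels = re.findall(r"^ {0,7}([0-9A-Z][0-9A-Z-]*)\s*\.",
                        source, re.MULTILINE)
    referenced = set(re.findall(
        r"\b(?:PERFORM|GO\s+TO|THRU)\s+([0-9A-Z][0-9A-Z-]*)", source))
    # The first label is the entry paragraph; it needs no PERFORM.
    return [p for p in labels[1:] if p not in referenced]

sample = (
    "       0000-MAIN.\n"
    "           PERFORM 1000-INIT\n"
    "           GO TO 9000-END.\n"
    "       1000-INIT.\n"
    "           MOVE ZERO TO WS-COUNT.\n"
    "       2000-ORPHAN.\n"
    "           MOVE 1 TO WS-COUNT.\n"
    "       9000-END.\n"
    "           GOBACK.\n"
)
print(unperformed_paragraphs(sample))
# → ['2000-ORPHAN']
```

Where the script and the AI disagree, that disagreement itself is worth investigating before anything is deleted.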

Naming Improvement

COBOL programs written in the 1980s and 1990s often have cryptic field names driven by the constraints of that era — WS-FLD01, WS-FLD02, WS-AMT-X, WS-SAVE-1. AI can suggest more meaningful names:

The following COBOL program uses several poorly-named fields. Based on
how each field is used in the program logic, suggest more descriptive
names. Maintain COBOL naming conventions (hyphens, no underscores,
30-character limit). For each suggestion, explain your reasoning.

Current name: WS-FLD01
Suggested: WS-DEPOSIT-VALID-FLAG
Reasoning: This field is set to 'Y' in the validation paragraph when
all deposit checks pass, and tested in the processing paragraph to
determine if the deposit should be applied.

This is a high-value, moderate-risk activity. The names need to be correct (wrong names are worse than cryptic names), and the rename must be applied consistently across every program and copybook that references the field. AI suggests; humans verify and apply.
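When you do apply an accepted rename, boundaries matter: COBOL names contain hyphens and digits, so a naive search-and-replace can corrupt longer names that merely start with the old one. A Python sketch of a safe rename pass (illustrative tooling, not a vendor product):

```python
import re

def apply_renames(source: str, renames: dict) -> str:
    """Apply accepted rename suggestions to one source member.
    Plain \\b word boundaries over-match COBOL names, so lookarounds
    treat letters, digits, and hyphens as name characters: WS-FLD01
    never rewrites part of WS-FLD011 or WS-FLD01-SAVE."""
    for old, new in renames.items():
        pattern = re.compile(
            rf"(?<![0-9A-Za-z-]){re.escape(old)}(?![0-9A-Za-z-])")
        source = pattern.sub(new, source)
    return source

before = ("           MOVE 'Y' TO WS-FLD01\n"
          "           IF WS-FLD011 = 'Y'\n"
          "               MOVE WS-FLD01 TO WS-FLD01-SAVE")
print(apply_renames(before, {"WS-FLD01": "WS-DEPOSIT-VALID-FLAG"}))
```

In fixed-format source, a longer name can also push a line past column 72, so reflow and recompile every affected member after the rename.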

Structure Modernization

AI can suggest structural improvements to make COBOL programs more maintainable:

GOTO Elimination: Identify GOTO-based control flow and suggest equivalent structured constructs using EVALUATE, PERFORM, and IF/ELSE.

Paragraph Extraction: Identify sections of code within a paragraph that perform a distinct function and could be extracted into their own paragraph for clarity.

EVALUATE Conversion: Convert nested IF chains to EVALUATE statements where appropriate.

Inline PERFORM Conversion: Convert out-of-line PERFORMs of small paragraphs to inline PERFORMs where it improves readability.

For each suggestion, the AI should generate both the original and proposed code so the developer can see exactly what changes. Here's an example of GOTO elimination:

Original:

       3000-PROCESS-PAYMENT.
           IF WS-PAY-AMT < ZERO
               GO TO 3000-EXIT
           END-IF
           IF WS-ACCT-FROZEN
               MOVE 'FROZEN' TO WS-REJECT-REASON
               GO TO 3000-REJECT
           END-IF
           IF WS-PAY-AMT > WS-DAILY-LIMIT
               MOVE 'OVERLIMIT' TO WS-REJECT-REASON
               GO TO 3000-REJECT
           END-IF
           PERFORM 3100-APPLY-PAYMENT
           GO TO 3000-EXIT.
       3000-REJECT.
           PERFORM 3200-LOG-REJECTION.
       3000-EXIT.
           EXIT.

AI-Suggested Restructure:

       3000-PROCESS-PAYMENT.
           EVALUATE TRUE
               WHEN WS-PAY-AMT < ZERO
                   CONTINUE
               WHEN WS-ACCT-FROZEN
                   MOVE 'FROZEN' TO WS-REJECT-REASON
                   PERFORM 3200-LOG-REJECTION
               WHEN WS-PAY-AMT > WS-DAILY-LIMIT
                   MOVE 'OVERLIMIT' TO WS-REJECT-REASON
                   PERFORM 3200-LOG-REJECTION
               WHEN OTHER
                   PERFORM 3100-APPLY-PAYMENT
           END-EVALUATE.

This is cleaner, but look closely at the semantics. In the original, the negative-amount case skips both processing and rejection logging; in the restructured version, the CONTINUE preserves that behavior. A less careful AI (or a less careful prompt) might have lumped the negative case in with the rejections and silently changed the program's behavior. This is why human review is not optional.

Code Consolidation

Another valuable refactoring pattern is code consolidation — identifying duplicated logic across multiple programs and extracting it into shared subroutines. AI excels at pattern recognition across codebases:

Analyze the following five COBOL programs and identify any paragraphs
or code sections that implement substantially similar logic. For each
group of similar code sections, describe:
1. The programs and paragraphs where the pattern appears
2. What the common logic does
3. How the implementations differ (if at all)
4. Whether the common logic could be extracted into a shared
   subroutine, and what the interface (parameters) would be

At Federal Benefits, Sandra's team used this approach to discover that account validation logic — checking whether an account is active, not frozen, not dormant — was implemented in seventeen different programs with eleven slightly different variations. Some programs checked for all states; some checked for only a subset. Some programs treated frozen accounts as errors; others allowed frozen accounts with override authorization. The AI identified the pattern; Sandra's team then worked with the business analysts to determine which variation was correct, and consolidated the seventeen implementations into a single shared validation subroutine with a well-defined interface.

The consolidation reduced the total codebase by 2,400 lines and, more importantly, ensured that validation was consistent across all programs. When the next regulatory change required a new account status check, it had to be implemented in one place rather than seventeen.

The Refactoring Review Checklist

Every AI-suggested refactoring must pass this checklist before implementation:

  1. Semantic Equivalence: Does the refactored code produce exactly the same outputs for all possible inputs? Not "probably the same" — exactly the same.
  2. COMP-3/COMP Impact: Do any data movements change? COBOL data movement rules are context-sensitive. Moving a COMP-3 to DISPLAY versus moving it to another COMP-3 are different operations.
  3. Condition Code Preservation: Does the refactored code set all the same condition codes and return codes?
  4. Side Effect Preservation: Does the refactored code perform all the same file I/O, database operations, and CALL statements in the same order?
  5. Performance Impact: Does the refactoring change the computational characteristics? Restructuring a loop can change performance dramatically in batch.
  6. Abend Behavior: If the original code abends under certain conditions, does the refactored code abend in the same way? Sometimes abends are expected and handled by upstream processes.

35.5 AI for Testing: Generating What We Should Have Had All Along

One of the painful realities of legacy COBOL systems is the absence of automated tests. Programs that process billions of dollars in transactions often have no formal test suites — they were tested manually when written in 1987 and have been "tested in production" ever since. AI can help close this gap.

Test Case Generation

Given a COBOL paragraph, an AI can identify the test cases that should exist:

Analyze the following COBOL paragraph and generate a comprehensive
set of test cases. For each test case, specify:
1. Test case ID and description
2. Input values for all referenced fields
3. Expected output values
4. The specific logic branch being tested
5. Whether this is a positive test, negative test, or boundary test

Be sure to include:
- All branches of every IF and EVALUATE
- Boundary values for numeric fields (zero, maximum, minimum, off-by-one)
- Empty/space values for alphanumeric fields
- Tests for COBOL-specific behaviors (truncation, sign handling)

For the 2100-VALIDATE-ACCT-STATUS paragraph shown earlier, the AI would generate test cases for: active account with no hold, active account with hold and override authorized, active account with hold and no override, dormant account, closed account, and invalid status code. It should also generate boundary cases: status code of space, status code of LOW-VALUES, and hold code with leading/trailing spaces.
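One way to make such specifications machine-consumable is to capture each case as a structured record. The layout below is a hypothetical Python sketch, not a standard format; its fields simply mirror the five items the prompt requires:

```python
from dataclasses import dataclass

# Illustrative test case record; field names are assumptions, not a standard.
@dataclass
class TestCaseSpec:
    case_id: str
    description: str
    inputs: dict       # COBOL field name -> input value
    expected: dict     # COBOL field name -> expected output value
    branch: str        # which logic branch this exercises
    kind: str          # "positive", "negative", or "boundary"

tc = TestCaseSpec(
    case_id="VAL-003",
    description="Active account with hold and no override",
    inputs={"ACCT-STATUS": "A", "HOLD-CODE": "H", "OVERRIDE-FLAG": "N"},
    expected={"WS-VALID-FLAG": "N", "WS-REJECT-REASON": "HOLD"},
    branch="hold present, override absent",
    kind="negative",
)
```

Records in this shape are exactly what a test harness loader (described later in this chapter) can read and execute.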

Test Data Creation

Generating test data for COBOL programs is laborious because the data must conform to specific copybook layouts, COMP-3 encoding, and packed decimal formats. AI can help generate test data specifications:

Given the following copybook, generate 20 test records that cover a
representative range of scenarios. Include valid records, records with
boundary values, and records with various error conditions. Output as
a table showing field name, value, and purpose of each test record.

The AI generates the logical data. A utility program or script then converts the logical values into the physical format required by the COBOL program. This separation is important — you don't want the AI generating raw hex values for COMP-3 fields, because that's where errors creep in.
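The conversion step can be sketched to show why it belongs in a deterministic utility rather than in the AI's output. Assuming a sign-trailing C/D packed layout (the F "unsigned positive" nibble is deliberately not handled, and values are passed as scaled integers — a PIC S9(5)V99 value of 123.45 arrives as 12345 because the decimal point is implied):

```python
def pack_comp3(value: int, digits: int) -> bytes:
    """Pack a scaled integer into COMP-3 bytes for a PIC S9(digits) field.
    Raises on overflow for safety; note real COBOL would silently truncate
    high-order digits instead."""
    sign = 0xD if value < 0 else 0xC
    s = str(abs(value)).rjust(digits, "0")
    if len(s) > digits:
        raise ValueError("value does not fit in PIC S9(%d)" % digits)
    nibbles = [int(c) for c in s] + [sign]      # digits then trailing sign nibble
    if len(nibbles) % 2:                        # pad to a whole number of bytes
        nibbles.insert(0, 0)
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

For example, 12345 in a PIC S9(5) field packs to the three bytes X'12345C'.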

Regression Test Suites

For the HA banking system, the most valuable testing application is generating regression test suites that capture the current behavior of the system. These tests don't verify that the system is correct — they verify that the system still behaves the same way after changes.

The approach works like this:

  1. AI analyzes each paragraph and generates test scenarios
  2. Human experts review and supplement the scenarios with domain knowledge
  3. The test scenarios are implemented using a COBOL testing framework (IBM zUnit, Micro Focus Unit Testing Framework, or a homegrown harness)
  4. The tests are run against the current production code to establish baselines
  5. Future changes must pass all existing tests plus any new tests specific to the change

At SecureFirst, Carlos's team generated regression tests for their insurance claims processing system using this approach. They started with 4,200 AI-generated test cases, which expert review reduced to 3,100 (removing duplicates, correcting errors, and adding 200 cases the AI missed). Those 3,100 tests now run automatically every night, catching three regression bugs in the first month that would have reached production under the old manual testing process.

The three bugs caught were illuminating. The first was a classic COBOL truncation error — a field overflow that occurred only when the claim amount exceeded $99,999.99 (the field was PIC S9(5)V99). The second was a leap year date calculation error that would have manifested only on February 29. The third was a race condition in a CICS transaction that occurred only when two users updated the same claim within the same CICS syncpoint interval. None of these would have been caught by the typical manual testing approach of "enter a few transactions and check the screen."

The economics of AI-generated regression tests are compelling. The initial generation and review took three developer-weeks. The nightly automated execution costs approximately 15 CPU-minutes on the test LPAR. In the first month alone, the three caught bugs would have cost an estimated $180,000 in production remediation, customer communication, and regulatory reporting. The test suite paid for itself before the first month was over.

Test Oracle Problem

The fundamental challenge of AI-generated tests is the oracle problem: how do you know the expected result is correct? If the AI misunderstands the code, it may generate tests that verify the wrong behavior. This is especially treacherous with COBOL's numeric handling:

       COMPUTE WS-RESULT = WS-AMT-A / WS-AMT-B

If WS-RESULT is PIC S9(7)V99 and the actual quotient is 12345.6789, COBOL stores 12345.67 (truncated, not rounded, unless ROUNDED is specified). An AI might generate a test expecting 12345.68, and that expectation would survive review because the rounded value 'looks right.' The test then fails against correct code, eroding trust in the suite, or worse, passes against code that rounds when it should truncate. This kind of subtle error is exactly what makes AI-assisted testing powerful and dangerous simultaneously.
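The truncate-versus-round distinction is easy to demonstrate. Below is a small Python model of COMPUTE's final-result scaling (a simplification that ignores COBOL's intermediate-result rules):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def cobol_compute_divide(a, b, decimals, rounded=False):
    """Model COMPUTE a / b into a field with `decimals` decimal places.
    Without ROUNDED, COBOL truncates toward zero (ROUND_DOWN); with ROUNDED
    it rounds half away from zero (which ROUND_HALF_UP also does)."""
    quantum = Decimal(1).scaleb(-decimals)       # e.g. 0.01 for a V99 field
    mode = ROUND_HALF_UP if rounded else ROUND_DOWN
    return (Decimal(a) / Decimal(b)).quantize(quantum, rounding=mode)

# The quotient 12345.6789 from the text:
assert cobol_compute_divide("123456.789", "10", 2) == Decimal("12345.67")
assert cobol_compute_divide("123456.789", "10", 2, rounded=True) == Decimal("12345.68")
```

A test oracle built on the truncating form matches the original COMPUTE; one built on the rounding form quietly verifies behavior the program does not have.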

Building a Test Harness Strategy

For the HA banking system, a systematic approach to AI-assisted test generation requires a test harness infrastructure. The test harness is the framework that executes test cases against COBOL paragraphs in isolation. Without a harness, AI-generated test case specifications are theoretical — they describe what to test but can't actually test it.

The test harness architecture has four components:

Component 1: Test Case Loader. A COBOL program (or REXX exec) that reads test case definitions from a file or dataset. Each test case definition specifies the target program, the target paragraph, the input field values, and the expected output values.

Component 2: Environment Setup. Before each test case executes, the harness sets up WORKING-STORAGE fields, LINKAGE SECTION parameters, and any mocked external resources (files, DB2 cursors, CICS resources) to the values specified in the test case.

Component 3: Executor. The harness PERFORMs the target paragraph and captures the state of all output fields after execution. For paragraphs that perform I/O, the harness intercepts the I/O operations and records what would have been read or written.

Component 4: Comparator. The harness compares actual output to expected output and reports pass/fail for each field. For numeric fields, the comparison must account for COBOL's truncation and sign handling — the comparator must understand COMP-3, COMP, and DISPLAY field semantics.

Building this harness is a significant upfront investment, but once it exists, adding new test cases is simply a matter of writing test case definition records — which is exactly what the AI generates. The harness converts AI-generated test specifications into executable, repeatable automated tests.
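The comparator's core logic is worth seeing even though the real harness runs on z/OS in COBOL or REXX. Sketched in Python purely for illustration (the dict representation of fields is an assumption), the essential point is that numeric fields compare by value while alphanumeric fields compare exactly, trailing spaces included:

```python
from decimal import Decimal

def compare_fields(expected, actual, numeric_fields):
    """Return a list of (field, expected, actual) mismatches.
    Numeric fields compare by value, so '100.10' equals '100.1';
    alphanumeric fields compare character-for-character."""
    mismatches = []
    for name, exp in expected.items():
        act = actual.get(name)
        if name in numeric_fields:
            equal = Decimal(str(exp)) == Decimal(str(act))
        else:
            equal = str(exp) == str(act)
        if not equal:
            mismatches.append((name, exp, act))
    return mismatches
```

A comparator that naively compared display strings would flag '100.10' versus '100.1' as a failure, burying real regressions under false ones.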

Rob Chen at CNB spent three weeks building the harness for their system. It now runs 3,400 test cases in under four minutes. "The harness was the boring part," Rob says. "Getting the AI to generate good test cases was the interesting part. But without the harness, the test cases would just be documents in a drawer."


35.6 Limitations and Risks: What the Sales Brochure Won't Tell You

This section is the most important in the chapter. If you take nothing else away, take this: AI tools for COBOL are genuinely useful, but they can produce confident, plausible, completely wrong output. The consequences in mainframe systems are measured in dollars, regulatory fines, and careers.

Hallucination in the COBOL Context

LLMs hallucinate. They generate plausible-sounding text that is factually incorrect. In the COBOL context, hallucination takes specific forms:

Invented syntax: The AI may generate COBOL statements that look reasonable but don't exist. Example: INSPECT WS-FIELD CONVERTING SPACES TO ZEROS WITH POINTER WS-PTR. The WITH POINTER clause is not valid on INSPECT CONVERTING. The program won't compile, so this is caught immediately — but it wastes developer time.

Plausible but wrong logic: More dangerous is when the AI generates code that compiles and runs but implements the wrong logic. Example: the AI might implement date arithmetic by adding directly to a numeric YYYYMMDD field, which works for most dates within a month but fails at month-end and leap-year boundaries. The code compiles, runs, and passes most tests.

Phantom features: The AI may reference COBOL features that exist in one dialect but not another. Enterprise COBOL for z/OS, Micro Focus COBOL, and GnuCOBOL all have different extensions. The AI may suggest a Micro Focus extension when you're compiling with Enterprise COBOL.

Wrong version features: Even within Enterprise COBOL, features vary by version. JSON GENERATE was introduced in V6.1. The AI may suggest it for a V5.1 shop.

Subtle Bugs: The COBOL-Specific Gotchas

These are the bugs that make veteran COBOL developers break out in cold sweats:

COMP-3 sign handling: AI-generated code may not correctly handle the sign nibble in COMP-3 fields. A positive COMP-3 value has a sign nibble of C or F; negative has D. Some operations distinguish between C and F, others don't. The AI almost certainly doesn't understand this distinction.

Reference modification off-by-one: COBOL reference modification is 1-based. WS-FIELD(1:3) starts at position 1. AI trained primarily on 0-based languages may generate code or test cases that assume 0-based indexing.
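A two-line helper makes the off-by-one concrete. In Python terms (0-based slices), COBOL's 1-based start-plus-length form translates as:

```python
def refmod(value: str, start: int, length: int) -> str:
    """COBOL reference modification value(start:length), 1-based with a length,
    unlike Python's 0-based start:end slicing."""
    if start < 1 or length < 1 or start + length - 1 > len(value):
        raise ValueError("reference modification out of range")
    return value[start - 1 : start - 1 + length]

# WS-FIELD(1:3) takes the FIRST three characters:
assert refmod("ABCDEF", 1, 3) == "ABC"   # a 0-based habit would expect "BCD"
```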

MOVE semantics variance: Moving an alphanumeric to a numeric field invokes de-editing rules. Moving between COMP and COMP-3 involves conversion. Moving a longer field to a shorter one truncates differently depending on the field types. AI rarely accounts for all these rules correctly.

EVALUATE with ALSO: The ALSO clause in EVALUATE creates a truth table that must be understood as a matrix, not a sequence. AI frequently gets multi-ALSO conditions wrong.
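The matrix view can be modeled directly. In this Python sketch (where ANY stands in for COBOL's WHEN ... ANY), the first row whose entries all match the subjects wins — the rule that careless rewrites violate by treating the WHEN clauses as independent sequential tests:

```python
ANY = object()   # sentinel standing in for WHEN ... ANY

def evaluate_also(subjects, rows):
    """Model EVALUATE subj-1 ALSO subj-2 ...
    rows: list of (when_values, action); the FIRST row where every when_value
    matches (or is ANY) determines the action. Later matching rows never fire."""
    for when_values, action in rows:
        if all(w is ANY or w == s for w, s in zip(when_values, subjects)):
            return action
    return None  # corresponds to falling through to WHEN OTHER
```

With subjects (True, False), the row (True, True) does not match, so control falls to (True, ANY) even though both rows "partially" match — exactly the truth-table behavior the text describes.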

Scope terminator ambiguity: In legacy code without explicit scope terminators (END-IF, END-EVALUATE), the AI may misparse which IF an ELSE belongs to. The COBOL compiler uses the "match with nearest" rule, but AI trained on indented code may follow the indentation rather than the rule.

Security Concerns

Using AI with COBOL source code raises legitimate security concerns:

Data exposure: Sending production COBOL source to a cloud-based AI service may violate data security policies. COBOL programs often contain embedded business rules that are trade secrets, account number formats that are sensitive, and sometimes even hardcoded credentials (yes, they still exist in legacy code).

Supply chain risk: If you use AI-generated code in production, you're introducing code from an uncontrolled source. The AI may generate code patterns with known vulnerabilities, or code that doesn't meet your shop's security standards. Unlike a human developer who can be trained on your security policies and held accountable, the AI has no concept of your security requirements unless you explicitly include them in the prompt. And even then, there's no guarantee the AI will follow them consistently.

Intellectual property risk: Depending on the AI service's terms of use, code you send to the AI may be used for model training. This could theoretically mean that your proprietary business logic becomes part of the model's training data and could influence outputs for other users. For highly sensitive systems, this risk may be unacceptable, which is why on-premises or private-cloud AI deployments (like IBM watsonx on a private cloud) are preferred by financial institutions.

Audit trail: Regulators in financial services require knowing who wrote and reviewed every line of production code. "The AI wrote it" is not an acceptable audit trail. You need to track which code was AI-generated, who reviewed it, what changes were made during review, and who approved the final version.

At CNB, Kwame's team implemented a policy requiring that all AI-generated code carry a comment block identifying it as AI-generated, the model and version used, the date of generation, and the reviewer who approved it:

      *=============================================================*
      * AI-GENERATED CODE BLOCK
      * Model: IBM watsonx Code Assistant for Z v2.1
      * Generated: 2026-01-15
      * Reviewed by: Lisa Park (LP1234)
      * Review date: 2026-01-17
      * Approval: Kwame Williams (KW5678) - 2026-01-18
      *=============================================================*

COBOL-Specific Gotchas That Trip Up AI

Beyond the general limitations, COBOL has specific characteristics that AI tools consistently struggle with:

The SECTION problem. Many legacy programs use SECTIONs rather than paragraphs for flow control. A PERFORM of a SECTION executes all paragraphs within it until the next SECTION is reached. AI tools sometimes miss this, analyzing individual paragraphs as if they were independent when they're actually part of a SECTION flow.

The COPY/REPLACING problem. Copybooks included with REPLACING clauses effectively create code that doesn't exist in any source file. The AI sees the base copybook, but the actual compiled code uses the replaced text. This is a significant source of comprehension errors.

The implicit period problem. In legacy COBOL (pre-COBOL-85 structured programming), a period terminates all open scopes. Moving the period to the wrong place changes the logic completely. AI tools that "clean up formatting" may inadvertently move a period, transforming the program's behavior.

The file status problem. Many COBOL programs check FILE STATUS after I/O operations but handle status values that are specific to the file system (VSAM, QSAM, DB2). The AI may know the standard file status values but not the extended codes specific to z/OS file systems.

Cost-Benefit Analysis: When AI Helps and When It Doesn't

Not every COBOL task benefits from AI assistance. Here is a realistic cost-benefit analysis based on experience across all four anchor organizations:

High ROI (use AI aggressively):
- Documentation generation for programs with no existing documentation (5-10x productivity gain)
- Copybook annotation for large data structures (3-5x gain)
- Dead code detection across large codebases (10x gain over manual analysis)
- Test case specification generation (3-5x gain, with review)
- Cross-reference and dependency documentation (10-20x gain)

Moderate ROI (use AI with caution):
- Code comprehension for moderately complex programs (2-3x gain, accuracy varies)
- Naming improvement suggestions (2x gain, high review overhead)
- Simple refactoring suggestions (GOTO elimination, EVALUATE conversion) (2x gain, high review overhead)
- JCL documentation (3x gain, but accuracy for complex JCL is lower)

Low or Negative ROI (avoid AI or use minimally):
- Writing new COBOL business logic (AI doesn't know your business rules)
- Performance optimization (AI doesn't understand z/OS resource characteristics)
- CICS transaction design (too many platform-specific considerations)
- DB2 query optimization (requires EXPLAIN output and catalog statistics AI doesn't have)
- Debugging production problems (requires runtime context AI can't access)
- Anything requiring understanding of cross-program batch chain dependencies

The general principle: AI excels at tasks that are labor-intensive but conceptually straightforward (reading code and writing descriptions). It struggles with tasks that require contextual knowledge it doesn't possess (system architecture, business rules, operational characteristics).

Understanding this distinction saves time and prevents the disillusionment that comes from expecting AI to perform tasks it's fundamentally unsuited for. A team that uses AI for the right tasks and human expertise for the others will dramatically outperform either a team that avoids AI entirely or a team that tries to use AI for everything.


35.7 The Human-AI Workflow: Trust but Verify

The goal is not to replace COBOL developers with AI. The goal is to make COBOL developers more productive by giving them AI-powered tools for the tedious parts of their work — comprehension, documentation, test generation — so they can focus their expertise on the parts that require human judgment: design, business rules, and production operations.

The Review Process

Every AI-generated artifact must go through a structured review process. At Federal Benefits, Sandra developed a three-tier review system:

Tier 1: Automated Validation. Before any human sees the AI output, automated checks verify:
- COBOL code compiles without errors
- Generated test cases run without abends
- Documentation references real program names and field names (no hallucinated artifacts)
- Code follows shop standards (naming conventions, comment format, indentation)
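The "no hallucinated artifacts" check is itself automatable. A hedged Python sketch (the identifier regex is a simplifying assumption, and it ignores COPY REPLACING, which can make legitimate names invisible in the base source):

```python
import re

# Matches COBOL-style hyphenated identifiers like WS-AMT or ACCT-AVAIL-BAL.
# Single unhyphenated words (MOVE, PERFORM) are deliberately excluded.
IDENT_RE = re.compile(r"\b[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+\b")

def hallucinated_identifiers(doc_text, source_text):
    """Return identifiers the documentation mentions that never appear in
    the source or its expanded copybooks."""
    known = set(IDENT_RE.findall(source_text.upper()))
    mentioned = set(IDENT_RE.findall(doc_text.upper()))
    return mentioned - known
```

A non-empty result does not prove hallucination (the name may live in an unprovided copybook), but it tells the Tier 2 reviewer exactly where to look first.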

Tier 2: Technical Review. A COBOL developer reviews the artifact for:
- Technical accuracy (does the code or documentation correctly describe what happens?)
- Completeness (are there cases, fields, or behaviors that the AI missed?)
- Safety (does any AI-generated code introduce risks not present in the original?)

Tier 3: Business Review. A business analyst or subject matter expert reviews for:
- Business accuracy (does the documentation correctly describe the business rules?)
- Regulatory compliance (does anything in the AI-generated artifact create compliance risk?)
- Domain correctness (are business terms used correctly?)

Not every artifact needs all three tiers. Documentation might skip Tier 1 (no code to compile). Simple copybook annotations might skip Tier 3 (no complex business rules). But refactored code always gets all three tiers plus a full regression test cycle.

The Validation Framework

For the HA banking system, we implement a specific validation framework for AI-assisted changes:

AI CHANGE VALIDATION CHECKLIST — HA BANKING SYSTEM

1. [ ] AI-generated artifact identified and tagged
2. [ ] Artifact compiled/validated by automated tools
3. [ ] Technical review completed by certified COBOL developer
4. [ ] All COBOL-specific gotchas checked:
   a. [ ] COMP-3/COMP handling verified
   b. [ ] REDEFINES correctly understood
   c. [ ] Scope terminators correct
   d. [ ] File status handling preserved
   e. [ ] COPY/REPLACING correctly interpreted
5. [ ] Business review completed (if applicable)
6. [ ] Regression test suite passed
7. [ ] Performance benchmark compared to baseline
8. [ ] Change management ticket updated
9. [ ] Audit trail documentation complete

Practical Integration: The Daily Workflow

Here is what AI-assisted COBOL development looks like in daily practice at a shop that has implemented these tools thoughtfully:

Morning: A developer inherits an unfamiliar program. Instead of spending half a day reading the code, they feed it to the AI and get a program summary, paragraph-by-paragraph annotations, and a data flow diagram in thirty minutes. They spend another thirty minutes reviewing the AI output, correcting two errors, and adding business context they got from a conversation with the business analyst. Total time: one hour instead of four.

Midday: The developer needs to add a new validation check. They describe the validation requirement to the AI and get three implementation options. They evaluate the options against their knowledge of the system, select one, modify it to match shop standards, and generate test cases for the new logic. The AI generates twelve test cases; the developer adds three more for edge cases specific to this system.

Afternoon: The developer reviews a colleague's code change. The AI scans the change and flags three potential issues: a missing scope terminator, a field that changed size without updating all references, and a paragraph that's now unreachable after the change. Two of the three flags are valid issues; the third is a false positive because the AI didn't know about a dynamic CALL. Still, catching those two issues saved a production bug.

End of day: The developer runs the documentation pipeline. Any programs modified today get their documentation regenerated, reviewed, and updated. The documentation stays current because the pipeline runs daily, not quarterly.

Building Trust Incrementally

If you're introducing AI tools to a team of veteran mainframe developers, expect skepticism. It's warranted. Here's how to build trust:

Start read-only. Use AI for comprehension and documentation only. No code generation, no refactoring. Let the team see that the AI can accurately describe their systems before asking them to trust it to modify those systems.

Track accuracy. Keep a scorecard of AI output accuracy. Every review should note errors found. Over time, this gives you an objective measure of how much to trust the tool and where its weaknesses lie.
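The scorecard need not be elaborate. A minimal sketch (task-type names and the in-memory structure are illustrative; a spreadsheet works just as well):

```python
from collections import defaultdict

class Scorecard:
    """Track AI output accuracy per task type across review cycles."""
    def __init__(self):
        self.reviewed = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, task_type, artifacts_reviewed, errors_found):
        self.reviewed[task_type] += artifacts_reviewed
        self.errors[task_type] += errors_found

    def accuracy(self, task_type):
        n = self.reviewed[task_type]
        return 1.0 - self.errors[task_type] / n if n else None

sc = Scorecard()
sc.record("documentation", 50, 4)   # 50 artifacts reviewed, 4 errors found
```

Over months, the per-task-type accuracy numbers show where the tool can be trusted with lighter review and where it cannot.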

Celebrate the catches. When a reviewer catches an AI error, that's a success for the process, not a failure of the tool. The review process working correctly is exactly the point.

Measure productivity. Track how long documentation takes with and without AI assistance. Track how many test cases are generated per week. Hard numbers overcome soft skepticism.

Respect the expertise. The 25-year veteran who knows the batch schedule by heart has knowledge that no AI possesses. AI tools augment that expertise; they don't obsolete it. The developer's job evolves from "writing all the code" to "directing and reviewing the AI while making the decisions that require human judgment." That's not a demotion — it's a force multiplier.

The Prompt Library: Your Team's AI Playbook

As your team gains experience with AI tools, you will develop prompt patterns that work well for your specific codebase, naming conventions, and system architecture. These refined prompts are a valuable team asset that should be managed like any other shared resource.

A prompt library is a versioned collection of prompt templates, organized by task type (comprehension, documentation, testing, refactoring), with usage notes and accuracy history. See code/example-01-ai-prompt-templates.md for the foundation templates; your team should customize and extend these.

The prompt library should include:

System context blocks pre-written for each major subsystem. Rather than writing the system context from scratch each time, the developer selects the context block for the subsystem they're working on. At CNB, the library includes context blocks for: core banking transactions, account management, loan processing, regulatory reporting, and batch settlement.

Glossary files that map your organization's business terminology to the data fields that implement them. When the AI knows that "available balance" maps to ACCT-AVAIL-BAL (PIC S9(13)V99 COMP-3) and "ledger balance" maps to ACCT-LEDGER-BAL, it produces more accurate documentation.

Anti-pattern catalogs documenting known AI failure modes for your codebase. If the AI consistently misinterprets your shop's 88-level naming convention, or always gets confused by your use of REDEFINES for print line formatting, document these patterns so developers know to check for them during review.

Accuracy tracking for each prompt template. After every use, the reviewer records the accuracy rate and any corrections needed. Over time, this data shows which templates work well and which need refinement. At SecureFirst, Yuki's team tracked accuracy weekly and refined their prompt templates monthly based on the data.

The prompt library lives in the same Git repository as the code (in a docs/ai-prompts/ directory) and follows the same change management process. When someone improves a prompt template, the improvement is available to the entire team.

Organizational Considerations

Implementing AI-assisted COBOL development is not just a technical initiative — it requires organizational support structures:

AI Champion Role. Designate one developer as the team's AI champion — the person who stays current on AI tool capabilities, maintains the prompt library, trains new team members, and tracks accuracy metrics. At Pinnacle, Ahmad took this role alongside his regular development work, dedicating roughly 20% of his time to AI tooling support.

Review Standards. Publish clear standards for what level of review is required for each type of AI-generated artifact. Not every artifact needs the full three-tier review. A copybook annotation might need only technical review, while AI-suggested refactoring needs technical review, business review, and full regression testing.

Budget for Review. AI-generated artifacts are not free — they require review time. Budget for this. A rough rule of thumb from our anchor organizations: for every hour the AI spends generating, plan for two hours of human review. The net productivity gain is still significant (three hours for a task that previously took ten), but it's not zero marginal cost.

Feedback to Vendors. If you're using a commercial AI tool, provide structured feedback to the vendor about accuracy, failure modes, and missing capabilities. The tools are improving rapidly, and vendor-specific feedback drives the improvements that matter to your use case.

Applying AI to the HA Banking System

For our progressive project, we'll apply AI tools in the following specific ways:

  1. Documentation: Generate program-level documentation for all HA banking system modules, then review and correct
  2. Copybook Annotation: Annotate all copybooks used by the system with field-level documentation
  3. Test Generation: Generate regression test suites for the transaction processing, balance update, and settlement modules
  4. Dead Code Detection: Identify and remove dead code across all programs
  5. Interface Documentation: Document the contracts between programs in the batch chain

These activities are detailed in the project checkpoint (code/project-checkpoint.md).


Chapter Summary

AI-assisted COBOL development is not a future possibility — it's a present reality. LLMs can comprehend COBOL code, generate documentation, suggest refactorings, and create test cases with impressive accuracy. But that accuracy is not 100%, and in mainframe systems, the gap between 90% and 100% accuracy can be measured in millions of dollars and regulatory sanctions.

The key principles for successful AI-assisted COBOL work are:

  1. Always provide context. The AI needs copybooks, system context, and execution context to produce useful output.
  2. Always review output. No AI-generated artifact should reach production without human review by someone who understands both the technology and the business.
  3. Start with read-only applications. Documentation and comprehension before refactoring and code generation.
  4. Track accuracy systematically. Know where your AI tools succeed and where they fail.
  5. Maintain audit trails. Every AI-generated artifact must be traceable.
  6. Respect the veteran. AI tools augment expert knowledge; they don't replace it.

The mainframe isn't going away, the developers who built it are retiring, and the business logic encoded in COBOL is irreplaceable. AI-assisted development is the bridge between the generation that wrote the code and the generation that will maintain it. Build that bridge carefully, with guard rails on both sides.

Looking ahead, AI capabilities for COBOL will continue to improve. Models will become better at understanding mainframe-specific concepts. Tools will become more deeply integrated with development environments. Accuracy will increase. But the fundamental principle will not change: these are tools that augment human judgment, not replace it. The developer who understands both the AI's capabilities and its limitations, who uses the right tool for the right task, and who maintains rigorous review practices — that developer will thrive in the AI-assisted mainframe world. The developer who blindly trusts AI output will eventually produce the kind of production failure that becomes a cautionary tale in someone else's textbook.

The choice between these two outcomes is not about the AI. It's about you.


Next chapter: We take the DevOps practices common in distributed systems and bring them to the mainframe, implementing Git-based source control, CI/CD pipelines, and automated testing on z/OS — the infrastructure that makes AI-assisted development possible at scale.