Case Study: Implementing Automated Testing for a Banking Platform

Background

Founders National Bank, a mid-tier commercial bank with $12 billion in assets, operated a COBOL-based batch processing suite that was the backbone of its nightly operations. Every evening, beginning at 6:00 PM and concluding by 5:30 AM, a sequence of 287 batch jobs executed in a carefully orchestrated chain that processed the day's transactions, calculated interest, posted fees, generated regulatory reports, produced customer statements, and synchronized data with downstream systems including the bank's ATM network, online banking platform, and general ledger.

The batch suite had been in production for over twenty-five years, growing incrementally as the bank added products and services. Testing of changes to batch programs was almost entirely manual. When a developer modified a program, the testing process followed a well-worn but labor-intensive path: a tester would create test data by hand-crafting records in a test VSAM file, run the modified job step in isolation, and then manually compare the output files against expected results using a file comparison utility and a spreadsheet of expected values.

This manual testing process had three critical problems. First, it was slow. The average test cycle for a single batch program change took five business days, and complex changes that affected multiple interdependent programs could take three to four weeks. Second, it was incomplete. Manual testers could realistically verify only a handful of test scenarios per program change, while the actual business logic often contained hundreds of conditional paths. Third, it was unreliable. Human comparison of output files, particularly those containing millions of records, was inherently error-prone. Defects that affected a small number of records could easily be missed.

The consequences were tangible. In the twelve months before the automation initiative, the bank experienced fourteen production defects in batch processing, of which eight were classified as severity one or severity two, meaning they affected financial accuracy or regulatory reporting. The average cost of a batch production defect, including emergency remediation, reprocessing, customer notification, and regulatory disclosure, was estimated at $87,000.

Karen Nakamura, the bank's VP of Quality Engineering, proposed and received approval for a project to implement automated testing for the batch processing suite. The project was budgeted at $1.2 million over eighteen months, with the goal of reducing the testing cycle from weeks to days and cutting the production defect rate by at least 50%.

Designing the Test Architecture

Karen assembled a team of two COBOL developers, one testing specialist, and one automation engineer. Their first task was to design a test architecture that could accommodate the unique characteristics of mainframe batch processing.

The architecture they developed had four components: a test data generator, a test execution framework, an output comparison engine, and a regression test repository.

The Test Data Generator

The most fundamental challenge was creating realistic test data at scale. Manual testers had typically worked with a few dozen hand-crafted records. Automated testing required datasets of thousands or tens of thousands of records that exercised every branch of the business logic.

The team built a COBOL-based test data generator that could produce records conforming to any VSAM or flat file layout. The generator was driven by configuration files that specified the field-level rules for generating data: value ranges, distributions, constraints, and cross-field dependencies.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. TDATAGEN.
      *================================================================*
      * TEST DATA GENERATOR                                             *
      * Generates test records based on configuration rules.            *
      * Produces realistic data with controlled coverage of             *
      * business rule branches.                                         *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-GENERATION-CONTROL.
           05  WS-TOTAL-RECORDS-TO-GEN PIC S9(09) COMP.
           05  WS-RECORDS-GENERATED    PIC S9(09) COMP VALUE 0.
           05  WS-CURRENT-SCENARIO     PIC X(10).
           05  WS-SCENARIO-SELECTOR    PIC S9(03) COMP.

       01  WS-GENERATED-ACCOUNT.
           05  WS-GEN-ACCT-NUMBER      PIC X(12).
           05  WS-GEN-ACCT-TYPE        PIC X(02).
           05  WS-GEN-ACCT-STATUS      PIC X(01).
           05  WS-GEN-BALANCE          PIC S9(13)V99 COMP-3.
           05  WS-GEN-LAST-TXN-DATE    PIC X(10).
           05  WS-GEN-INTEREST-RATE    PIC S9(1)V9(6) COMP-3.
           05  WS-GEN-ACCRUED-INT      PIC S9(09)V99  COMP-3.

       01  WS-SEED-VALUE              PIC S9(09) COMP.

       PROCEDURE DIVISION.
      *================================================================*
       0000-MAIN-CONTROL.
      *================================================================*
           PERFORM 1000-INITIALIZE
           PERFORM 2000-GENERATE-RECORDS
               VARYING WS-RECORDS-GENERATED FROM 1 BY 1
               UNTIL WS-RECORDS-GENERATED >
                     WS-TOTAL-RECORDS-TO-GEN
           PERFORM 9000-TERMINATE
           STOP RUN
           .

      *================================================================*
       2000-GENERATE-RECORDS.
      *================================================================*
           PERFORM 2100-DETERMINE-SCENARIO
           PERFORM 2200-GENERATE-ACCOUNT-NUMBER
           PERFORM 2300-GENERATE-ACCOUNT-TYPE
           PERFORM 2400-GENERATE-BALANCE
           PERFORM 2500-GENERATE-DATE-FIELDS
           PERFORM 2600-GENERATE-INTEREST-FIELDS
           PERFORM 2700-WRITE-TEST-RECORD
           .

      *================================================================*
       2100-DETERMINE-SCENARIO.
      *================================================================*
      *    Distribute generated records across scenarios to
      *    ensure coverage of all business rule branches.
      *    Percentages match production data distributions.
      *----------------------------------------------------------------*
           COMPUTE WS-SCENARIO-SELECTOR =
               FUNCTION MOD(WS-RECORDS-GENERATED, 100)

           EVALUATE TRUE
               WHEN WS-SCENARIO-SELECTOR < 40
      *            40% - Active accounts with normal balances
                   MOVE 'ACTIVE-NRM' TO WS-CURRENT-SCENARIO
                   MOVE 'A' TO WS-GEN-ACCT-STATUS
               WHEN WS-SCENARIO-SELECTOR < 55
      *            15% - Active accounts with zero balance
                   MOVE 'ACTIVE-ZER' TO WS-CURRENT-SCENARIO
                   MOVE 'A' TO WS-GEN-ACCT-STATUS
                   MOVE ZERO TO WS-GEN-BALANCE
               WHEN WS-SCENARIO-SELECTOR < 65
      *            10% - Active accounts with negative balance
                   MOVE 'ACTIVE-NEG' TO WS-CURRENT-SCENARIO
                   MOVE 'A' TO WS-GEN-ACCT-STATUS
               WHEN WS-SCENARIO-SELECTOR < 75
      *            10% - Dormant accounts
                   MOVE 'DORMANT   ' TO WS-CURRENT-SCENARIO
                   MOVE 'D' TO WS-GEN-ACCT-STATUS
               WHEN WS-SCENARIO-SELECTOR < 85
      *            10% - Closed accounts
                   MOVE 'CLOSED    ' TO WS-CURRENT-SCENARIO
                   MOVE 'C' TO WS-GEN-ACCT-STATUS
               WHEN WS-SCENARIO-SELECTOR < 90
      *            5% - Accounts on hold
                   MOVE 'ON-HOLD   ' TO WS-CURRENT-SCENARIO
                   MOVE 'H' TO WS-GEN-ACCT-STATUS
               WHEN WS-SCENARIO-SELECTOR < 95
      *            5% - Boundary value cases
                   MOVE 'BOUNDARY  ' TO WS-CURRENT-SCENARIO
               WHEN OTHER
      *            5% - Edge cases and error conditions
                   MOVE 'EDGE-CASE ' TO WS-CURRENT-SCENARIO
           END-EVALUATE
           .

      *================================================================*
       2400-GENERATE-BALANCE.
      *================================================================*
      *    Generate balance appropriate for the current scenario
      *----------------------------------------------------------------*
           EVALUATE WS-CURRENT-SCENARIO
               WHEN 'ACTIVE-NRM'
                   PERFORM 2410-RANDOM-NUMBER
                   COMPUTE WS-GEN-BALANCE =
                       WS-SEED-VALUE / 100
               WHEN 'ACTIVE-ZER'
                   MOVE ZERO TO WS-GEN-BALANCE
               WHEN 'ACTIVE-NEG'
                   PERFORM 2410-RANDOM-NUMBER
                   COMPUTE WS-GEN-BALANCE =
                       (WS-SEED-VALUE / 100) * -1
               WHEN 'BOUNDARY  '
      *            Generate maximum/minimum field values
                    MOVE 9999999999999.99 TO WS-GEN-BALANCE
               WHEN OTHER
                   PERFORM 2410-RANDOM-NUMBER
                   COMPUTE WS-GEN-BALANCE =
                       WS-SEED-VALUE / 100
           END-EVALUATE
           .

The scenario-based approach was critical. By ensuring that generated datasets included records representing every significant business condition (zero balances, negative balances, dormant accounts, boundary values, and so on), the test data generator produced far more thorough coverage than any manual tester could achieve.
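
To make the rule files concrete, a single field-level rule can be pictured as a fixed-format record along the following lines. This is an illustrative sketch only; the field names, lengths, and rule types are assumptions rather than the bank's actual configuration layout.

       01  GENERATION-RULE-RECORD.
      *    One rule per field of the target file layout (sketch)
           05  GR-TARGET-FIELD-NAME    PIC X(30).
           05  GR-RULE-TYPE            PIC X(10).
      *        e.g. RANGE, LIST, CONSTANT, DERIVED
           05  GR-MIN-VALUE            PIC X(18).
           05  GR-MAX-VALUE            PIC X(18).
           05  GR-VALUE-LIST           PIC X(60).
      *        Allowed values when GR-RULE-TYPE = LIST
           05  GR-SCENARIO-ID          PIC X(10).
      *        Scenario the rule applies to (e.g. ACTIVE-NRM)
           05  GR-DEPENDS-ON-FIELD     PIC X(30).
      *        Cross-field dependency (spaces if none)

In this sketch, the generator would read one rule record per field per scenario and apply it whenever 2100-DETERMINE-SCENARIO selected the matching scenario.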

The Test Execution Framework

The test execution framework automated the process of setting up test environments, running batch jobs, and collecting results. It was implemented as a combination of REXX scripts (for mainframe job submission and monitoring) and COBOL programs (for data setup and teardown).

A typical automated test execution followed this sequence:

  1. The framework read a test case definition that specified the input data configuration, the programs to execute, and the expected results (one possible layout for such a definition is sketched after this list).
  2. The test data generator produced the required input files and loaded them into the test VSAM datasets.
  3. The framework submitted the batch job and monitored it to completion.
  4. The output comparison engine compared actual results against expected results.
  5. The framework recorded the test outcome (pass, fail, or error) along with detailed diagnostics for any failures.
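
The test case definition in step 1 is easiest to picture as a small fixed-format record read by the framework. The layout below is a sketch with assumed field names and lengths, not the bank's actual copybook:

       01  TEST-CASE-DEFINITION.
           05  TC-TEST-CASE-ID         PIC X(12).
           05  TC-TARGET-PROGRAM       PIC X(08).
      *        Batch program under test
           05  TC-GENERATOR-CONFIG-DSN PIC X(44).
      *        Dataset holding the data generation rules
           05  TC-JOB-MEMBER           PIC X(08).
      *        JCL member submitted by the execution framework
           05  TC-EXPECTED-OUTPUT-DSN  PIC X(44).
      *        Baseline file used by the comparison engine
           05  TC-TOLERANCE-RULESET    PIC X(08).
      *        Named set of tolerance rules for the comparison run
           05  TC-LAST-BASELINE-DATE   PIC X(10).

In this sketch, the framework would read the definition, run the data generator against the rule dataset, submit the named job, and then pass the actual and baseline datasets to the comparison engine along with the selected tolerance rules.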

The Output Comparison Engine

The output comparison engine was a COBOL program that performed field-level comparison of output records against expected results. Unlike a simple file comparison that would flag any difference, the comparison engine understood the record layout and could apply tolerance rules to specific fields:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. TCOMPARE.
      *================================================================*
      * TEST OUTPUT COMPARISON ENGINE                                   *
      * Performs field-level comparison of actual vs. expected           *
      * output with configurable tolerance rules.                       *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-COMPARISON-RESULT.
           05  WS-FIELDS-COMPARED      PIC S9(07) COMP VALUE 0.
           05  WS-FIELDS-MATCHED       PIC S9(07) COMP VALUE 0.
           05  WS-FIELDS-MISMATCHED    PIC S9(07) COMP VALUE 0.
           05  WS-FIELDS-IN-TOLERANCE  PIC S9(07) COMP VALUE 0.

       01  WS-TOLERANCE-RULES.
           05  WS-NUMERIC-TOLERANCE    PIC S9(3)V9(4) COMP-3.
           05  WS-DATE-TOLERANCE-DAYS  PIC S9(3) COMP.
           05  WS-IGNORE-TIMESTAMP     PIC X(01).
               88  IGNORE-TIMESTAMPS   VALUE 'Y'.

       01  WS-DIFFERENCE-WORK-AREAS.
           05  WS-BALANCE-DIFF         PIC S9(13)V99 COMP-3.
           05  WS-INTEREST-DIFF        PIC S9(09)V99 COMP-3.

       01  WS-MISMATCH-DETAIL.
      *    Mismatch details logged by 3900-LOG-MISMATCH
           05  WS-MISMATCH-DESCRIPTION    PIC X(30).
           05  WS-MISMATCH-ACTUAL-VALUE   PIC S9(13)V99 COMP-3.
           05  WS-MISMATCH-EXPECTED-VALUE PIC S9(13)V99 COMP-3.

      *    The WS-ACTUAL-... and WS-EXPECTED-... record fields used
      *    below are defined elsewhere and omitted from this excerpt.

       PROCEDURE DIVISION.
      *================================================================*
       3000-COMPARE-ACCOUNT-OUTPUT.
      *================================================================*
           ADD 1 TO WS-FIELDS-COMPARED
           IF WS-ACTUAL-ACCT-NUMBER NOT =
              WS-EXPECTED-ACCT-NUMBER
               ADD 1 TO WS-FIELDS-MISMATCHED
               PERFORM 3900-LOG-MISMATCH
           ELSE
               ADD 1 TO WS-FIELDS-MATCHED
           END-IF

      *    Compare balance with tolerance for rounding differences
           ADD 1 TO WS-FIELDS-COMPARED
           COMPUTE WS-BALANCE-DIFF =
               FUNCTION ABS(WS-ACTUAL-BALANCE -
                            WS-EXPECTED-BALANCE)
           END-COMPUTE

           IF WS-BALANCE-DIFF > WS-NUMERIC-TOLERANCE
               ADD 1 TO WS-FIELDS-MISMATCHED
               PERFORM 3900-LOG-MISMATCH
           ELSE
               IF WS-BALANCE-DIFF > ZERO
                   ADD 1 TO WS-FIELDS-IN-TOLERANCE
               ELSE
                   ADD 1 TO WS-FIELDS-MATCHED
               END-IF
           END-IF

      *    Compare interest calculation
           ADD 1 TO WS-FIELDS-COMPARED
           COMPUTE WS-INTEREST-DIFF =
               FUNCTION ABS(WS-ACTUAL-ACCRUED-INT -
                            WS-EXPECTED-ACCRUED-INT)
           END-COMPUTE

           IF WS-INTEREST-DIFF > 0.01
               ADD 1 TO WS-FIELDS-MISMATCHED
               MOVE 'INTEREST ACCRUAL MISMATCH'
                   TO WS-MISMATCH-DESCRIPTION
               MOVE WS-ACTUAL-ACCRUED-INT
                   TO WS-MISMATCH-ACTUAL-VALUE
               MOVE WS-EXPECTED-ACCRUED-INT
                   TO WS-MISMATCH-EXPECTED-VALUE
               PERFORM 3900-LOG-MISMATCH
           ELSE
               ADD 1 TO WS-FIELDS-MATCHED
           END-IF
           .

The tolerance feature was important for financial calculations. When interest rates or calculation formulas were modified, legitimate rounding differences of a fraction of a cent could appear in the output, and a strict byte-for-byte comparison would flag those records as mismatched. The tolerance rules allowed the comparison engine to distinguish between genuine defects and acceptable rounding variations.

The Regression Test Repository

The regression test repository was the accumulated library of test cases that were run against every code change. It grew continuously: every defect found in production was accompanied by a new test case that would have caught it, so that the same class of error would be detected automatically before it could reach production again.

By the end of the first year, the repository contained 4,300 test cases organized by batch program and business function. Each test case included the input data configuration, execution parameters, and expected output files. The complete regression suite could be executed in approximately six hours, compared to the three to four weeks required for the equivalent manual testing.

Building Test Harnesses for Complex Programs

Some of the most critical batch programs were deeply intertwined with other programs and system resources, making isolated testing difficult. The interest calculation program, for example, read from four input files, accessed three DB2 tables, called two subprograms, and wrote to six output files. Testing it in isolation required creating "stubs" for its dependencies.

The team built a test harness framework that could intercept a program's external calls and substitute controlled responses:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. THARNESS.
      *================================================================*
      * TEST HARNESS - Interest Calculation Program                     *
      * Provides controlled test environment by stubbing external       *
      * dependencies and capturing all outputs.                         *
      *================================================================*

       DATA DIVISION.
       WORKING-STORAGE SECTION.

       01  WS-HARNESS-CONTROL.
           05  WS-DB2-STUB-MODE        PIC X(01).
               88  DB2-STUB-ACTIVE     VALUE 'Y'.
               88  DB2-LIVE            VALUE 'N'.
           05  WS-SUBPGM-STUB-MODE     PIC X(01).
               88  SUBPGM-STUB-ACTIVE  VALUE 'Y'.
               88  SUBPGM-LIVE         VALUE 'N'.

       01  WS-CAPTURE-AREA.
           05  WS-CALLS-TO-RATECALC    PIC S9(05) COMP VALUE 0.
           05  WS-CALLS-TO-FEECALC     PIC S9(05) COMP VALUE 0.
           05  WS-DB2-SELECTS          PIC S9(05) COMP VALUE 0.
           05  WS-DB2-UPDATES          PIC S9(05) COMP VALUE 0.
           05  WS-OUTPUT-RECORDS       PIC S9(07) COMP VALUE 0.

       01  WS-STUB-LOAD-CONTROL.
           05  WS-STUB-EOF-FLAG        PIC X(01) VALUE 'N'.
               88  NO-MORE-STUBS       VALUE 'Y'.
           05  WS-DB2-STUB-COUNT       PIC S9(05) COMP VALUE 0.

       01  WS-DB2-STUB-AREA.
      *    Stub record length and table size assumed for this excerpt
           05  WS-DB2-STUB-RECORD      PIC X(200).
           05  WS-DB2-STUB-TABLE       OCCURS 500 TIMES
                                       PIC X(200).

       01  WS-PROGRAM-PARAMETERS       PIC X(100).
      *    Parameter layout for INTCALC1 omitted from this excerpt

       PROCEDURE DIVISION.
      *================================================================*
       0000-MAIN-CONTROL.
      *================================================================*
           PERFORM 0100-SETUP-TEST-ENVIRONMENT
           PERFORM 0200-LOAD-TEST-CONFIGURATION
           PERFORM 0300-LOAD-STUB-DATA

      *    Execute the program under test
           CALL 'INTCALC1' USING WS-PROGRAM-PARAMETERS

      *    Capture and verify results
           PERFORM 5000-CAPTURE-OUTPUTS
           PERFORM 6000-COMPARE-RESULTS
           PERFORM 7000-GENERATE-TEST-REPORT
           STOP RUN
           .

      *================================================================*
       0300-LOAD-STUB-DATA.
      *================================================================*
      *    Load pre-configured responses for DB2 queries
      *    and subprogram calls from the test case definition.
      *----------------------------------------------------------------*
           IF DB2-STUB-ACTIVE
               READ STUB-DATA-FILE INTO WS-DB2-STUB-RECORD
                   AT END SET NO-MORE-STUBS TO TRUE
               END-READ
               PERFORM UNTIL NO-MORE-STUBS
                   ADD 1 TO WS-DB2-STUB-COUNT
                   MOVE WS-DB2-STUB-RECORD
                       TO WS-DB2-STUB-TABLE(WS-DB2-STUB-COUNT)
                   READ STUB-DATA-FILE INTO WS-DB2-STUB-RECORD
                       AT END SET NO-MORE-STUBS TO TRUE
                   END-READ
               END-PERFORM
           END-IF
           .

The harness also captured metrics about the program's behavior during the test: the number of records processed, the number of database operations performed, the number of subprogram calls made, and the final return code. These metrics served as a secondary validation: even if the output records matched expectations, a significant change in the number of database operations or subprogram calls might indicate a logic change that warranted investigation.
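
As a sketch of what that secondary check might look like, the harness could carry per-test-case baseline counts and compare them against the captured values after the run. The expected-count fields, the warning field, and the logging paragraph below are illustrative assumptions, not taken from the bank's harness:

       01  WS-EXPECTED-METRICS.
      *    Baseline counts recorded with the test case (assumed)
           05  WS-EXP-CALLS-TO-RATECALC PIC S9(05) COMP.
           05  WS-EXP-DB2-UPDATES       PIC S9(05) COMP.
           05  WS-EXP-OUTPUT-RECORDS    PIC S9(07) COMP.
       01  WS-METRIC-WARNING            PIC X(40).

      *================================================================*
       6100-VERIFY-BEHAVIOR-METRICS.
      *================================================================*
      *    Secondary validation: flag the run for review when the
      *    captured counters drift from the recorded baseline.
      *----------------------------------------------------------------*
           IF WS-CALLS-TO-RATECALC NOT = WS-EXP-CALLS-TO-RATECALC
               MOVE 'RATECALC CALL COUNT CHANGED'
                   TO WS-METRIC-WARNING
               PERFORM 6900-LOG-METRIC-WARNING
           END-IF

           IF WS-DB2-UPDATES NOT = WS-EXP-DB2-UPDATES
               MOVE 'DB2 UPDATE COUNT CHANGED'
                   TO WS-METRIC-WARNING
               PERFORM 6900-LOG-METRIC-WARNING
           END-IF

           IF WS-OUTPUT-RECORDS NOT = WS-EXP-OUTPUT-RECORDS
               MOVE 'OUTPUT RECORD COUNT CHANGED'
                   TO WS-METRIC-WARNING
               PERFORM 6900-LOG-METRIC-WARNING
           END-IF
           .

In this sketch, a metric mismatch flags the run for investigation rather than failing it outright, matching the secondary-validation role described above.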

Integrating with CI/CD

The most transformative aspect of the automation project was integrating the test framework with the bank's CI/CD (Continuous Integration / Continuous Delivery) pipeline. The bank had adopted CI/CD for its distributed systems development years earlier, but the mainframe development process had remained entirely manual.

The team configured the pipeline as follows:

  1. When a developer committed a COBOL source change to the version control system, the CI pipeline automatically compiled the program.
  2. If compilation succeeded, the pipeline submitted the unit test suite for the modified program, consisting of all test cases in the regression repository associated with that program.
  3. If unit tests passed, the pipeline submitted an integration test suite that tested the modified program in the context of its upstream and downstream dependencies in the batch chain.
  4. If integration tests passed, the change was automatically promoted to the system test environment for broader regression testing.
  5. A complete regression run of the full 4,300-test suite was executed nightly, regardless of whether any code changes had occurred, to detect environmental or configuration drift.

The pipeline used a combination of Git for source control, Jenkins for orchestration, and custom-built REXX procedures for mainframe job submission. The integration was not seamless; adapting tools designed for distributed systems to mainframe workflows required considerable ingenuity. But the result was a development process where a developer could commit a change in the morning and know by afternoon whether it had passed all automated tests.

Results: From Weeks to Days

The impact of the automated testing initiative was measured across several dimensions.

Testing Cycle Time. The average testing cycle for a single program change dropped from 5 business days (manual) to 4 hours (automated unit tests) or 8 hours (automated unit plus integration tests). For complex changes affecting multiple programs, the cycle dropped from 3-4 weeks to 2-3 days. This acceleration was not merely a convenience; it directly reduced the time to deliver business-requested changes, improving the bank's agility.

Test Coverage. The automated test suite executed an average of 340 test scenarios per program, compared to the 8-12 scenarios typically covered by manual testing. Boundary conditions, error paths, and rare data combinations that manual testers never had time to verify were now tested routinely.

Defect Detection. In the twelve months following full deployment of the automated testing framework, the production defect rate for batch processing dropped from fourteen per year to three per year, a 79% reduction. All three remaining defects were in areas not yet covered by the test suite (two involved inter-system interfaces that required the downstream system to be active, and one involved a timing-dependent condition that occurred only under production load levels).

Defect Detection Timing. Of defects found by automated testing, 72% were caught during unit testing (within hours of the code change), 21% during integration testing (within one business day), and 7% during nightly regression runs. This early detection was significant: a defect caught in unit testing cost an average of 1.5 hours to fix, while a defect that escaped to production cost an average of $87,000 as noted earlier.

Developer Confidence. An intangible but important benefit was the increase in developer confidence when making changes. Before automation, developers were reluctant to modify complex programs because the manual testing process was so slow and unreliable that the risk of introducing a production defect felt high. With automated testing providing rapid, comprehensive validation, developers reported feeling substantially more confident in their changes. This confidence contributed to a measurable increase in the number of improvements and optimizations that developers proposed and implemented.

Test Maintenance. The ongoing cost of maintaining the test suite was approximately 15% of the initial development effort. New test cases were added with every code change (developers were required to include tests for any new or modified logic), and existing test cases occasionally needed updating when business rules changed. The team designated one developer as the Testing Lead, responsible for maintaining test data generators, updating comparison rules, and curating the regression repository.

Challenges and Adaptations

The initiative was not without challenges.

The most significant early challenge was resistance from developers who viewed test writing as an unwelcome addition to their workload. Karen addressed this by demonstrating the time savings: while writing tests added approximately two hours to a typical change, the elimination of the five-day manual testing cycle more than compensated. Once developers experienced the rapid feedback loop of automated testing, resistance faded.

A technical challenge was test data isolation. Because the test environment shared some infrastructure with development, tests occasionally interfered with each other when two developers ran test suites simultaneously. The team resolved this by implementing a reservation system for test datasets, ensuring that each test run operated on its own isolated copy of the input files.

Another challenge was the maintenance burden of expected output files. When a legitimate business rule change was implemented, the expected output for every affected test case had to be updated. The team addressed this by building a "re-baseline" utility that could regenerate expected outputs for a specified set of test cases after a verified business rule change.
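
A minimal sketch of the per-case logic in such a utility is shown below. The paragraph and field names are illustrative (the TC- fields extend the hypothetical test case layout sketched earlier), and the bank's actual approval workflow is not documented here:

      *================================================================*
       3000-REBASELINE-ONE-CASE.
      *================================================================*
      *    For a test case whose business rule change has been
      *    reviewed and approved, replace the stored expected output
      *    with the actual output from the verified run and record
      *    the approval details for audit purposes.
      *----------------------------------------------------------------*
           PERFORM 3100-COPY-ACTUAL-TO-EXPECTED
           MOVE WS-APPROVER-ID      TO TC-BASELINE-APPROVED-BY
           MOVE WS-CHANGE-TICKET    TO TC-BASELINE-CHANGE-REF
           MOVE WS-CURRENT-DATE     TO TC-LAST-BASELINE-DATE
           PERFORM 3200-REWRITE-TEST-CASE-RECORD
           ADD 1 TO WS-CASES-REBASELINED
           .

In this sketch, tying each re-baseline to an approver and a change reference would preserve an audit trail showing why an expected output was allowed to change.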

Conclusion

Founders National Bank's implementation of automated testing for its COBOL batch processing suite demonstrates that modern testing practices are not only applicable to mainframe COBOL environments but can deliver dramatic improvements in quality, speed, and developer productivity. The keys to success were purpose-built tooling (test data generators, comparison engines, and harness frameworks) tailored to the characteristics of batch COBOL processing, integration of those tools into a CI/CD pipeline that provided rapid feedback, and a culture in which automated testing was treated as an essential part of the development process rather than an optional afterthought.

The financial return was compelling: a $1.2 million investment that eliminated an estimated $960,000 in annual production defect costs while simultaneously reducing time-to-delivery by 70%. For banks and other organizations operating large COBOL batch processing environments, the message is clear: automated testing is not a luxury reserved for modern technology stacks. It is an achievable, practical, and highly profitable investment for mainframe environments as well.