Case Study: Implementing Automated Testing for a Banking Platform
Background
Founders National Bank, a mid-tier commercial bank with $12 billion in assets, operated a COBOL-based batch processing suite that was the backbone of its nightly operations. Each night, beginning at 6:00 PM and concluding by 5:30 AM the following morning, a sequence of 287 batch jobs executed in a carefully orchestrated chain that processed the day's transactions, calculated interest, posted fees, generated regulatory reports, produced customer statements, and synchronized data with downstream systems, including the bank's ATM network, online banking platform, and general ledger.
The batch suite had been in production for over twenty-five years, growing incrementally as the bank added products and services. Testing of changes to batch programs was almost entirely manual. When a developer modified a program, the testing process followed a well-worn but labor-intensive path: a tester would create test data by hand-crafting records in a test VSAM file, run the modified job step in isolation, and then manually compare the output files against expected results using a file comparison utility and a spreadsheet of expected values.
This manual testing process had three critical problems. First, it was slow. The average test cycle for a single batch program change took five business days, and complex changes that affected multiple interdependent programs could take three to four weeks. Second, it was incomplete. Manual testers could realistically verify only a handful of test scenarios per program change, while the actual business logic often contained hundreds of conditional paths. Third, it was unreliable. Human comparison of output files, particularly those containing millions of records, was inherently error-prone. Defects that affected a small number of records could easily be missed.
The consequences were tangible. In the twelve months before the automation initiative, the bank experienced fourteen production defects in batch processing, of which eight were classified as severity one or severity two, meaning they affected financial accuracy or regulatory reporting. The average cost of a batch production defect, including emergency remediation, reprocessing, customer notification, and regulatory disclosure, was estimated at $87,000.
Karen Nakamura, the bank's VP of Quality Engineering, proposed and received approval for a project to implement automated testing for the batch processing suite. The project was budgeted at $1.2 million over eighteen months, with the goal of reducing the testing cycle from weeks to days and cutting the production defect rate by at least 50%.
Designing the Test Architecture
Karen assembled a team of two COBOL developers, one testing specialist, and one automation engineer. Their first task was to design a test architecture that could accommodate the unique characteristics of mainframe batch processing.
The architecture they developed had four components: a test data generator, a test execution framework, an output comparison engine, and a regression test repository.
The Test Data Generator
The most fundamental challenge was creating realistic test data at scale. Manual testers had typically worked with a few dozen hand-crafted records. Automated testing required datasets of thousands or tens of thousands of records that exercised every branch of the business logic.
The team built a COBOL-based test data generator that could produce records conforming to any VSAM or flat file layout. The generator was driven by configuration files that specified the field-level rules for generating data: value ranges, distributions, constraints, and cross-field dependencies.
IDENTIFICATION DIVISION.
PROGRAM-ID. TDATAGEN.
*================================================================*
* TEST DATA GENERATOR *
* Generates test records based on configuration rules. *
* Produces realistic data with controlled coverage of *
* business rule branches. *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01  WS-GENERATION-CONTROL.
    05  WS-TOTAL-RECORDS-TO-GEN   PIC S9(09) COMP.
    05  WS-RECORDS-GENERATED      PIC S9(09) COMP VALUE 0.
    05  WS-CURRENT-SCENARIO       PIC X(10).
    05  WS-SCENARIO-SELECTOR      PIC S9(03) COMP.
01  WS-GENERATED-ACCOUNT.
    05  WS-GEN-ACCT-NUMBER        PIC X(12).
    05  WS-GEN-ACCT-TYPE          PIC X(02).
    05  WS-GEN-ACCT-STATUS        PIC X(01).
    05  WS-GEN-BALANCE            PIC S9(13)V99 COMP-3.
    05  WS-GEN-LAST-TXN-DATE      PIC X(10).
    05  WS-GEN-INTEREST-RATE      PIC S9(1)V9(6) COMP-3.
    05  WS-GEN-ACCRUED-INT        PIC S9(09)V99 COMP-3.
01  WS-SEED-VALUE                 PIC S9(09) COMP.
PROCEDURE DIVISION.
*================================================================*
0000-MAIN-CONTROL.
*================================================================*
    PERFORM 1000-INITIALIZE
    PERFORM 2000-GENERATE-RECORDS
        VARYING WS-RECORDS-GENERATED FROM 1 BY 1
        UNTIL WS-RECORDS-GENERATED >
              WS-TOTAL-RECORDS-TO-GEN
    PERFORM 9000-TERMINATE
    STOP RUN
    .
*================================================================*
2000-GENERATE-RECORDS.
*================================================================*
    PERFORM 2100-DETERMINE-SCENARIO
    PERFORM 2200-GENERATE-ACCOUNT-NUMBER
    PERFORM 2300-GENERATE-ACCOUNT-TYPE
    PERFORM 2400-GENERATE-BALANCE
    PERFORM 2500-GENERATE-DATE-FIELDS
    PERFORM 2600-GENERATE-INTEREST-FIELDS
    PERFORM 2700-WRITE-TEST-RECORD
    .
*================================================================*
2100-DETERMINE-SCENARIO.
*================================================================*
* Distribute generated records across scenarios to
* ensure coverage of all business rule branches.
* Percentages match production data distributions.
*----------------------------------------------------------------*
    COMPUTE WS-SCENARIO-SELECTOR =
        FUNCTION MOD(WS-RECORDS-GENERATED, 100)
    EVALUATE TRUE
        WHEN WS-SCENARIO-SELECTOR < 40
*           40% - Active accounts with normal balances
            MOVE 'ACTIVE-NRM' TO WS-CURRENT-SCENARIO
            MOVE 'A' TO WS-GEN-ACCT-STATUS
        WHEN WS-SCENARIO-SELECTOR < 55
*           15% - Active accounts with zero balance
            MOVE 'ACTIVE-ZER' TO WS-CURRENT-SCENARIO
            MOVE 'A' TO WS-GEN-ACCT-STATUS
            MOVE ZERO TO WS-GEN-BALANCE
        WHEN WS-SCENARIO-SELECTOR < 65
*           10% - Active accounts with negative balance
            MOVE 'ACTIVE-NEG' TO WS-CURRENT-SCENARIO
            MOVE 'A' TO WS-GEN-ACCT-STATUS
        WHEN WS-SCENARIO-SELECTOR < 75
*           10% - Dormant accounts
            MOVE 'DORMANT ' TO WS-CURRENT-SCENARIO
            MOVE 'D' TO WS-GEN-ACCT-STATUS
        WHEN WS-SCENARIO-SELECTOR < 85
*           10% - Closed accounts
            MOVE 'CLOSED ' TO WS-CURRENT-SCENARIO
            MOVE 'C' TO WS-GEN-ACCT-STATUS
        WHEN WS-SCENARIO-SELECTOR < 90
*           5% - Accounts on hold
            MOVE 'ON-HOLD ' TO WS-CURRENT-SCENARIO
            MOVE 'H' TO WS-GEN-ACCT-STATUS
        WHEN WS-SCENARIO-SELECTOR < 95
*           5% - Boundary value cases
            MOVE 'BOUNDARY ' TO WS-CURRENT-SCENARIO
        WHEN OTHER
*           5% - Edge cases and error conditions
            MOVE 'EDGE-CASE ' TO WS-CURRENT-SCENARIO
    END-EVALUATE
    .
*================================================================*
2400-GENERATE-BALANCE.
*================================================================*
* Generate balance appropriate for the current scenario
*----------------------------------------------------------------*
    EVALUATE WS-CURRENT-SCENARIO
        WHEN 'ACTIVE-NRM'
            PERFORM 2410-RANDOM-NUMBER
            COMPUTE WS-GEN-BALANCE =
                WS-SEED-VALUE / 100
        WHEN 'ACTIVE-ZER'
            MOVE ZERO TO WS-GEN-BALANCE
        WHEN 'ACTIVE-NEG'
            PERFORM 2410-RANDOM-NUMBER
            COMPUTE WS-GEN-BALANCE =
                (WS-SEED-VALUE / 100) * -1
        WHEN 'BOUNDARY '
*           Generate maximum/minimum field values
            MOVE 9999999999999.99 TO WS-GEN-BALANCE
        WHEN OTHER
            PERFORM 2410-RANDOM-NUMBER
            COMPUTE WS-GEN-BALANCE =
                WS-SEED-VALUE / 100
    END-EVALUATE
    .
The scenario-based approach was critical. By ensuring that generated datasets included records representing every significant business condition (zero balances, negative balances, dormant accounts, boundary values, and so on), the test data generator produced far more thorough coverage than any manual tester could achieve.
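As a rough illustration of what such a configuration might contain, a single field-level generation rule could be stored as a fixed-format record along the following lines (the field names are hypothetical, not the bank's actual layout):

*   Hypothetical layout for one field-level generation rule;
*   the generator would read one such record per field at startup.
01  GEN-RULE-RECORD.
    05  GR-FIELD-NAME             PIC X(30).
    05  GR-FIELD-TYPE             PIC X(01).
*       'N' = numeric, 'X' = alphanumeric, 'D' = date
    05  GR-MIN-VALUE              PIC S9(13)V99.
    05  GR-MAX-VALUE              PIC S9(13)V99.
    05  GR-DISTRIBUTION           PIC X(01).
*       'U' = uniform across the range, 'S' = skewed low
    05  GR-SCENARIO-OVERRIDE      PIC X(10).
*       Scenario that forces a fixed value, e.g. 'ACTIVE-ZER'
    05  GR-OVERRIDE-VALUE         PIC S9(13)V99.
    05  GR-DEPENDS-ON-FIELD       PIC X(30).
*       Cross-field dependency (spaces if none)

In this sketch, the value range and distribution cover ordinary records, while the scenario override and cross-field dependency express the kinds of constraints and dependencies the configuration files had to capture.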
The Test Execution Framework
The test execution framework automated the process of setting up test environments, running batch jobs, and collecting results. It was implemented as a combination of REXX scripts (for mainframe job submission and monitoring) and COBOL programs (for data setup and teardown).
A typical automated test execution followed this sequence:
- The framework read a test case definition that specified the input data configuration, the programs to execute, and the expected results (one possible layout for such a definition is sketched after this list).
- The test data generator produced the required input files and loaded them into the test VSAM datasets.
- The framework submitted the batch job and monitored it to completion.
- The output comparison engine compared actual results against expected results.
- The framework recorded the test outcome (pass, fail, or error) along with detailed diagnostics for any failures.
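As a rough illustration of what such a test case definition might contain (field names are hypothetical, not the bank's actual layout):

01  TEST-CASE-DEFINITION.
    05  TC-TEST-CASE-ID           PIC X(12).
    05  TC-PROGRAM-UNDER-TEST     PIC X(08).
*       e.g. 'INTCALC1'
    05  TC-DATA-CONFIG-MEMBER     PIC X(08).
*       Generator configuration used to build the input files
    05  TC-JCL-MEMBER             PIC X(08).
*       Job submitted by the REXX execution scripts
    05  TC-EXPECTED-OUTPUT-DSN    PIC X(44).
*       Baseline dataset handed to the comparison engine
    05  TC-TOLERANCE-MEMBER       PIC X(08).
*       Tolerance rules applied during comparison
    05  TC-LAST-RESULT            PIC X(01).
*       'P' = pass, 'F' = fail, 'E' = error

In this sketch, the execution framework would use the configuration member to drive the test data generator and pass the baseline dataset and tolerance member to the comparison engine.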
The Output Comparison Engine
The output comparison engine was a COBOL program that performed field-level comparison of output records against expected results. Unlike a simple file comparison that would flag any difference, the comparison engine understood the record layout and could apply tolerance rules to specific fields:
IDENTIFICATION DIVISION.
PROGRAM-ID. TCOMPARE.
*================================================================*
* TEST OUTPUT COMPARISON ENGINE *
* Performs field-level comparison of actual vs. expected *
* output with configurable tolerance rules. *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01  WS-COMPARISON-RESULT.
    05  WS-FIELDS-COMPARED        PIC S9(07) COMP.
    05  WS-FIELDS-MATCHED         PIC S9(07) COMP.
    05  WS-FIELDS-MISMATCHED      PIC S9(07) COMP.
    05  WS-FIELDS-IN-TOLERANCE    PIC S9(07) COMP.
01  WS-TOLERANCE-RULES.
    05  WS-NUMERIC-TOLERANCE      PIC S9(3)V9(4) COMP-3.
    05  WS-DATE-TOLERANCE-DAYS    PIC S9(3) COMP.
    05  WS-IGNORE-TIMESTAMP       PIC X(01).
        88  IGNORE-TIMESTAMPS     VALUE 'Y'.
01  WS-WORK-FIELDS.
    05  WS-BALANCE-DIFF           PIC S9(13)V99 COMP-3.
    05  WS-INTEREST-DIFF          PIC S9(09)V99 COMP-3.
01  WS-MISMATCH-DETAIL.
    05  WS-MISMATCH-DESCRIPTION   PIC X(30).
    05  WS-MISMATCH-ACTUAL-VALUE  PIC S9(13)V99 COMP-3.
    05  WS-MISMATCH-EXPECTED-VALUE PIC S9(13)V99 COMP-3.
*   The WS-ACTUAL-... and WS-EXPECTED-... record layouts are
*   brought in from copybooks not shown in this excerpt.
PROCEDURE DIVISION.
*================================================================*
3000-COMPARE-ACCOUNT-OUTPUT.
*================================================================*
    ADD 1 TO WS-FIELDS-COMPARED
    IF WS-ACTUAL-ACCT-NUMBER NOT =
       WS-EXPECTED-ACCT-NUMBER
        ADD 1 TO WS-FIELDS-MISMATCHED
        PERFORM 3900-LOG-MISMATCH
    ELSE
        ADD 1 TO WS-FIELDS-MATCHED
    END-IF
*   Compare balance with tolerance for rounding differences
    ADD 1 TO WS-FIELDS-COMPARED
    COMPUTE WS-BALANCE-DIFF =
        FUNCTION ABS(WS-ACTUAL-BALANCE -
                     WS-EXPECTED-BALANCE)
    END-COMPUTE
    IF WS-BALANCE-DIFF > WS-NUMERIC-TOLERANCE
        ADD 1 TO WS-FIELDS-MISMATCHED
        PERFORM 3900-LOG-MISMATCH
    ELSE
        IF WS-BALANCE-DIFF > ZERO
            ADD 1 TO WS-FIELDS-IN-TOLERANCE
        ELSE
            ADD 1 TO WS-FIELDS-MATCHED
        END-IF
    END-IF
*   Compare interest calculation
    ADD 1 TO WS-FIELDS-COMPARED
    COMPUTE WS-INTEREST-DIFF =
        FUNCTION ABS(WS-ACTUAL-ACCRUED-INT -
                     WS-EXPECTED-ACCRUED-INT)
    END-COMPUTE
    IF WS-INTEREST-DIFF > 0.01
        ADD 1 TO WS-FIELDS-MISMATCHED
        MOVE 'INTEREST ACCRUAL MISMATCH'
            TO WS-MISMATCH-DESCRIPTION
        MOVE WS-ACTUAL-ACCRUED-INT
            TO WS-MISMATCH-ACTUAL-VALUE
        MOVE WS-EXPECTED-ACCRUED-INT
            TO WS-MISMATCH-EXPECTED-VALUE
        PERFORM 3900-LOG-MISMATCH
    ELSE
        ADD 1 TO WS-FIELDS-MATCHED
    END-IF
    .
The tolerance feature was important for financial calculations. When interest rates or calculation formulas were modified, rounding could shift results by a fraction of a cent, and a strict exact-match comparison would flag those records as mismatches. The tolerance rules allowed the comparison engine to distinguish genuine defects from acceptable rounding variation.
The Regression Test Repository
The regression test repository was the accumulated library of test cases run against every code change. It grew continuously: every defect found in production was accompanied by a new test case that would have caught it, so that the same class of error would be detected automatically if it ever recurred.
By the end of the first year, the repository contained 4,300 test cases organized by batch program and business function. Each test case included the input data configuration, execution parameters, and expected output files. The complete regression suite could be executed in approximately six hours, compared to the three to four weeks required for the equivalent manual testing.
Building Test Harnesses for Complex Programs
Some of the most critical batch programs were deeply intertwined with other programs and system resources, making isolated testing difficult. The interest calculation program, for example, read from four input files, accessed three DB2 tables, called two subprograms, and wrote to six output files. Testing it in isolation required creating "stubs" for its dependencies.
The team built a test harness framework that could intercept a program's external calls and substitute controlled responses:
IDENTIFICATION DIVISION.
PROGRAM-ID. THARNESS.
*================================================================*
* TEST HARNESS - Interest Calculation Program *
* Provides controlled test environment by stubbing external *
* dependencies and capturing all outputs. *
*================================================================*
DATA DIVISION.
WORKING-STORAGE SECTION.
01  WS-HARNESS-CONTROL.
    05  WS-DB2-STUB-MODE          PIC X(01).
        88  DB2-STUB-ACTIVE       VALUE 'Y'.
        88  DB2-LIVE              VALUE 'N'.
    05  WS-SUBPGM-STUB-MODE       PIC X(01).
        88  SUBPGM-STUB-ACTIVE    VALUE 'Y'.
        88  SUBPGM-LIVE           VALUE 'N'.
01  WS-CAPTURE-AREA.
    05  WS-CALLS-TO-RATECALC      PIC S9(05) COMP VALUE 0.
    05  WS-CALLS-TO-FEECALC       PIC S9(05) COMP VALUE 0.
    05  WS-DB2-SELECTS            PIC S9(05) COMP VALUE 0.
    05  WS-DB2-UPDATES            PIC S9(05) COMP VALUE 0.
    05  WS-OUTPUT-RECORDS         PIC S9(07) COMP VALUE 0.
*   Stub-load work areas; record size and table capacity shown
*   here are illustrative, since the full layouts (including the
*   FD for STUB-DATA-FILE and the INTCALC1 parameter layout) are
*   omitted from this excerpt.
01  WS-STUB-CONTROL.
    05  WS-STUB-EOF-FLAG          PIC X(01) VALUE 'N'.
        88  NO-MORE-STUBS         VALUE 'Y'.
    05  WS-DB2-STUB-COUNT         PIC S9(05) COMP VALUE 0.
01  WS-DB2-STUB-RECORD            PIC X(200).
01  WS-DB2-STUB-AREA.
    05  WS-DB2-STUB-TABLE         PIC X(200) OCCURS 500 TIMES.
01  WS-PROGRAM-PARAMETERS         PIC X(100).
PROCEDURE DIVISION.
*================================================================*
0000-MAIN-CONTROL.
*================================================================*
    PERFORM 0100-SETUP-TEST-ENVIRONMENT
    PERFORM 0200-LOAD-TEST-CONFIGURATION
    PERFORM 0300-LOAD-STUB-DATA
*   Execute the program under test
    CALL 'INTCALC1' USING WS-PROGRAM-PARAMETERS
*   Capture and verify results
    PERFORM 5000-CAPTURE-OUTPUTS
    PERFORM 6000-COMPARE-RESULTS
    PERFORM 7000-GENERATE-TEST-REPORT
    STOP RUN
    .
*================================================================*
0300-LOAD-STUB-DATA.
*================================================================*
* Load pre-configured responses for DB2 queries
* and subprogram calls from the test case definition.
*----------------------------------------------------------------*
    IF DB2-STUB-ACTIVE
        READ STUB-DATA-FILE INTO WS-DB2-STUB-RECORD
            AT END SET NO-MORE-STUBS TO TRUE
        END-READ
        PERFORM UNTIL NO-MORE-STUBS
            ADD 1 TO WS-DB2-STUB-COUNT
            MOVE WS-DB2-STUB-RECORD
                TO WS-DB2-STUB-TABLE(WS-DB2-STUB-COUNT)
            READ STUB-DATA-FILE INTO WS-DB2-STUB-RECORD
                AT END SET NO-MORE-STUBS TO TRUE
            END-READ
        END-PERFORM
    END-IF
    .
The harness also captured metrics about the program's behavior during the test: the number of records processed, the number of database operations performed, the number of subprogram calls made, and the final return code. These metrics served as a secondary validation: even if the output records matched expectations, a significant change in the number of database operations or subprogram calls might indicate a logic change that warranted investigation.
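To make the stubbing and metric capture concrete, the following is a minimal sketch of a stub that could be link-edited in place of the production RATECALC subprogram; the linkage layout, the canned rate, and the EXTERNAL call counter are illustrative assumptions rather than the bank's actual interface:

IDENTIFICATION DIVISION.
PROGRAM-ID. RATECALC.
*   Stub standing in for the production rate-lookup subprogram.
*   Returns a canned rate and counts each call so the harness can
*   compare call volumes against a known-good baseline run.
DATA DIVISION.
WORKING-STORAGE SECTION.
*   EXTERNAL so the harness can read the count after the run;
*   the harness is assumed to zero it during test setup.
01  WS-RATECALC-CALL-COUNT        PIC S9(05) COMP EXTERNAL.
01  WS-STUBBED-RATE               PIC S9(1)V9(6) COMP-3
                                  VALUE 0.042500.
LINKAGE SECTION.
01  LS-RATE-REQUEST.
    05  LS-ACCT-TYPE              PIC X(02).
    05  LS-RETURNED-RATE          PIC S9(1)V9(6) COMP-3.
    05  LS-RETURN-CODE            PIC S9(04) COMP.
PROCEDURE DIVISION USING LS-RATE-REQUEST.
0000-RETURN-STUBBED-RATE.
    ADD 1 TO WS-RATECALC-CALL-COUNT
    MOVE WS-STUBBED-RATE TO LS-RETURNED-RATE
    MOVE ZERO TO LS-RETURN-CODE
    GOBACK
    .

After the run, the harness could copy the external counter into WS-CALLS-TO-RATECALC and flag any large deviation from the baseline call count for investigation.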
Integrating with CI/CD
The most transformative aspect of the automation project was integrating the test framework with the bank's fledgling CI/CD (Continuous Integration / Continuous Delivery) pipeline. While the bank had adopted CI/CD for its distributed systems development years earlier, the mainframe development process had remained entirely manual.
The team configured the pipeline as follows:
- When a developer committed a COBOL source change to the version control system, the CI pipeline automatically compiled the program.
- If compilation succeeded, the pipeline submitted the unit test suite for the modified program, consisting of all test cases in the regression repository associated with that program.
- If unit tests passed, the pipeline submitted an integration test suite that tested the modified program in the context of its upstream and downstream dependencies in the batch chain.
- If integration tests passed, the change was automatically promoted to the system test environment for broader regression testing.
- A complete regression run of the full 4,300-test suite was executed nightly, regardless of whether any code changes had occurred, to detect environmental or configuration drift.
The pipeline used a combination of Git for source control, Jenkins for orchestration, and custom-built REXX procedures for mainframe job submission. The integration was not seamless; adapting tools designed for distributed systems to mainframe workflows required considerable ingenuity. But the result was a development process where a developer could commit a change in the morning and know by afternoon whether it had passed all automated tests.
Results: From Weeks to Days
The impact of the automated testing initiative was measured across several dimensions.
Testing Cycle Time. The average testing cycle for a single program change dropped from 5 business days (manual) to 4 hours (automated unit tests) or 8 hours (automated unit plus integration tests). For complex changes affecting multiple programs, the cycle dropped from 3-4 weeks to 2-3 days. This acceleration was not merely a convenience; it directly reduced the time to deliver business-requested changes, improving the bank's agility.
Test Coverage. The automated test suite executed an average of 340 test scenarios per program, compared to the 8-12 scenarios typically covered by manual testing. Boundary conditions, error paths, and rare data combinations that manual testers never had time to verify were now tested routinely.
Defect Detection. In the twelve months following full deployment of the automated testing framework, the production defect rate for batch processing dropped from fourteen per year to three per year, a 79% reduction. All three remaining defects were in areas not yet covered by the test suite (two involved inter-system interfaces that required the downstream system to be active, and one involved a timing-dependent condition that occurred only under production load levels).
Defect Detection Timing. Of defects found by automated testing, 72% were caught during unit testing (within hours of the code change), 21% during integration testing (within one business day), and 7% during nightly regression runs. This early detection was significant: a defect caught in unit testing cost an average of 1.5 hours to fix, while a defect that escaped to production cost an average of $87,000 as noted earlier.
Developer Confidence. An intangible but important benefit was the increase in developer confidence when making changes. Before automation, developers were reluctant to modify complex programs because the manual testing process was so slow and unreliable that the risk of introducing a production defect felt high. With automated testing providing rapid, comprehensive validation, developers reported feeling substantially more confident in their changes. This confidence contributed to a measurable increase in the number of improvements and optimizations that developers proposed and implemented.
Test Maintenance. The ongoing cost of maintaining the test suite was approximately 15% of the initial development effort. New test cases were added with every code change (developers were required to include tests for any new or modified logic), and existing test cases occasionally needed updating when business rules changed. The team designated one developer as the Testing Lead, responsible for maintaining test data generators, updating comparison rules, and curating the regression repository.
Challenges and Adaptations
The initiative was not without challenges.
The most significant early challenge was resistance from developers who viewed test writing as an unwelcome addition to their workload. Karen addressed this by demonstrating the time savings: while writing tests added approximately two hours to a typical change, the elimination of the five-day manual testing cycle more than compensated. Once developers experienced the rapid feedback loop of automated testing, resistance faded.
A technical challenge was test data isolation. Because the test environment shared some infrastructure with development, tests occasionally interfered with each other when two developers ran test suites simultaneously. The team resolved this by implementing a reservation system for test datasets, ensuring that each test run operated on its own isolated copy of the input files.
Another challenge was the maintenance burden of expected output files. When a legitimate business rule change was implemented, the expected output for every affected test case had to be updated. The team addressed this by building a "re-baseline" utility that could regenerate expected outputs for a specified set of test cases after a verified business rule change.
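A minimal sketch of the re-baseline idea, assuming the actual and expected outputs are sequential files supplied through JCL DD statements (ACTOUT and EXPOUT are illustrative ddnames, and the record length is arbitrary):

IDENTIFICATION DIVISION.
PROGRAM-ID. TREBASE.
*   Re-baseline sketch: after a verified business rule change,
*   copy the actual output of a test run over the stored
*   expected-output baseline for that test case.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT ACTUAL-OUTPUT   ASSIGN TO ACTOUT.
    SELECT EXPECTED-OUTPUT ASSIGN TO EXPOUT.
DATA DIVISION.
FILE SECTION.
FD  ACTUAL-OUTPUT.
01  ACTUAL-RECORD                 PIC X(400).
FD  EXPECTED-OUTPUT.
01  EXPECTED-RECORD               PIC X(400).
WORKING-STORAGE SECTION.
01  WS-EOF-FLAG                   PIC X(01) VALUE 'N'.
    88  END-OF-ACTUAL             VALUE 'Y'.
01  WS-RECORDS-COPIED             PIC S9(09) COMP VALUE 0.
PROCEDURE DIVISION.
0000-REBASELINE.
    OPEN INPUT ACTUAL-OUTPUT
         OUTPUT EXPECTED-OUTPUT
    PERFORM UNTIL END-OF-ACTUAL
        READ ACTUAL-OUTPUT
            AT END SET END-OF-ACTUAL TO TRUE
            NOT AT END
                MOVE ACTUAL-RECORD TO EXPECTED-RECORD
                WRITE EXPECTED-RECORD
                ADD 1 TO WS-RECORDS-COPIED
        END-READ
    END-PERFORM
    DISPLAY 'RECORDS RE-BASELINED: ' WS-RECORDS-COPIED
    CLOSE ACTUAL-OUTPUT EXPECTED-OUTPUT
    STOP RUN
    .

In practice the utility would be driven by the list of affected test cases and would record which change request authorized the new baseline; those details are omitted from this sketch.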
Conclusion
Founders National Bank's implementation of automated testing for their COBOL batch processing suite demonstrates that modern testing practices are not only applicable to mainframe COBOL environments but can deliver dramatic improvements in quality, speed, and developer productivity. The keys to success were investing in purpose-built tooling (test data generators, comparison engines, and harness frameworks) tailored to the unique characteristics of batch COBOL processing, integrating those tools into a CI/CD pipeline that provided rapid feedback, and building a culture where automated testing was seen as an essential part of the development process rather than an optional afterthought.
The financial return was compelling: a $1.2 million investment that eliminated an estimated $960,000 in annual production defect costs while simultaneously reducing time-to-delivery by 70%. For banks and other organizations operating large COBOL batch processing environments, the message is clear: automated testing is not a luxury reserved for modern technology stacks. It is an achievable, practical, and highly profitable investment for mainframe environments as well.