Case Study 35.2: SecureFirst's AI-Assisted Code Review — When the AI Found What the Experts Missed (and Missed What They Found)
Background
Yuki Tanaka had a problem that was becoming embarrassingly visible. SecureFirst Insurance processed 2.3 million claims per year through a COBOL system comprising 380 programs and over 1.4 million lines of code. The company's code review process was good — every production change was reviewed by at least one senior developer before deployment. But "good" wasn't enough anymore.
In Q3 2025, a subtle bug in the premium calculation module caused $4.7 million in overcharges to approximately 31,000 policyholders. The bug had been introduced in a code change three months earlier and had passed code review. The reviewer, a 22-year veteran, had correctly verified the business logic but missed that the change altered the order of COMPUTE operations in a way that changed the truncation behavior for a small subset of policy types.
After the incident, Yuki's CTO asked a pointed question: "If our best reviewer missed this, what else are we missing?"
Yuki proposed supplementing human code review with AI-assisted analysis, specifically targeting the types of errors that human reviewers consistently miss: numeric precision changes, control flow alterations, and data type interaction effects.
The Experiment
Rather than deploying AI review immediately to production, Yuki designed a controlled experiment. She selected forty recent code changes — twenty that had been successfully deployed (no issues found) and twenty that had been rejected or caused post-deployment issues. She ran all forty through an AI review pipeline without telling the AI which category each change belonged to.
The AI pipeline consisted of three analysis passes:
Pass 1: Semantic Difference Analysis
The AI compared the original and modified code and generated a detailed description of every behavioral change, including changes to:

- Data flow (which fields are read, written, or moved)
- Control flow (which paragraphs execute under which conditions)
- Arithmetic precision (truncation, rounding, field size effects)
- File I/O (reads, writes, opens, closes)
- External interfaces (CALL parameters, return codes)
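A minimal sketch of what Pass 1 might look like, written in Python with invented regexes and field names; a production tool would use a real COBOL parser rather than pattern matching, and would cover far more statement types than MOVE and COMPUTE targets.

```python
import re

# Hypothetical sketch: extract a coarse data-flow summary (which fields
# are read or written) from COBOL source, then diff the summaries of the
# original and modified versions. Coverage is deliberately minimal.
MOVE_RE = re.compile(r"\bMOVE\s+(\S+)\s+TO\s+(\S+)", re.IGNORECASE)
COMPUTE_RE = re.compile(r"\bCOMPUTE\s+(\S+)\s*=", re.IGNORECASE)

def data_flow(source):
    """Return the sets of fields read and written by the source."""
    reads, writes = set(), set()
    for src, dst in MOVE_RE.findall(source):
        reads.add(src.rstrip("."))
        writes.add(dst.rstrip("."))
    for dst in COMPUTE_RE.findall(source):
        writes.add(dst.rstrip("."))
    return {"reads": reads, "writes": writes}

def diff_flow(original, modified):
    """Describe how the modified version's data flow differs."""
    before, after = data_flow(original), data_flow(modified)
    return {
        "new_writes": after["writes"] - before["writes"],
        "dropped_writes": before["writes"] - after["writes"],
        "new_reads": after["reads"] - before["reads"],
    }

old = "MOVE WS-RATE TO WS-OUT."
new = "MOVE WS-RATE TO WS-OUT. COMPUTE WS-PREMIUM = WS-RATE * 2."
print(diff_flow(old, new)["new_writes"])  # {'WS-PREMIUM'}
```

Even this crude diff surfaces the kind of fact a tired reviewer skims past: the modified version writes a field the original never touched.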
Pass 2: Risk Assessment
For each behavioral change identified in Pass 1, the AI assessed risk:

- Low risk: Formatting changes, comment updates, cosmetic refactoring
- Medium risk: Logic changes with clear test paths, new validation checks
- High risk: Changes to arithmetic operations, control flow modifications that affect multiple paths, changes to data definitions referenced by multiple programs
- Critical risk: Changes to financial calculations, changes affecting data shared between programs, removal of validation checks
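One plausible way to implement the tiering, sketched in Python: map each change kind to a tier and take the worst tier as the overall rating. The change-kind labels are invented for illustration; unknown kinds default to medium rather than low, a conservative choice.

```python
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

# Hypothetical mapping from Pass 1 change kinds to the four tiers above.
TIER = {
    "formatting": Risk.LOW,
    "comment_update": Risk.LOW,
    "new_validation": Risk.MEDIUM,
    "arithmetic_change": Risk.HIGH,
    "shared_copybook_change": Risk.HIGH,
    "financial_calculation": Risk.CRITICAL,
    "validation_removed": Risk.CRITICAL,
}

def assess(changes):
    """Overall risk is the worst tier among the individual findings."""
    return max((TIER.get(c, Risk.MEDIUM) for c in changes),
               key=lambda r: r.value, default=Risk.LOW)

print(assess(["formatting", "arithmetic_change"]).name)  # HIGH
```

Taking the maximum rather than averaging matters: one critical finding should escalate the whole change, no matter how many cosmetic edits surround it.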
Pass 3: COBOL-Specific Checks
The AI ran a specialized checklist of COBOL-specific risk patterns:

- Field size changes in copybooks (ripple effects)
- COMP-3/COMP conversion changes
- REDEFINES affected by data layout changes
- Scope terminator changes or additions
- PERFORM range changes
- FILE STATUS check additions or removals
- Sign handling changes (signed vs. unsigned fields, e.g. PIC S9 vs. PIC 9)
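A checklist like this lends itself to a simple mechanical pre-pass over the diff before any AI reasoning runs. The sketch below, with an assumed unified-diff input and a deliberately short pattern list, flags added or removed lines that touch any of the risk patterns; a production tool would parse the source rather than grep it.

```python
import re

# Hypothetical Pass 3 sketch: scan a unified diff for COBOL-specific
# risk patterns. The pattern list mirrors the checklist above but is
# far from complete.
PATTERNS = [
    ("usage-change", re.compile(r"\bCOMP(-3)?\b")),
    ("redefines", re.compile(r"\bREDEFINES\b")),
    ("perform-range", re.compile(r"\bPERFORM\b.*\bTHRU\b")),
    ("file-status", re.compile(r"\bFILE\s+STATUS\b")),
    ("sign-handling", re.compile(r"\bSIGN\s+(LEADING|TRAILING)\b")),
]

def scan_diff(diff_lines):
    """Report which risk patterns appear on added or removed lines."""
    hits = []
    for line in diff_lines:
        # Only changed lines; skip the +++/--- file headers.
        if line[:1] in "+-" and not line.startswith(("+++", "---")):
            for name, pattern in PATTERNS:
                if pattern.search(line):
                    hits.append((name, line.strip()))
    return hits

diff = [
    "--- a/PREMCALC.cbl",
    "+++ b/PREMCALC.cbl",
    "-05  WS-IDX   PIC S9(4)  COMP-3.",
    "+05  WS-IDX   PIC S9(4)  COMP.",
]
for name, line in scan_diff(diff):
    print(name, "|", line)
```

A hit from this pass is not itself a finding; it tells the later AI analysis (and the human reviewer) where to look first.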
Results
The Twenty Successful Changes
Of the twenty changes that had been successfully deployed, the AI found:
- 14 with no significant findings — the AI agreed with the human reviewer that the changes were safe
- 4 with low-risk observations that the human reviewer had also noted
- 2 with potential issues that the human reviewer had not flagged
The two flagged issues were investigated:
Issue 1: A change to a report formatting paragraph had altered the order of MOVE statements in a way that worked correctly for the current data but would produce incorrect output if a particular account type (type 'X' — experimental) ever appeared in production. Type X accounts had been disabled three years earlier, so the risk was theoretical. The AI correctly identified the behavioral change but lacked the context to know that type X accounts were disabled. Verdict: True finding, theoretical risk, no action needed.
Issue 2: A performance optimization had changed a PERFORM VARYING loop to use a binary index (COMP) instead of a packed decimal index (COMP-3). The change was functionally correct and improved performance, but the AI noted that the binary index field was not initialized in all paths to the loop. In the current code, the loop was always entered from one path where the field was initialized. But if future changes added another entry path, the uninitialized index could cause unpredictable behavior. The developer added an explicit initialization as a defensive measure. Verdict: True finding, latent risk, preventive fix applied.
The Twenty Problematic Changes
Of the twenty changes that had caused issues, the AI:
- Correctly identified the bug in 12 cases — the AI flagged the specific code change that caused the production issue
- Identified the general area but not the specific bug in 4 cases — the AI flagged the changed section as high-risk but didn't pinpoint the exact issue
- Missed the issue entirely in 4 cases
The four missed issues were revealing:
Miss 1: Batch timing change. A change altered the order of two SORT steps in a JCL procedure, which caused a downstream program to process records in a different order. The AI analyzed the COBOL code changes (which were minimal) but didn't analyze the JCL change that accompanied them. The AI had no context about the batch chain dependencies.
Miss 2: DB2 plan bind interaction. A code change was correct in isolation, but the DB2 plan hadn't been rebound after a table structure change. The AI analyzed the COBOL source but had no visibility into DB2 catalog information.
Miss 3: Copy member version mismatch. Two programs were modified to use a new version of a shared copybook, but a third program that also used the copybook was not recompiled. The AI analyzed each changed program individually and found no issues; the bug was in the unchanged program that was now out of sync.
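Miss 3 points at a check that is straightforward to automate outside the AI entirely: when a copybook changes, enumerate every program that COPYs it and flag any that are not in the rebuild set. The sketch below assumes source sits in flat `*.cbl` files; the directory layout, file naming, and the idea of passing a recompile list are assumptions for illustration, not SecureFirst's actual tooling.

```python
import re
from pathlib import Path

COPY_RE = re.compile(r"\bCOPY\s+([A-Z0-9-]+)", re.IGNORECASE)

def users_of_copybook(src_dir, member):
    """All programs under src_dir that COPY the given member."""
    users = set()
    for path in Path(src_dir).glob("*.cbl"):
        if member in COPY_RE.findall(path.read_text()):
            users.add(path.stem)
    return users

def out_of_sync(src_dir, member, recompiled):
    """Programs using the copybook that were NOT recompiled."""
    return users_of_copybook(src_dir, member) - set(recompiled)
```

Had a check like this run on the Miss 3 change, the third, unrecompiled program would have appeared in the `out_of_sync` result before deployment.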
Miss 4: CICS transaction timeout. A change added a new DB2 query to an online transaction, increasing its processing time enough to trigger CICS transaction timeout under peak load. The AI correctly described the code change but had no performance model to predict the timeout.
The Quantitative Scorecard
| Metric | Human Review Only | AI Review Only | Human + AI |
|---|---|---|---|
| True positive rate (bug detection) | 65% | 60% | 85% |
| False positive rate | 5% | 18% | 8% |
| Average review time | 45 min | 3 min | 50 min |
| COBOL-specific issues found | 70% | 55% | 90% |
| System-level issues found | 60% | 15% | 65% |
The critical finding: neither human review nor AI review alone was as effective as the combination. The AI caught issues that humans missed (numeric precision, data flow changes), and humans caught issues that the AI missed (system-level interactions, business context, environmental dependencies).
Implementation: The Hybrid Review Process
Based on the experiment results, Yuki implemented a hybrid review process:
Step 1: Developer Self-Review with AI Assistance
Before submitting a change for review, the developer runs the AI analysis tool and addresses any findings. This catches routine issues early, reducing the reviewer's workload.
Step 2: AI Pre-Review
The change management system automatically runs the three-pass AI analysis when a change is submitted. The AI report is attached to the review request, highlighting any medium, high, or critical findings.
Step 3: Human Review with AI Context
The human reviewer conducts their traditional review but has the AI report available. They can focus their attention on areas the AI flagged as high-risk and use the AI's semantic difference analysis to understand complex changes more quickly.
Step 4: Escalation for Critical Findings
Any change with AI-flagged critical findings requires review by two senior developers rather than one, plus a mandatory regression test run.
Carlos's Integration Challenge
Carlos Mendez, SecureFirst's automation lead, was responsible for integrating the AI review tool into their existing change management workflow. The technical integration was straightforward — the AI tool accepted COBOL source files and produced a JSON report. The cultural integration was harder.
"Half the team thought the AI was going to replace them," Carlos recalled. "The other half thought it was a toy that would just slow them down. I had to show both groups they were wrong."
Carlos organized a workshop where he showed the forty-change experiment results without revealing the answer key. He asked the developers to predict which changes the AI would catch and which it would miss. The exercise was eye-opening: even the most skeptical developers were surprised that the AI caught the numeric precision issue that had caused the $4.7 million overcharge — the very issue that had motivated the project.
"When José saw that the AI flagged the exact truncation behavior he'd missed in his review, he stopped being skeptical and started asking how to write better prompts for the tool. That was the turning point."
Six-Month Outcomes
After six months of hybrid review:
- Production incidents from code changes: Down 41% compared to the previous year
- Mean time to review: Up 11% (50 minutes vs. 45 minutes) — a modest increase for a significant quality improvement
- Developer satisfaction: 72% rated the AI tool as "helpful" or "very helpful" in anonymous surveys (up from 34% at launch)
- False positive fatigue: The initial 18% false positive rate was reduced to 11% through prompt refinement and custom rules for SecureFirst-specific patterns
The most unexpected benefit was educational. Junior developers reported that reading the AI's semantic difference analysis taught them to think about COBOL code changes more rigorously. "The AI report is like having a really pedantic reviewer explain every possible consequence of your change," said one developer with three years of experience. "At first it was annoying. Then I realized I was learning to think that way myself."
The $4.7 Million Question
Yuki ran the AI review tool against the specific code change that had caused the original premium calculation bug. The AI flagged it in 47 seconds:
CRITICAL: Arithmetic precision change detected in paragraph 7200-CALC-ANNUAL-PREMIUM. The reordering of COMPUTE statements changes intermediate truncation behavior. Original code: COMPUTE WS-WORK-AMT = WS-BASE-RATE * WS-FACTOR-1 followed by COMPUTE WS-PREMIUM = WS-WORK-AMT * WS-FACTOR-2 (two-step with intermediate truncation to PIC S9(7)V99). Modified code: COMPUTE WS-PREMIUM = WS-BASE-RATE * WS-FACTOR-1 * WS-FACTOR-2 (single-step, truncation only on final result). These are NOT semantically equivalent when WS-WORK-AMT has fewer decimal places than the intermediate mathematical result.
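The difference the report describes can be reproduced outside COBOL. The sketch below uses Python's `decimal` module to mimic fixed-point storage: a field declared PIC S9(7)V99 holds two decimal places, so storing into it truncates anything beyond that. The rate and factor values are invented for illustration.

```python
from decimal import Decimal, ROUND_DOWN

def trunc2(value):
    """Truncate to 2 decimal places, like storing into PIC S9(7)V99."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_DOWN)

# Sample values (invented); the effect appears whenever the intermediate
# product carries more decimal places than WS-WORK-AMT can hold.
base_rate = Decimal("1234.56")
factor_1 = Decimal("1.0375")
factor_2 = Decimal("0.9825")

# Original two-step code: intermediate result truncated into WS-WORK-AMT.
ws_work_amt = trunc2(base_rate * factor_1)
premium_two_step = trunc2(ws_work_amt * factor_2)

# Modified single-step code: truncation only on the final result.
premium_one_step = trunc2(base_rate * factor_1 * factor_2)

print(premium_two_step, premium_one_step)  # 1258.43 1258.44
```

One cent per calculation sounds trivial; across 2.3 million claims per year and the affected policy types, it compounded into the $4.7 million overcharge.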
The reviewer who had missed this in the original review read the AI analysis and nodded. "That's exactly right. I looked at the math and thought, 'same formula, same result.' I didn't think about the intermediate truncation. The AI did."
Discussion Questions
1. The AI review had a higher false positive rate (18%) than human review (5%). How would you manage false positive fatigue in a team that reviews dozens of changes per week? At what false positive rate does the tool become counterproductive?

2. All four issues the AI missed were system-level problems (JCL ordering, DB2 binds, copybook version mismatch, CICS timeout). What does this tell you about the fundamental limitations of code-level AI analysis? How might these limitations be addressed?

3. Carlos faced cultural resistance from both skeptics and technophiles. Design a change management approach for introducing AI review tools to a team where the median experience level is 20+ years.

4. The AI caught the $4.7 million bug in 47 seconds. Does this mean the original reviewer was negligent? How should organizations handle accountability when AI tools reveal that human reviewers missed critical issues?

5. Junior developers reported that reading AI review reports was educational. Should AI review reports be used as a formal training tool? What are the risks of junior developers learning review practices primarily from AI output rather than from senior mentors?