Case Study 35.1: The Federal Benefits Administration's AI Documentation Project — Sandra's Race Against the Retirement Clock
Background
Sandra Okonkwo had been the lead systems architect at the Federal Benefits Administration for fourteen years, and she had watched the knowledge crisis build like a slow-motion avalanche. In 2024, when Congress mandated a comprehensive modernization review of all federal benefit systems, Sandra's team was required to produce complete documentation for 847 COBOL programs, 312 copybooks, and 189 JCL procedures — the backbone of a system that processed $340 billion in annual benefit payments.
The problem was stark: only 23% of the programs had any documentation at all, and most of what existed was outdated. Marcus Jefferson, the senior developer who understood the eligibility determination subsystem better than anyone alive, was eighteen months from mandatory retirement. Two other subject matter experts were within three years. The knowledge was walking out the door, and Sandra had neither the time nor the staff to document everything manually.
"We estimated eighteen months for manual documentation at our staffing level," Sandra told the modernization review board. "Marcus retires in eighteen months. If we start manually, we finish right as the last person who can verify the documentation leaves. That's not a plan — that's a prayer."
The Decision
Sandra proposed an AI-assisted documentation initiative — the first of its kind in a federal civilian agency of this scale. The proposal was controversial. The agency's CISO raised concerns about sending source code to commercial AI services. The union representing IT workers questioned whether the tools would be used to justify reducing headcount. The Inspector General's office wanted to know how AI-generated documentation would be treated in audits.
Sandra addressed each concern:
Security: All AI processing would use IBM's watsonx Code Assistant for Z, deployed on the agency's private cloud within their FedRAMP-authorized environment. No source code would leave the agency's security boundary.
Workforce: The AI tools would not replace any positions. Instead, they would be used to capture institutional knowledge before it retired, making the remaining workforce more effective. Sandra committed this to writing in a formal memorandum to the union.
Audit: All AI-generated documentation would carry provenance tags identifying it as AI-generated, the model version used, the reviewer who verified it, and the date of review. The agency's audit office would treat AI-generated, human-reviewed documentation the same as human-written documentation.
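The provenance tag Sandra committed to could be modeled as a simple record. This is a minimal sketch; the field names and values are illustrative, not the agency's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DocProvenance:
    """Provenance tag attached to each piece of AI-generated documentation.
    Field names are illustrative; the agency's real schema is not published."""
    source: str          # "ai-generated" vs. "human-written"
    model_version: str   # model that produced the draft
    reviewer: str        # person who verified the content
    review_date: str     # ISO-8601 date of the review

tag = DocProvenance(
    source="ai-generated",
    model_version="watsonx-code-assistant-z/1.x",  # hypothetical version string
    reviewer="M. Jefferson",
    review_date="2025-05-14",
)
record = asdict(tag)  # serializable form for the audit trail
```

Storing the tag alongside each document is what lets the audit office treat AI-generated, human-reviewed material the same as human-written material: the review step is verifiable.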
The project was approved in March 2025 with a twelve-month timeline and a $2.1 million budget — less than a third of the estimated cost for manual documentation.
Implementation
Phase 1: Inventory and Prioritization (Months 1-2)
Sandra's team cataloged every program, copybook, and JCL procedure in the system. They classified each by:
- Business criticality: Programs processing benefit payments ranked highest
- Knowledge risk: Programs where the sole SME was within two years of retirement ranked highest
- Complexity: Programs over 5,000 lines ranked higher due to the difficulty of manual comprehension
- Change frequency: Programs modified in the last two years ranked higher because active programs need current documentation
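The four criteria can be combined into a single priority score for ranking. The weighting below is hypothetical; the case study does not say how the team weighted the criteria.

```python
def priority_score(criticality, knowledge_risk, complexity, change_freq,
                   weights=(0.4, 0.3, 0.2, 0.1)):
    """Combine the four ranking criteria (each normalized to 0.0-1.0)
    into one score. The weights are hypothetical placeholders -- the
    case study does not publish the team's actual weighting."""
    factors = (criticality, knowledge_risk, complexity, change_freq)
    return sum(w * f for w, f in zip(weights, factors))

# The eligibility subsystem scored highest on all four criteria,
# so it outranks a low-risk utility program under any positive weighting.
eligibility = priority_score(1.0, 1.0, 1.0, 1.0)
utility_pgm = priority_score(0.2, 0.1, 0.3, 0.0)
```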
The eligibility determination subsystem — Marcus's domain — scored highest on all four criteria. It became the pilot.
Phase 2: Pilot — The Eligibility Subsystem (Months 2-4)
The eligibility subsystem consisted of 47 COBOL programs, 28 copybooks, and 15 JCL procedures that together determined whether applicants qualified for various federal benefit programs. The logic was intricate, reflecting forty years of legislative changes layered on top of each other.
Sandra's team developed prompt templates tailored to the eligibility subsystem (see Chapter 35, code/example-01 for the general templates). They included:
- Program summary template: Produced one-page overviews of each program's purpose, inputs, outputs, and key logic
- Copybook annotation template: Generated field-level comments for every copybook
- Data flow template: Traced key data elements (applicant income, household size, benefit amount) through the entire subsystem
- Interface contract template: Documented the file-based and CALL-based interfaces between programs
- JCL runbook template: Generated plain-English descriptions of each batch job
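A program-summary template of the kind listed above might be structured as follows. The wording is invented for illustration; the team's actual templates are the ones referenced in Chapter 35, code/example-01. Note that the glossary and statute fields reflect the context enrichment the team added after the pilot.

```python
# Illustrative prompt template -- not the team's actual wording.
PROGRAM_SUMMARY_PROMPT = """\
You are documenting a COBOL program for a federal benefits system.
Produce a one-page overview covering:
1. Purpose of the program in plain English
2. Inputs (files, copybooks, CALL parameters)
3. Outputs (files, reports, return codes)
4. Key logic and decision points

Business glossary: {glossary}
Relevant statutes: {statutes}

Source code:
{source_code}
"""

def build_summary_prompt(source_code, glossary="(none)", statutes="(none)"):
    """Fill the template with a program's source and business context."""
    return PROGRAM_SUMMARY_PROMPT.format(
        source_code=source_code, glossary=glossary, statutes=statutes)
```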
The AI processing took three days. The review took six weeks.
The Review Gauntlet
Marcus reviewed every piece of generated documentation for the eligibility subsystem. His corrections revealed consistent patterns:
Pattern 1: Legislative Context. The AI correctly described what the code did but never explained why. A paragraph that checked whether household income was below 138% of the federal poverty level was accurately described as "a comparison of household income against a threshold," but Marcus annotated it: "Implements the ACA Medicaid expansion income test (42 U.S.C. Section 1396a(e)(14)). The 138% includes the 5% income disregard mandated by CMS."
Pattern 2: Historical Workarounds. Several programs contained code blocks that appeared illogical to the AI. One program calculated a benefit amount, then immediately recalculated it using a different formula, then chose the higher of the two results. The AI flagged this as potential dead code or redundant calculation. Marcus explained: "The first calculation uses the pre-2012 formula for grandfathered beneficiaries. The second uses the post-2012 formula. The MAX ensures no beneficiary receives less than they would have under the old rules. This is a statutory hold-harmless provision."
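Marcus's explanation of the dual calculation can be sketched in code. The formulas and rates below are placeholders for illustration; the actual statutory formulas are not given in the case study. What matters is the shape of the logic that looked redundant to the AI.

```python
def benefit_amount(income, household_size,
                   pre_2012_rate=0.50, post_2012_rate=0.45):
    """Statutory hold-harmless sketch: compute the benefit under both the
    pre-2012 formula (grandfathered beneficiaries) and the post-2012
    formula, then pay the higher of the two. Rates and the base formula
    are invented placeholders, not the real statutory math."""
    base = max(0, 2000 * household_size - income)
    pre_2012 = base * pre_2012_rate    # old formula, kept for grandfathering
    post_2012 = base * post_2012_rate  # current formula
    # The MAX is the hold-harmless provision: no beneficiary receives
    # less than they would have under the old rules.
    return max(pre_2012, post_2012)
```

Without the legislative context, this genuinely does look like a redundant calculation followed by an arbitrary MAX, which is exactly why the AI flagged it.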
Pattern 3: Implicit Business Rules. The AI missed business rules encoded in data values rather than program logic. A field containing '99' in a date position didn't mean 1999 — it meant "ongoing/no end date." This convention was nowhere in the source code; it was institutional knowledge.
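Sentinel conventions like this one can only be documented if someone who knows them writes them down. A decoding sketch makes the problem concrete; the field layout and century-windowing cutoff here are assumed for illustration, not taken from the agency's files.

```python
def decode_end_year(yy_field: str):
    """Interpret a two-digit year field from a benefits record.
    '99' is a sentinel meaning 'ongoing / no end date' -- an institutional
    convention that appears nowhere in the source code itself."""
    if yy_field == "99":
        return None  # ongoing, no end date
    year = int(yy_field)
    # Hypothetical century window; the real cutoff is not in the case study.
    return 1900 + year if year >= 50 else 2000 + year
```

Note that naive century windowing would have read '99' as 1999, which is precisely the misreading the AI produced.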
Marcus's review revised the AI's accuracy for the eligibility subsystem from an initial estimate of 88% down to an actual 79% — lower than expected, because the eligibility logic was more policy-dependent than the simpler transaction processing systems.
Phase 3: Full-Scale Rollout (Months 4-10)
Armed with lessons from the pilot, Sandra refined the process:
- Enhanced prompts included statutory references and business glossaries to give the AI more context
- Pre-review checklists included specific items for legislative context, historical workarounds, and data value conventions
- Review pairs consisted of one technical reviewer and one business/policy analyst, reflecting the pilot finding that the AI's technical accuracy was consistently higher than its business accuracy
- Correction feedback was cataloged and used to improve prompts for similar programs
The team processed programs in waves of 50, prioritized by the risk matrix. Each wave took approximately three weeks: one day for AI generation, two weeks for review, and two to three days for publication and quality assurance.
Phase 4: Verification and Closure (Months 10-12)
By month ten, 791 of 847 programs had reviewed documentation. The remaining 56 were low-criticality utility programs that Sandra deprioritized to focus review resources on the high-risk programs.
The final accuracy metrics across all reviewed documentation:
| Category | AI Accuracy (Pre-Review) | Common Error Types |
|---|---|---|
| Program summaries | 84% | Missing legislative context, wrong business terminology |
| Copybook annotations | 91% | Incorrect interpretation of code values, missing REDEFINES notes |
| Data flow analysis | 78% | Missed cross-program flows, incorrect COMP-3 handling |
| Interface contracts | 82% | Missing error conditions, wrong record counts |
| JCL runbook entries | 86% | Incorrect restart procedures, wrong GDG handling |
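One recurring error type in the weakest category, "incorrect COMP-3 handling," concerns IBM packed-decimal fields. A minimal decoder shows why text-oriented analysis trips over them: the bytes are not character data at all. This is a standard COMP-3 decoding sketch, not code from the agency's system.

```python
def unpack_comp3(data: bytes, scale: int = 0) -> float:
    """Decode an IBM COMP-3 (packed decimal) field: two digits per byte,
    with the sign carried in the low nibble of the final byte
    (0xC or 0xF = positive, 0xD = negative)."""
    digits = []
    sign = 1
    for i, b in enumerate(data):
        hi, lo = b >> 4, b & 0x0F
        digits.append(hi)
        if i == len(data) - 1:
            sign = -1 if lo == 0x0D else 1  # sign nibble, not a digit
        else:
            digits.append(lo)
    value = int("".join(map(str, digits)))
    return sign * value / (10 ** scale)
```

A tool that reads `0x12 0x34 0x5C` as text, or ignores the implied decimal scale in the PICTURE clause, will trace the wrong values through the data flow.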
Outcomes
Knowledge capture: The documentation project captured 73% of the institutional knowledge Marcus identified as critical before his retirement. Marcus spent his last six months reviewing and annotating rather than writing from scratch, covering three times the ground he would have covered manually.
Cost: The project came in at $1.87 million — under budget. The comparable manual effort was estimated at $6.5 million over 24 months with contract staff.
Time: Twelve months instead of the estimated eighteen months for manual documentation. More importantly, the high-risk documentation (Marcus's domain) was complete eight months before his retirement date.
Ongoing maintenance: Sandra established a weekly documentation pipeline that regenerated documentation for any modified program. The AI draft, combined with a lightweight review by the developer who made the change, kept documentation current — a state the agency had never achieved before.
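A regeneration pipeline like the one described reduces to a change-detection loop: find the programs whose source changed since the last run, and queue only those for AI drafting and developer review. The hashing approach, file naming, and function below are illustrative assumptions, not the agency's implementation.

```python
import hashlib
import json
import pathlib

def find_modified_programs(source_dir, state_file="doc_state.json"):
    """Return the source members whose content hash changed since the last
    run -- the programs whose documentation should be regenerated and
    routed to the modifying developer for a lightweight review.
    File layout and state format are illustrative assumptions."""
    state_path = pathlib.Path(state_file)
    old = json.loads(state_path.read_text()) if state_path.exists() else {}
    new, modified = {}, []
    for src in sorted(pathlib.Path(source_dir).glob("*.cbl")):
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        new[src.name] = digest
        if old.get(src.name) != digest:
            modified.append(src.name)
    state_path.write_text(json.dumps(new, indent=2))  # persist for next run
    return modified
```

Run weekly, this yields exactly the property Sandra achieved: documentation work is proportional to change volume, not to system size.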
Audit outcome: The Inspector General's office audited the documentation process and gave it a satisfactory rating, noting that "the AI-assisted process produced documentation that met or exceeded the quality of manually-written documentation in comparable systems, with full provenance tracking."
Lessons Learned
- Business context is the hardest gap to close. The AI excels at technical description but struggles with "why" — the legislative, regulatory, and historical context that gives the code its meaning. Human expertise is irreplaceable for this layer.
- Start with the experts, not the code. Sandra's biggest regret was not interviewing Marcus in the first month to catalog the business rules and conventions before running the AI. If she'd had Marcus's institutional knowledge as input to the prompts, the AI accuracy would have been higher from the start.
- Pair technical and business reviewers. Technical accuracy and business accuracy are different skills. The AI's technical accuracy was consistently 5-10 points higher than its business accuracy.
- Build the maintenance pipeline from day one. Documentation decays. If you don't automate the regeneration process, the AI-generated documentation will be as stale as the manual documentation it replaced within two years.
- Respect the workforce. Sandra's explicit commitment that no positions would be eliminated, formalized in a memorandum to the union, was critical for getting veteran developers to participate willingly in the review process. Developers who fear for their jobs don't share institutional knowledge freely.
Discussion Questions
1. Sandra's project had a natural deadline (Marcus's retirement). How would you prioritize an AI documentation project without such a forcing function? What metrics would you use to justify the investment?
2. The AI accuracy for data flow analysis was the lowest at 78%. Why is data flow analysis particularly challenging for AI in COBOL systems? What additional context might improve accuracy?
3. The union's concern about workforce displacement was addressed through a formal memorandum. Is this sufficient in the long term? How might AI tools change the mainframe workforce composition over five to ten years, even without explicit position reductions?
4. The Inspector General accepted AI-generated, human-reviewed documentation. Should there be a different standard for AI-generated documentation in systems processing benefit payments for vulnerable populations? Why or why not?
5. Sandra chose to use an on-premises AI solution (IBM watsonx) to address security concerns. What trade-offs does this involve compared to using cloud-based AI services that might have more advanced models?