Case Study 2: Raj's Open Source Compliance Audit

AI-Generated Code and License Risk

Note

This case study is for educational purposes and does not constitute legal advice. Open source license compliance questions should be reviewed by qualified IP counsel for specific codebases and use cases.

Persona: Raj (Software Developer / Team Lead)
Domain: Software development, commercial product
Context: Proactive open source compliance audit for AI-generated code
Decision: Documented audit, targeted remediation, ongoing IP documentation practice
Outcome: IP defensibility established; ongoing code documentation practice adopted; IP counsel consultation completed


Background

Raj worked at a company that built and licensed B2B software. The product was a workflow management platform sold to enterprise customers under proprietary commercial licenses. Their customers were sophisticated buyers who sometimes asked about IP provenance and warranty.

Over the previous eighteen months, the development team had significantly increased their use of AI code generation tools. GitHub Copilot was in use across the team; some developers also used Claude and ChatGPT for specific coding tasks. The tools had improved development velocity noticeably.

Then the company's legal team sent a memo that prompted the audit Raj describes in this case study.


The Prompt

The legal team had received a question from a prospective enterprise customer. The customer was a large financial services company with sophisticated IP compliance processes. As part of their vendor due diligence, they asked the company to confirm: "Does your software contain any open source components licensed under the GPL or similar copyleft licenses, and if yes, how are those components managed?"

This was a standard question the company had answered before for its traditional open source dependencies — it maintained a software bill of materials (SBOM) that documented known open source components and their licenses. The problem was that the SBOM didn't account for AI-generated code, which might include content substantially derived from GPL-licensed training material.

The legal team's memo was not alarmist: "This is a theoretical risk that may or may not materialize into a real obligation. But we need to understand what our exposure is before we represent to customers that we have no copyleft obligations we haven't addressed."

Raj was asked to lead the technical side of the assessment.


The Audit Process

Raj worked through the audit in four steps.

Step 1: Documentation of AI-generated code scope.

He started by asking the question no one had formally answered: how much of the codebase was AI-generated, and which components?

The answer was not precisely knowable retrospectively. AI code generation doesn't tag itself in version control. But he could reconstruct a reasonable approximation: he reviewed commit histories for the past 18 months, interviewed team members about their AI tool use by codebase area, and identified the components where AI generation was most likely to have occurred.

The result: a rough map of high-AI-generation areas (mostly utility code, data processing pipelines, API integration modules) and low-AI-generation areas (core business logic, customer-facing features that had been heavily customized through developer judgment).
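A reconstruction like this can be partially mechanized. The sketch below is a hypothetical Python helper, not tooling described in the case study: it counts which files were touched by commits whose messages mention an AI tool. The keyword list and the time window are assumptions, and the approach misses commits where the tool was never mentioned, so interviews remain necessary.

```python
# Hypothetical sketch: approximate the AI-generated footprint by counting
# files touched by commits whose messages mention an AI tool. The keywords
# and 18-month window are assumptions, not conventions from the case study.
import subprocess
from collections import Counter

KEYWORDS = ["copilot", "ai-generated", "claude", "chatgpt"]

def files_from_log(log_output: str) -> list[str]:
    # `git log --name-only --pretty=format:` emits one file path per line,
    # with blank lines separating commits.
    return [line for line in log_output.splitlines() if line.strip()]

def ai_touched_files(since: str = "18 months ago") -> list[tuple[str, int]]:
    counts: Counter[str] = Counter()
    for keyword in KEYWORDS:
        log = subprocess.run(
            ["git", "log", f"--since={since}", "-i", f"--grep={keyword}",
             "--name-only", "--pretty=format:"],
            capture_output=True, text=True, check=True,
        ).stdout
        counts.update(files_from_log(log))
    return counts.most_common()
```

Sorting the counts gives a rough heat map of likely AI-touched areas; it is a starting point for interviews, not a substitute for them.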

Step 2: Risk prioritization.

Not all AI-generated code carries the same license risk. The copyleft risk is highest for code where:

  • The AI tool's training data included substantial GPL-licensed code in the same domain
  • The generated code closely mirrors the structure of known open source implementations
  • The specific algorithms or patterns used are distinctively associated with open source projects

Raj identified the highest-priority areas: data processing utilities that resembled common patterns from widely-used GPL-licensed libraries, and several parsing utilities that were similar to patterns from a specific well-known open source project.

Step 3: Code review for substantial similarity.

For the high-priority areas, Raj did a technical review comparing the AI-generated code to the open source projects that might have influenced it. He was not looking for exact copying — he was looking for structural similarity at the algorithm and implementation level that might indicate substantial derivation.

He found three segments that warranted concern:

  • A CSV parsing utility that was structurally very similar to a pattern from a GPL v2-licensed parsing library
  • A data transformation pipeline that closely resembled the architecture of a well-known GPL-licensed ETL framework
  • A string processing function that was nearly identical to an implementation in an Apache-licensed project (not copyleft, but he flagged it anyway for completeness)

He documented his findings with specificity: which files, which functions, what the structural similarity was, and what open source project it resembled.
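A first pass of this kind of review can be automated before the human comparison. The snippet below is an illustrative triage heuristic, not Raj's actual tooling: it scores token-level similarity between a candidate function and a reference open source implementation using Python's difflib, and anything above an arbitrary threshold is queued for manual review. It only surfaces candidates; substantial-similarity analysis in the legal sense is a separate, fact-specific inquiry.

```python
# Illustrative triage heuristic (assumed, not from the case study): score
# token-sequence similarity between candidate code and a reference
# implementation to surface segments for human and legal review.
import difflib
import re

def tokenize(source: str) -> list[str]:
    # Drop comments, then split into identifiers/numbers and punctuation
    # so that renamed variables still produce comparable token streams.
    source = re.sub(r"#.*", "", source)
    return re.findall(r"\w+|[^\w\s]", source)

def similarity(candidate: str, reference: str) -> float:
    # Ratio in [0, 1]; 1.0 means identical token sequences.
    matcher = difflib.SequenceMatcher(None, tokenize(candidate), tokenize(reference))
    return matcher.ratio()

REVIEW_THRESHOLD = 0.8  # arbitrary; calibrate against known-original code

def needs_review(candidate: str, reference: str) -> bool:
    return similarity(candidate, reference) >= REVIEW_THRESHOLD
```

Token-level comparison is deliberately coarse: it tolerates renaming and whitespace changes but still misses restructured derivations, which is why the human structural review remains the core of the step.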

Step 4: Legal counsel consultation.

Raj brought his documentation to the company's IP attorney. He had three questions:

  1. Do these similarities create copyleft obligations?
  2. If we replace these segments with clearly original code, does that adequately remediate the risk?
  3. What ongoing practices should we adopt for AI-generated code?

The IP attorney's assessment (summarized; not legal advice for any specific situation):

On the three flagged segments: The similarity Raj identified was noteworthy but did not definitively establish copyright infringement or copyleft attachment. Substantial similarity analysis for software copyright is a complex fact-specific inquiry. The question of whether AI-generated code that resembles GPL-licensed code creates copyleft obligations is a novel legal question without definitive precedent as of 2026. However:

  • The CSV parsing utility was the highest risk segment, given the closeness of the structural similarity
  • The ETL pipeline was the second priority
  • The Apache-licensed string function was lower risk (Apache 2.0 does not have copyleft provisions)

Recommended approach: Rewrite the two highest-priority segments with deliberate differentiation from the open source patterns, using human-authored implementations that the company can clearly defend as original. This remediation strategy reduces risk without requiring a company-wide statement that prior code had any specific problem.

On ongoing practices: The attorney recommended:

  • Adding AI tool use to the company's standard SBOM process — documenting which code is AI-generated
  • Using GitHub Copilot's enterprise tier, which included IP indemnification commitments, for production code
  • Including an IP review step in the code review process for AI-generated code in core components
  • Consulting IP counsel annually on this evolving area


The Remediation

The two highest-priority segments were rewritten by Raj and a senior developer. They approached the rewrites deliberately: they designed the implementations from first principles without reference to the similar open source patterns, and documented the design rationale.

The rewrites took approximately two days of engineering time. The resulting code was different in structure and approach from both the AI-generated version and the open source patterns it had resembled.

The documentation package for the customer due diligence inquiry now included: a description of the AI code generation audit, the remediation steps taken, the company's ongoing IP documentation practice, and the enterprise tier tooling in use. The customer's procurement team accepted this documentation as satisfying their IP due diligence requirement.


The Ongoing Practice

After the audit, Raj's team adopted three ongoing practices:

1. AI code generation tagging in version control. Commits that include substantial AI-generated code are tagged in the commit message. This creates a searchable history and makes future audits tractable.
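One lightweight way to implement such tagging (an assumed convention, not one the case study specifies) is a git trailer line such as `AI-Assisted: yes` in the commit message, which an audit script or commit hook can check:

```python
# Assumed tagging convention: a trailer line "AI-Assisted: yes" in commit
# messages that include substantial AI-generated code.
def is_ai_assisted(commit_message: str) -> bool:
    return any(
        line.strip().lower() == "ai-assisted: yes"
        for line in commit_message.splitlines()
    )
```

Later audits can then enumerate tagged commits with `git log --grep='^AI-Assisted: yes'`, which is what makes the history searchable.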

2. IP review checkpoint for core components. When AI-generated code is added to components that are core to the product's licensed functionality, it goes through a brief IP review before merging — specifically checking whether it resembles known open source implementations in ways that warrant attention.

3. Enterprise tool requirement for production code. The company standardized on enterprise-tier AI tools with IP indemnification provisions for production code generation. Consumer tools (free tiers, personal accounts) are permitted for development experimentation but not for code that goes into the production codebase.

These practices added minimal overhead to the development workflow. The IP tagging took seconds per commit. The review checkpoint was integrated into the existing code review process and added approximately 15 minutes to the review for affected commits.


What the Audit Revealed Beyond the IP Question

One unexpected finding from the audit process: the documentation exercise revealed that the team had never systematically thought about the provenance of AI-generated code. They had treated AI tools as sophisticated autocomplete — which, in one sense, they are — without thinking about the intellectual property chain of the outputs.

This gap was not unique to IP: the same lack of systematic thinking about AI-generated code provenance would be relevant for security (was the AI-generated code reviewed for security vulnerabilities at the same standard as human-written code?), for correctness (was AI-generated code subject to the same test coverage requirements?), and for maintainability (was AI-generated code documented in ways that allowed the team to maintain it without depending on AI assistance?).

The IP audit became the catalyst for a broader conversation about code quality practices for AI-generated code that benefited the team beyond just IP compliance.


Lessons

1. AI-generated code has IP provenance questions that traditional SBOM processes don't address. Companies that maintain SBOMs for open source compliance need to extend their analysis to include AI-generated code if they are serious about IP due diligence.

2. Proactive audits before customer inquiries are far preferable to reactive ones. Raj's company ran the audit before the customer asked for full IP confirmation, which meant they had time to remediate before any representation was required. A reactive audit after a customer has received a potentially incorrect representation creates a more difficult situation.

3. The theoretical risk is real but manageable. Open source copyleft contamination through AI-generated code is a theoretical, not definitively established, legal risk. Treating it as manageable through reasonable practices (documentation, enterprise tooling, targeted review) is more calibrated than either ignoring it or treating all AI-generated code as compromised.

4. IP counsel consultation is appropriate for material commercial IP questions. Self-educated legal literacy is valuable for daily decisions. Material questions about the IP chain of title in commercially licensed software warrant the judgment of a qualified IP attorney.

5. IP documentation practices created beneficial side effects. The discipline of documenting AI-generated code improved code quality practices, security review processes, and test coverage discussions — benefits that extended beyond the original IP compliance motivation.


Related: Chapter 34, Section 3 (Open source AI licenses and code), Section 7 (Risk management framework), Section 8 (When to involve legal counsel)

Return to: Case Study 1: Elena's Client Data Crisis — When Confidential Information Almost Went Into ChatGPT