Case Study 01: Building an Autonomous PR Generator
An Agent That Reads Issues and Generates Pull Requests
Background
DevStream, a mid-sized SaaS company with a team of 40 engineers, maintained a monorepo containing a Python-based REST API, a React frontend, and a suite of shared libraries. The team used GitHub for version control and had a healthy backlog of issues---ranging from straightforward bug fixes to minor feature additions. Every sprint, developers spent a significant portion of their time on what they called "low-complexity, high-certainty" issues: tasks where the solution was clear, the changes were small, and the risk was low. Tasks like updating a deprecated function call across multiple files, adding a new field to an API response, or fixing off-by-one errors in pagination logic.
The engineering manager, Priya, estimated that these routine issues consumed roughly 25% of developer time each sprint. She asked the team's platform engineer, Marcus, to explore whether an AI coding agent could handle these tasks autonomously, freeing developers for more complex work.
The Challenge
Marcus needed to build an agent that could:
- Read a GitHub issue and understand the task
- Explore the relevant parts of the codebase
- Plan and implement a solution
- Write or update tests
- Run the test suite and fix any failures
- Open a pull request with a clear description
The agent needed to be safe (it must not break existing functionality), efficient (it should not waste API credits on unnecessary exploration), and transparent (developers reviewing the PR should understand what the agent did and why).
Architecture and Design
Marcus chose a three-layer architecture:
Layer 1: Issue Analyzer

The first layer received a GitHub issue and extracted structured information: the type of task (bug fix, feature, refactor), the affected components, relevant file paths mentioned in the issue, and acceptance criteria. This layer used an LLM call with a carefully crafted prompt:

```python
ISSUE_ANALYSIS_PROMPT = """Analyze this GitHub issue and extract:

1. Task type: bug_fix, feature, refactor, or documentation
2. Affected components: list of modules/files likely involved
3. Acceptance criteria: what must be true for this to be resolved
4. Estimated complexity: low, medium, or high
5. Key search terms: words/phrases to search the codebase for

Issue title: {title}
Issue body: {body}
Issue labels: {labels}

Return the analysis as structured JSON."""
```
If the estimated complexity was "high," the agent stopped and left a comment on the issue explaining that the task appeared too complex for autonomous resolution. This was the first guardrail.
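The analyzer's reply can be parsed into a typed object before that guardrail runs. A minimal sketch, assuming JSON field names that mirror the prompt (the `IssueAnalysis` dataclass and `should_proceed` helper are illustrative, not DevStream's actual code):

```python
import json
from dataclasses import dataclass


@dataclass
class IssueAnalysis:
    """Structured output of the Layer 1 prompt (field names assumed)."""
    task_type: str
    affected_components: list[str]
    acceptance_criteria: list[str]
    estimated_complexity: str
    key_search_terms: list[str]


def parse_analysis(raw: str) -> IssueAnalysis:
    """Turn the model's JSON reply into a typed object, failing loudly on missing keys."""
    data = json.loads(raw)
    return IssueAnalysis(**{f: data[f] for f in IssueAnalysis.__dataclass_fields__})


def should_proceed(analysis: IssueAnalysis) -> bool:
    """First guardrail: decline anything the model rates as high complexity."""
    return analysis.estimated_complexity != "high"
```

Parsing into a dataclass rather than passing raw JSON downstream means a malformed or drifting model response fails immediately at the boundary instead of deep inside the planner.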
Layer 2: Codebase Explorer and Planner

The second layer used the extracted information to explore the codebase and formulate a plan. It employed a focused exploration strategy:
- Search for files matching the affected components
- Read the most relevant files (limited to 10 files to control context size)
- Search for the key terms identified in Layer 1
- Read the existing test files for the affected components
- Generate a detailed implementation plan
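The "read the most relevant files" step can be approximated by scoring candidate files against the Layer 1 search terms and keeping only the top ten. A sketch under the assumption that candidate file contents are already loaded into memory (the real explorer worked through repository search tools):

```python
def rank_files(
    file_contents: dict[str, str],
    search_terms: list[str],
    limit: int = 10,
) -> list[str]:
    """Score each file by how often the key search terms appear in it,
    then return the top `limit` paths so context size stays bounded."""
    scored: list[tuple[int, str]] = []
    for path, text in file_contents.items():
        hits = sum(text.count(term) for term in search_terms)
        if hits:
            scored.append((hits, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:limit]]
```

Even a crude relevance score like this keeps the planner's context focused, which is exactly the property Lesson 2 below credits for the agent's cost efficiency.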
The plan was structured as an ordered list of file modifications:
```python
from dataclasses import dataclass


@dataclass
class FileModification:
    """A planned modification to a single file."""
    file_path: str
    modification_type: str  # "edit", "create", "delete"
    description: str
    dependencies: list[str]  # Other file modifications this depends on


@dataclass
class ImplementationPlan:
    """A complete plan for resolving an issue."""
    summary: str
    modifications: list[FileModification]
    test_modifications: list[FileModification]
    verification_steps: list[str]
```
Layer 3: Executor and Verifier

The third layer executed the plan step by step, running the test suite after each significant change. This was the core agent loop:

```python
def execute_plan(
    plan: ImplementationPlan, repo_path: str, issue_number: int
) -> ExecutionResult:
    """Execute an implementation plan with verification."""
    branch_name = f"agent/issue-{issue_number}"
    create_branch(branch_name)
    modifications_applied: list[FileModification] = []

    for modification in plan.modifications:
        apply_modification(modification)
        modifications_applied.append(modification)

        # Run targeted tests after each change
        relevant_tests = find_relevant_tests(modification.file_path)
        test_result = run_tests(relevant_tests)

        if not test_result.passed:
            # Attempt to fix up to 3 times
            fixed = attempt_fix(modification, test_result, max_attempts=3)
            if not fixed:
                rollback_to_last_good_state()
                return ExecutionResult(
                    status="partial_failure",
                    message=f"Failed to fix test failures for {modification.file_path}",
                )

    # Final verification: run the full test suite
    full_test_result = run_tests("all")
    if not full_test_result.passed:
        return ExecutionResult(
            status="test_failure",
            message="Full test suite failed after all modifications",
        )

    # Create the pull request
    pr = create_pull_request(
        branch=branch_name,
        title=generate_pr_title(plan),
        body=generate_pr_body(plan, modifications_applied),
        issue_number=issue_number,
    )
    return ExecutionResult(status="success", pr_url=pr.url)
```
Guardrails
Marcus implemented multiple layers of safety:
Scope limitation: The agent could only work on issues labeled agent-eligible. A human had to apply this label, ensuring someone had reviewed the issue and judged it appropriate for autonomous resolution.
File restrictions: The agent could not modify files in config/, .env*, deployment/, or any file containing credentials. It could only modify .py, .js, .tsx, .json, .yaml, and .md files.
Change size limits: If the agent's plan involved modifying more than 15 files or adding more than 500 lines of code, it would stop and request human review of the plan before proceeding.
Test requirements: The agent was required to run the full test suite before opening a PR. A PR was only opened if all tests passed.
Cost ceiling: Each issue resolution was budgeted at 200,000 tokens (approximately $3 at the team's API pricing). If the budget was exhausted, the agent stopped and reported what it had accomplished.
Rollback capability: Every change was made on a feature branch. If anything went wrong, the branch could simply be deleted with no impact on the main codebase.
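Several of these guardrails reduce to simple predicate checks that run before the executor touches a file or calls the model. A sketch using the limits stated above; the pattern lists and helper names are illustrative, not the team's actual code:

```python
import fnmatch

# File restrictions from the guardrails above (patterns assumed to be glob-style)
BLOCKED_PATTERNS = ["config/*", ".env*", "deployment/*"]
ALLOWED_EXTENSIONS = (".py", ".js", ".tsx", ".json", ".yaml", ".md")


def file_allowed(path: str) -> bool:
    """Block sensitive paths outright; allow only whitelisted file types."""
    if any(fnmatch.fnmatch(path, pattern) for pattern in BLOCKED_PATTERNS):
        return False
    return path.endswith(ALLOWED_EXTENSIONS)


def plan_within_limits(file_count: int, added_lines: int) -> bool:
    """Change size limits: beyond these, the agent must request human review."""
    return file_count <= 15 and added_lines <= 500


class TokenBudget:
    """Cost ceiling: a hard cap the agent checks before every model call."""

    def __init__(self, limit: int = 200_000):
        self.limit = limit
        self.spent = 0

    def record(self, tokens: int) -> None:
        self.spent += tokens

    def exhausted(self) -> bool:
        return self.spent >= self.limit
```

Keeping the guardrails as standalone predicates, outside the LLM loop, means they cannot be talked around by a model output; they fail closed regardless of what the agent plans.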
Results: The First Month
Marcus ran the agent on a trial basis for one month, targeting only issues labeled agent-eligible. The results:
Volume: 47 issues were labeled agent-eligible during the trial period.
Completion rate: The agent successfully opened PRs for 38 of 47 issues (81%). Of the 9 failures:
- 4 were due to the agent exceeding the complexity threshold and correctly declining
- 3 were due to test failures the agent could not resolve
- 2 were due to the agent misunderstanding the issue requirements
PR quality: Of the 38 PRs opened:
- 29 were merged with no modifications (76%)
- 7 required minor changes (variable naming, comment adjustments)
- 2 required significant rework and were ultimately completed by a human developer
Time savings: The 29 clean PRs represented tasks that would have taken developers an estimated 2--4 hours each. At an average of 3 hours per task, the agent saved approximately 87 developer-hours in the first month.
Cost: The agent consumed approximately 6.2 million tokens across all 47 attempts, costing approximately $93. At an internal developer cost of $75/hour, the 87 hours saved represented $6,525 in value---a roughly 70:1 return on the API cost.
Lessons Learned
1. Issue quality determines agent success. The most common cause of agent failure was ambiguous or incomplete issue descriptions. Marcus added an issue template specifically for agent-eligible issues that required specific fields: affected files, expected behavior, and acceptance criteria. After implementing the template, the completion rate rose from 81% to 89%.
2. Focused exploration beats exhaustive exploration. Early versions of the agent tried to understand the entire codebase before making changes. This consumed the token budget quickly and often confused the model with irrelevant information. The focused exploration strategy---search first, then read only relevant files---was both cheaper and more effective.
3. Incremental verification catches errors early. Running tests after each file modification rather than waiting until all changes were made was crucial. When a change broke something, the agent knew exactly which modification caused the problem and could focus its repair efforts.
4. The PR description is as important as the code. Developers reviewing agent-generated PRs needed to understand the agent's reasoning, not just the changes. Marcus invested significant effort in generating detailed PR descriptions that explained why each change was made, not just what changed. This dramatically improved reviewer confidence and reduced review time.
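A `generate_pr_body` helper built on this principle pairs each file change with the plan's stated rationale. A hedged sketch, since the study does not show DevStream's actual template (the parameters here are simplified to plain values):

```python
def generate_pr_body(
    summary: str, modifications: list[dict], issue_number: int
) -> str:
    """Render a reviewer-facing PR description: the why per file, not just the what."""
    lines = [f"Resolves #{issue_number}", "", "## Summary", summary, "", "## Changes"]
    for mod in modifications:
        # Each entry carries the plan's description, i.e. the reasoning behind the edit
        lines.append(
            f"- `{mod['file_path']}` ({mod['modification_type']}): {mod['description']}"
        )
    return "\n".join(lines)
```

Because the descriptions come straight from the implementation plan, the PR body reflects what the agent intended to do, which a reviewer can then check against what it actually did.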
5. Start with a narrow scope and expand gradually. Beginning with only agent-eligible issues gave the team confidence in the system before expanding its autonomy. After the successful trial, the team expanded the criteria for agent-eligible to include slightly more complex tasks.
Technical Deep Dive: The Self-Healing Test Loop
The most technically interesting aspect of the agent was its ability to fix its own test failures. When a test failed after a code change, the agent:
- Read the full test output, including the failure message and stack trace
- Identified whether the failure was in the agent's new code or in an existing test that the new code broke
- If the failure was in new code: analyzed the error, modified the code, and reran the test
- If the failure was in an existing test: analyzed whether the test's expectations needed updating (because the behavior intentionally changed) or whether the code change was incorrect
This three-attempt self-healing loop resolved test failures in 73% of cases. The remaining 27% were escalated as partial failures, with the agent documenting what it had tried and why it failed.
```python
def attempt_fix(
    modification: FileModification,
    test_result: TestResult,
    max_attempts: int = 3,
) -> bool:
    """Attempt to fix a test failure caused by a modification."""
    for attempt in range(max_attempts):
        analysis = analyze_test_failure(
            modification=modification,
            failure_output=test_result.output,
            attempt_number=attempt,
        )

        if analysis.fix_type == "modify_source":
            apply_source_fix(analysis.suggested_fix)
        elif analysis.fix_type == "update_test":
            apply_test_fix(analysis.suggested_fix)
        elif analysis.fix_type == "revert_and_retry":
            revert_modification(modification)
            alternative = generate_alternative_approach(modification)
            apply_modification(alternative)

        test_result = run_tests(find_relevant_tests(modification.file_path))
        if test_result.passed:
            return True

    return False
```
Long-Term Impact
Six months after the trial, the autonomous PR generator had become an integral part of DevStream's development workflow. The team expanded the system in several ways:
- Triage integration: New issues were automatically analyzed for agent eligibility, with the agent recommending the agent-eligible label based on complexity analysis
- Multi-language support: The agent was extended to handle TypeScript frontend issues in addition to Python backend issues
- Reviewer assignment: Agent-generated PRs were automatically assigned to the developer most familiar with the modified files, based on git blame analysis
- Feedback loop: When developers made changes to an agent-generated PR before merging, those changes were logged and used to improve future prompt engineering
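The git-blame-based reviewer assignment reduces to a frequency count once blame output has been aggregated per author. A sketch assuming that aggregation has already happened (the helper name and input shape are assumptions, not the team's implementation):

```python
from collections import Counter


def pick_reviewer(blame_by_file: dict[str, dict[str, int]]) -> str:
    """Given file -> {author: lines attributed by git blame}, return the author
    who owns the most lines across all files the agent modified."""
    totals: Counter = Counter()
    for counts in blame_by_file.values():
        totals.update(counts)
    return totals.most_common(1)[0][0]
```

Routing the PR to whoever owns the most affected lines means the reviewer most likely to spot a subtle regression is the one who sees the change first.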
The team estimated that the agent handled approximately 30% of their total issue volume, saving an average of 120 developer-hours per month. Perhaps more importantly, developer satisfaction surveys showed increased morale, as engineers spent less time on repetitive tasks and more time on challenging, creative work.
Reflection Questions
- What additional guardrails would you add if the agent were to handle issues labeled "medium" complexity?
- How would you modify the architecture to support a team working across multiple repositories?
- What metrics would you track to detect a gradual decline in agent quality over time?
- How would you handle issues that require changes to both the Python backend and the React frontend in a single PR?
- What is the risk of developers becoming over-reliant on the agent for routine tasks, and how would you mitigate it?