Case Study 2: Scaling from One to Ten Agents
Background
DevForge is a developer tools company that builds a cloud-based IDE (integrated development environment) used by over 50,000 developers. The platform consists of a React-based frontend, a Go backend providing the editor and workspace services, a Python-based AI service layer, and a PostgreSQL database. The engineering team has 35 developers across four squads: Editor, Infrastructure, AI Platform, and Growth.
Priya Sharma, the VP of Engineering, has been gradually introducing AI coding agents into DevForge's development workflow over the past year. The journey started with a single agent doing code generation tasks and evolved through several stages into a ten-agent system handling the majority of routine development work. This case study documents each stage of that evolution, the problems encountered, and the solutions that emerged.
Stage 1: The Single Agent (Months 1-3)
DevForge begins with a single AI coding agent used by individual developers as a productivity tool. Developers interact with the agent in their terminals, asking it to write functions, debug issues, and generate boilerplate code.
What Works
The single agent handles well-scoped tasks effectively:

- Writing new API endpoints following existing patterns
- Generating database migration scripts from schema descriptions
- Fixing bugs when the developer provides a clear reproduction case
- Writing unit tests for existing functions
Where It Breaks Down
Priya notices three recurring problems during the first quarter:
Problem 1: Context loss on large features. When a developer asks the agent to implement a feature spanning five or six files, the agent's quality degrades noticeably after the third file. It forgets design decisions from the first file and introduces inconsistencies. One developer reports that the agent designed a REST API with paginated list endpoints in the first file, then generated a client library in the fourth file that did not handle pagination.
Problem 2: Self-review blindness. Developers who ask the agent to "write the code and review it" receive universally positive reviews. The agent approves its own code because it shares the same assumptions and blind spots that produced the code. When a human reviewer examines agent-generated code, they find issues the agent missed in its self-review -- missing error handling, hardcoded values, inconsistent naming.
Problem 3: Test-implementation coupling. When the agent writes both the implementation and the tests in the same session, the tests tend to verify the implementation rather than the requirements. If the implementation has a subtle bug, the tests encode that bug as expected behavior. When a different developer later fixes the bug, the tests fail -- not because the fix is wrong, but because the tests were testing the bug.
These problems motivate the move to multiple agents.
Stage 2: The Core Four (Months 4-6)
Based on the multi-agent patterns described in Chapter 38, Priya's team sets up a four-agent system: Architect, Coder, Tester, and Reviewer. They use a sequential pipeline orchestrated by a Python script.
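A minimal sketch of such a sequential pipeline, assuming a hypothetical `run_agent(role, task)` helper that wraps the underlying model API (the case study does not describe the script's actual internals):

```python
# Hypothetical helper: in the real system this would call the model API
# with the role's tailored system prompt and return the agent's output.
def run_agent(role: str, task: str) -> str:
    return f"[{role} output for: {task}]"

def run_pipeline(feature_request: str) -> dict:
    """Run the four core agents in a fixed sequence, each stage
    consuming the previous stage's artifact."""
    design = run_agent("architect", feature_request)
    code = run_agent("coder", design)
    test_report = run_agent("tester", code)
    review = run_agent("reviewer", code + "\n" + test_report)
    return {"design": design, "code": code,
            "tests": test_report, "review": review}
```

The fixed ordering is what makes this pattern simple to build and, as the team discovers below, what makes it rigid.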
Configuration
Each agent receives a system prompt tailored to DevForge's stack and conventions:
Architect: Knows about DevForge's microservices architecture, the API gateway patterns, the event-driven communication between services, and the deployment topology. Produces design documents in a standard template that the team already uses for human-written design docs.
Coder: Has access to the full codebase and follows DevForge's style guide, which includes Go conventions for the backend, Python conventions for the AI layer, and TypeScript conventions for the frontend. Can read files and write new ones.
Tester: Writes tests using DevForge's testing frameworks (Go testing for backend, pytest for Python, Jest for TypeScript). Prompted to think adversarially about edge cases, race conditions, and failure modes specific to DevForge's distributed architecture.
Reviewer: Reviews against DevForge's internal code review checklist, which covers security (OWASP top 10 for web applications), performance (must handle 10,000 concurrent users), observability (all services must emit structured logs and metrics), and accessibility (WCAG 2.1 AA compliance for frontend changes).
Results
The four-agent system produces noticeably better results than the single agent:
- Design consistency: The Architect agent, with its focused system prompt, produces design documents that the human architects on the team describe as "80% of what we would write." The designs follow existing patterns and identify integration points that the single agent missed.
- Bug detection: The Tester agent, approaching the code adversarially, catches 40-50% more bugs than the single agent's self-testing. The decoupling of writing and testing eliminates the test-implementation coupling problem.
- Review quality: The Reviewer agent identifies security and performance issues that the single agent's self-review consistently missed. It catches an average of 1.2 critical issues per feature review.
New Problems
However, the four-agent system introduces its own challenges:
Problem 1: Pipeline rigidity. The sequential pipeline forces every task through all four stages, even when some stages are unnecessary. A simple typo fix goes through architecture review, implementation, testing, and code review -- wasting time and API costs on stages that add no value.
Problem 2: Cross-domain blindness. When a feature spans both the Go backend and the Python AI layer, no single agent has deep expertise in both. The Architect designs the interface between them, but the Coder sometimes implements the Go side with conventions that do not match the Python side's expectations. The agents are specialized by role but not by domain.
Problem 3: Feedback loop inefficiency. When the Tester finds bugs, the feedback goes back to the Coder, which fixes the code, and then the entire testing phase reruns. But the Reviewer often identifies issues that are already known from the testing phase, creating redundant feedback. The pipeline lacks coordination between the Tester and Reviewer.
Stage 3: Adding Specialist Agents (Months 7-9)
To address the limitations of the core four, Priya's team adds three specialist agents, bringing the total to seven.
Agent 5: The Security Agent
DevForge handles sensitive user data (code, credentials, API keys), making security critical. The Security agent is configured with knowledge of DevForge's security policies, the OWASP Top 10, and common vulnerability patterns in Go and Python web applications.
The Security agent runs in parallel with the Reviewer after implementation, focusing exclusively on:

- Authentication and authorization bypass risks
- Input validation and injection vulnerabilities
- Secrets management (hardcoded keys, credentials in logs)
- Data exposure in API responses
- Cross-service trust boundaries
Impact: In its first month, the Security agent identifies 8 security issues across 12 features -- issues that the general Reviewer missed because it was splitting attention across security, performance, maintainability, and style. Three of these are classified as critical: an API endpoint that did not validate the caller's permissions, a logging statement that included raw authentication tokens, and a file upload handler vulnerable to path traversal.
Agent 6: The Database Agent
DevForge's PostgreSQL database serves 50,000 users with complex query patterns. The Database agent specializes in:

- Schema design and migration safety (no locking migrations on large tables)
- Query optimization (explaining query plans, identifying missing indexes)
- Data integrity constraints
- Migration rollback safety
The Database agent runs during the design phase, in parallel with the Architect, reviewing any database-related aspects of the design and producing migration scripts.
Impact: The Database agent catches a migration that would have locked the workspaces table (12 million rows) for an estimated 4 minutes during deployment. It recommends a concurrent index creation strategy that reduces lock time to milliseconds. Without this catch, the migration would have caused a 4-minute outage during the next deployment.
Agent 7: The Performance Agent
With 10,000 concurrent users as the performance target, DevForge needs to catch performance issues before they reach production. The Performance agent analyzes code for:

- N+1 query patterns
- Missing caching opportunities
- Inefficient algorithms (quadratic loops, unnecessary allocations)
- Memory leaks in long-running Go services
- Frontend bundle size impact
The Performance agent runs in parallel with the Security agent and Reviewer.
Impact: The Performance agent identifies that a new search feature performs a regex match on every file in a user's workspace sequentially. For large workspaces (10,000+ files), this takes over 10 seconds. The agent recommends pre-indexing and using a trie-based search structure, reducing the query time to under 100 milliseconds.
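The shape of that fix can be illustrated with a simplified inverted index standing in for the production trie-based structure (the case study does not give the real implementation):

```python
from collections import defaultdict

# Naive approach: scan every file's contents on every query.
# Cost grows with total workspace size, which is why large
# workspaces took over 10 seconds.
def search_sequential(files: dict[str, str], term: str) -> list[str]:
    return [name for name, text in files.items() if term in text]

# Indexed approach: build a token -> files mapping once, then
# answer each query with a dictionary lookup.
def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for name, text in files.items():
        for token in text.split():
            index[token].add(name)
    return index

def search_indexed(index: dict[str, set[str]], term: str) -> set[str]:
    return index.get(term, set())
```

The index is rebuilt incrementally as files change; the per-query cost no longer depends on workspace size.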
Orchestration Changes
With seven agents, the sequential pipeline is no longer practical. The team adopts a hybrid orchestration pattern:
```
Phase 1 (Parallel):   Architect + Database Agent
        |
        v
Phase 2 (Sequential): Coder
        |
        v
Phase 3 (Parallel):   Tester + Reviewer + Security Agent + Performance Agent
        |
        v
Phase 4 (Sequential): Feedback aggregation and conflict resolution
        |
        v
Phase 5 (Sequential): Coder (fix pass, if needed)
```
The parallel execution in Phase 3 reduces total pipeline time by approximately 35% compared to running all four analysis agents sequentially.
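The fan-out in Phase 3 can be sketched with `asyncio`, assuming a hypothetical async `run_agent` wrapper around the model API; total wall-clock time is then bounded by the slowest agent rather than the sum of all four:

```python
import asyncio

# Hypothetical async wrapper around one agent's model API call.
async def run_agent(role: str, code: str) -> tuple[str, str]:
    await asyncio.sleep(0)  # stands in for the API round trip
    return role, f"[{role} findings]"

async def run_phase3(code: str) -> dict[str, str]:
    """Run the four analysis agents concurrently and collect
    their findings keyed by role."""
    roles = ["tester", "reviewer", "security", "performance"]
    results = await asyncio.gather(*(run_agent(r, code) for r in roles))
    return dict(results)
```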
Conflict Resolution in Practice
With four agents analyzing the same code in Phase 3, conflicts become common. The most frequent conflict pattern:
- The Performance Agent recommends caching a database query result for 60 seconds.
- The Security Agent flags the same cache as a risk because the cached data includes user permissions that could become stale, allowing a brief window where revoked permissions are still honored.
The team implements an evidence-based conflict resolution strategy. Both agents must provide severity ratings and specific evidence. In this case, the Security Agent's concern is rated CRITICAL (stale permissions could allow unauthorized access), while the Performance Agent's concern is rated WARNING (slow query, but functional). The security concern wins, and the coder implements a cache with a 5-second TTL as a compromise -- fast enough to help performance, short enough that stale permissions are not a meaningful risk.
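A minimal sketch of that resolution rule, assuming findings carry a severity rating and the originating agent's name (the field names are illustrative):

```python
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}

def resolve_conflict(findings: list[dict]) -> dict:
    """Pick the finding with the highest severity; ties go to the
    Security Agent, mirroring the team's security-wins default."""
    def key(finding):
        return (SEVERITY_RANK[finding["severity"]],
                finding["agent"] == "security")
    return max(findings, key=key)
```

In the caching example above, the Security Agent's CRITICAL finding outranks the Performance Agent's WARNING, so the security constraint drives the final design.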
The team tracks conflict patterns over time. After two months, they identify the three most common conflict types and create resolution templates that automate the decision for future occurrences of the same pattern.
Stage 4: Domain-Based Scaling (Months 10-12)
DevForge's feature work increasingly spans multiple technology domains. A new "AI Code Review" feature requires changes to the Go backend (new API endpoints), the Python AI layer (review model integration), the React frontend (review UI), and the database (review history storage). No single Coder agent can handle all four domains with expert-level quality.
Adding Domain-Specific Coders
The team adds three domain-specific Coder agents, bringing the total to ten:
Agent 8: Backend Coder (Go). Specializes in Go development with deep knowledge of DevForge's backend patterns: HTTP handler conventions, middleware chains, gRPC service definitions, and the custom error handling framework.
Agent 9: AI Layer Coder (Python). Specializes in Python development with knowledge of DevForge's AI pipeline: model serving with FastAPI, prompt management, embedding generation, and the retrieval-augmented generation (RAG) architecture.
Agent 10: Frontend Coder (TypeScript/React). Specializes in React development with knowledge of DevForge's component library, state management patterns (Zustand), styling conventions (Tailwind CSS), and accessibility requirements.
Hierarchical Orchestration
With ten agents, the orchestration must scale. The team adopts a hierarchical pattern where the Architect agent serves as the lead, decomposing features into domain-specific sub-tasks:
```
Feature Request
      |
      v
[Architect] -- produces overall design
      |
      +-- Backend sub-task  --> [Backend Coder]  --> [Tester] }
      |                                                       } parallel
      +-- AI Layer sub-task --> [AI Coder]       --> [Tester] } per domain
      |                                                       }
      +-- Frontend sub-task --> [Frontend Coder] --> [Tester] }
      |
      v
[Database Agent] -- reviews schema changes from all domains
      |
      v
[Security Agent + Performance Agent + Reviewer] -- parallel review
      |
      v
[Architect] -- integration review and final assembly
```
Each domain-specific Coder works on its sub-task independently, producing code that implements its portion of the Architect's design. The Tester agent runs against each domain's output separately, using the appropriate test framework for each language.
Integration Testing
The biggest challenge at this scale is integration. Each domain-specific Coder produces code that works in isolation, but the pieces must fit together. The team configures the Tester agent to run a second pass specifically focused on cross-domain integration:
- Does the frontend correctly call the new backend API endpoints?
- Does the backend correctly invoke the AI layer's new endpoints?
- Are the data formats consistent across service boundaries?
- Do the database migrations support all three services' requirements?
The integration testing pass catches a subtle mismatch: the Backend Coder defines a JSON response field as review_comments (snake_case, following Go conventions), while the Frontend Coder expects reviewComments (camelCase, following JavaScript conventions). The integration test fails on the deserialization, and the feedback loop routes the issue to the Backend Coder, which adds the appropriate JSON struct tags.
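A check of this kind is easy to automate. The sketch below normalizes the backend's snake_case field names to camelCase and reports any frontend expectations with no backend counterpart (the helper names are hypothetical, not the team's actual test code):

```python
def to_camel(snake: str) -> str:
    """Convert a snake_case name to camelCase."""
    head, *rest = snake.split("_")
    return head + "".join(part.title() for part in rest)

def check_field_names(backend_fields: set[str],
                      frontend_fields: set[str]) -> set[str]:
    """Return frontend field names with no matching backend field
    after normalizing the backend's snake_case to camelCase.
    A non-empty result indicates a serialization mismatch."""
    normalized = {to_camel(f) for f in backend_fields}
    return frontend_fields - normalized
```

Before the struct tags were added, `review_comments` was serialized as-is, so the frontend's `reviewComments` lookup failed; with the tags in place the normalized names line up.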
Dynamic Team Composition
Not every feature needs all ten agents. The team implements dynamic composition based on which files a feature touches:
```python
def select_agents(feature):
    agents = [architect, reviewer]  # Always present
    if feature.touches_go_files:
        agents.append(backend_coder)
    if feature.touches_python_files:
        agents.append(ai_coder)
    if feature.touches_frontend_files:
        agents.append(frontend_coder)
    if feature.touches_database:
        agents.append(database_agent)
    if feature.security_sensitive:
        agents.append(security_agent)
    if feature.performance_sensitive:
        agents.append(performance_agent)
    agents.append(tester)  # Always present
    return agents
```
A simple Go backend change might use only 4 agents (Architect, Backend Coder, Tester, Reviewer). A full-stack feature uses all 10. This keeps the coordination tax proportional to task complexity.
Scaling Challenges and Solutions
Challenge 1: Cost Management
At ten agents, pipeline costs escalate. A full-stack feature with all ten agents, including feedback loops, costs approximately $8-12 per run. With 15-20 features per week, the weekly agent cost is $120-240.
Solution: The team implements model tiering. The Architect and Security Agent use the most capable model for their complex reasoning tasks. The domain-specific Coders use a mid-tier model that produces high-quality code at lower cost. The Tester uses the mid-tier model. The Reviewer uses the capable model for its analytical work. This reduces per-run costs by approximately 40% without measurable quality degradation.
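One way to express that tiering is a simple role-to-model table with a conservative default (the model names here are placeholders, not the team's actual choices):

```python
# Illustrative tier assignment; "capable-model" and "mid-tier-model"
# stand in for whatever models the team actually uses.
MODEL_TIERS = {
    "architect": "capable-model",
    "security": "capable-model",
    "reviewer": "capable-model",
    "backend_coder": "mid-tier-model",
    "ai_coder": "mid-tier-model",
    "frontend_coder": "mid-tier-model",
    "tester": "mid-tier-model",
}

def model_for(role: str) -> str:
    # Default unknown roles to the capable model: trade cost for
    # safety rather than silently downgrading a new agent.
    return MODEL_TIERS.get(role, "capable-model")
```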
Challenge 2: Context Window Pressure
With ten agents producing artifacts, the total context passed between stages grows large. A full-stack feature's design document, three implementation outputs, and database migration can easily exceed 15,000 tokens -- consuming a significant portion of each downstream agent's context window.
Solution: The team implements aggressive context summarization. Each downstream agent receives a role-appropriate summary rather than the full upstream output. The Tester receives interface definitions and expected behaviors, not the full design rationale. The Security Agent receives the code and the security-relevant constraints, not the performance analysis. This reduces per-agent context consumption by 50-60%.
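The routing idea can be sketched as a per-role view over the upstream artifacts; the section names below are illustrative, not the team's actual schema:

```python
# Each downstream role sees only the artifact sections relevant to it.
ROLE_VIEWS = {
    "tester": ["interfaces", "expected_behavior"],
    "security": ["code", "security_constraints"],
    "performance": ["code", "perf_budget"],
    "reviewer": ["code", "design_summary"],
}

def context_for(role: str, artifacts: dict[str, str]) -> str:
    """Assemble a role-appropriate slice of the upstream artifacts
    instead of forwarding everything to every agent."""
    sections = ROLE_VIEWS.get(role, list(artifacts))
    return "\n\n".join(artifacts[s] for s in sections if s in artifacts)
```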
Challenge 3: Debugging Pipeline Failures
When a ten-agent pipeline fails at stage 7, finding the root cause is difficult. Is the failure due to a bad design from the Architect? An implementation error from the domain Coder? A flawed test from the Tester? A false positive from the Security Agent?
Solution: The team builds a pipeline dashboard that shows:

- Execution trace with timing for each agent
- Token usage and cost per agent
- Conflict log with resolutions
- Feedback loop count and content
- Full input and output for each agent (expandable on click)
The dashboard reduces debugging time from 30-45 minutes of log reading to 5-10 minutes of visual trace analysis.
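The per-agent trace records behind such a dashboard can be as simple as a wrapper around each agent invocation; this is a sketch, not the team's actual instrumentation:

```python
from dataclasses import dataclass
import time

@dataclass
class AgentTrace:
    """One row in the dashboard's execution trace."""
    role: str
    started: float
    finished: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0

    @property
    def duration(self) -> float:
        return self.finished - self.started

def trace_run(role, fn, *args):
    """Wrap one agent invocation, recording wall-clock timing
    for the dashboard alongside the agent's result."""
    trace = AgentTrace(role=role, started=time.monotonic())
    result = fn(*args)
    trace.finished = time.monotonic()
    return result, trace
```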
Challenge 4: Maintaining Consistency Across Domain Coders
Three different Coder agents producing code for the same feature can drift in style, naming, and conventions. The Backend Coder names a function CreateReview, the AI Coder names the equivalent create_review, and the Frontend Coder names it createReview. While each follows its language's conventions, the cross-service naming inconsistency causes confusion.
Solution: The Architect's design document now includes a "Cross-Domain Naming Convention" section that specifies the canonical name for each concept (e.g., "review" not "code_review" or "ai_review") and the expected transformation rules for each language. Each domain Coder's system prompt references this section.
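The transformation rules amount to deriving each language's identifier from one canonical concept name; a minimal sketch, assuming the canonical form is lower-case words separated by spaces:

```python
def canonical_to_language(name: str, language: str) -> str:
    """Derive a language-idiomatic identifier from a canonical
    concept name like "create review"."""
    words = name.split()
    if language == "go":          # exported Go identifier: PascalCase
        return "".join(w.title() for w in words)
    if language == "python":      # snake_case
        return "_".join(words)
    if language == "typescript":  # camelCase
        return words[0] + "".join(w.title() for w in words[1:])
    raise ValueError(f"unknown language: {language}")
```

With a rule like this referenced from each domain Coder's system prompt, `CreateReview`, `create_review`, and `createReview` are recognizably the same concept rather than three drifting names.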
Results After One Year
After twelve months of iterative development, DevForge's ten-agent system produces measurable improvements:
| Metric | Single Agent (Month 1) | Core Four (Month 6) | Ten Agents (Month 12) |
|---|---|---|---|
| Bugs found before merge | 30% of total | 65% of total | 88% of total |
| Security issues found | Rare | 60% catch rate | 92% catch rate |
| Avg. pipeline cost per feature | $0.40 | $1.80 | $5.20 (with tiering) |
| Avg. pipeline time | 3 min | 10 min | 14 min |
| Human review time per feature | 45 min | 25 min | 12 min |
| Production incidents per quarter | 8 | 4 | 1 |
The most significant impact is on production incidents. Going from 8 incidents per quarter to 1 saves far more engineering time and customer trust than the pipeline costs consume. The single production incident in the most recent quarter was an infrastructure issue unrelated to code quality.
Human review time drops from 45 minutes to 12 minutes because human reviewers now focus on business logic correctness and architectural judgment -- the things agents cannot yet do well -- rather than checking for security vulnerabilities, performance issues, test coverage, and style compliance, which the agents handle comprehensively.
Key Takeaways from the Scaling Journey
Scale incrementally. Starting with ten agents would have been overwhelming. Each stage solved specific problems identified from the previous stage, and the team had time to learn the coordination patterns before adding complexity.
Match agents to pain points. Every new agent was added because of a specific, documented problem. The Security Agent was added after security issues slipped through. The Database Agent was added after a near-miss with a locking migration. The domain-specific Coders were added when cross-domain features revealed quality gaps. Never add agents speculatively.
Invest in orchestration infrastructure. The orchestration layer -- dynamic team composition, parallel execution, conflict resolution, monitoring dashboard -- is as important as the agents themselves. Without it, ten agents would produce chaos rather than quality.
Monitor ruthlessly. Cost tracking, quality metrics, and execution traces are essential for understanding whether each agent earns its keep. Two months after adding the Performance Agent, the team evaluates its impact and confirms that it catches an average of 1.8 significant performance issues per week -- well worth its cost.
Maintain escape hatches. Not every feature runs through the ten-agent pipeline. Emergency hotfixes go through a single-agent fast path. Experimental prototypes skip the security and performance review. The system is flexible enough to match the process to the situation, not the other way around.