Chapter 38: Multi-Agent Development Systems

"One agent is a tool. A team of agents is a workforce." -- Emerging principle in AI-assisted software engineering


Learning Objectives

After completing this chapter, you will be able to:

  • Explain why multiple specialized agents outperform a single general-purpose agent for complex development tasks (Bloom's: Understand)
  • Design distinct agent roles -- architect, coder, tester, reviewer -- with clear responsibilities, capabilities, and boundaries (Bloom's: Create)
  • Analyze orchestration patterns (sequential pipeline, parallel execution, hierarchical delegation) and select the appropriate pattern for a given workflow (Bloom's: Analyze)
  • Apply inter-agent communication strategies including shared context, message passing, and artifact exchange (Bloom's: Apply)
  • Evaluate conflict resolution strategies when agents produce contradictory recommendations (Bloom's: Evaluate)
  • Create automated workflows that take a development task from issue to pull request using multiple coordinated agents (Bloom's: Create)
  • Design quality assurance pipelines that use cross-agent verification to catch errors no single agent would find (Bloom's: Create)
  • Analyze the scaling challenges that emerge as agent teams grow beyond four or five members (Bloom's: Analyze)
  • Apply monitoring and observability practices to track agent performance, identify bottlenecks, and debug failures (Bloom's: Apply)
  • Create a complete multi-agent development pipeline that integrates all concepts from this chapter (Bloom's: Create)

Prerequisites

This chapter assumes you have completed:

  • Chapter 36: AI Coding Agents -- You understand how a single AI agent operates, including tool use, planning loops, and the agent execution cycle.
  • Chapter 37: Custom Tools and MCP Servers -- You know how to build custom tools that agents can invoke and how MCP (Model Context Protocol) enables standardized tool access.

If you skipped those chapters, you can still follow this material, but the concepts here build directly on the single-agent foundations and tool-building skills established in Chapters 36 and 37.


Introduction

In Chapter 36, you learned how a single AI coding agent operates: it receives a task, reasons about how to accomplish it, uses tools to read files, write code, and run tests, and iterates until the task is complete. That single-agent model is remarkably powerful. It can handle many development tasks that would take a human developer hours.

But single agents have limits.

Ask a single agent to design a system architecture, implement all the components, write comprehensive tests, and review the code for security vulnerabilities -- all in one session -- and you will start to see problems. The agent's context window fills up. Its attention drifts from architectural consistency to low-level implementation details. It forgets design decisions it made earlier. It lacks the focused expertise that comes from specialization.

This is the same problem that human software teams solved decades ago. No single developer does everything. Teams have architects, implementers, testers, and reviewers. Each role brings a different perspective, a different set of priorities, and a different kind of expertise. The architect thinks about system-level trade-offs. The implementer thinks about clean, efficient code. The tester thinks about what could go wrong. The reviewer thinks about maintainability and standards compliance.

Multi-agent development systems apply this same principle to AI. Instead of one agent doing everything, you create a team of specialized agents that collaborate on a shared task. Each agent has a defined role, a specific set of tools, and a focused system prompt that keeps it on track. An orchestrator coordinates their work, manages the flow of information between them, and resolves conflicts when they arise.

This chapter teaches you how to design, build, and operate multi-agent development systems. You will learn how to define agent roles, choose orchestration patterns, implement inter-agent communication, handle conflicts, automate end-to-end workflows, ensure quality, scale agent teams, and monitor everything. By the end, you will build a complete multi-agent pipeline that takes a feature request and produces a reviewed, tested pull request -- with minimal human intervention.


38.1 Why Multiple Agents?

The Limits of a Single Agent

A single AI coding agent, no matter how capable, faces three fundamental constraints that limit its effectiveness on complex tasks.

Context window saturation. Every agent operates within a finite context window. As a task grows in complexity, the agent must hold more information simultaneously: the task description, relevant source files, generated code, test results, error messages, and its own reasoning history. Eventually, this information exceeds the window's capacity, and the agent begins to lose track of decisions it made earlier.

Attention diffusion. Even within a context window that is not yet full, agents perform better when focused on a narrow task than when juggling multiple concerns simultaneously. An agent that is simultaneously thinking about API design, database schema, error handling, and test coverage will produce lower-quality output in each area than four agents each focused on one concern.

Role confusion. When a single agent plays multiple roles, it often fails to maintain the critical distance that each role requires. An agent that wrote the code is poorly positioned to review it critically. An agent that designed the architecture is biased toward defending its design rather than testing it rigorously. Separation of roles creates the adversarial tension that catches bugs.

Key Concept: The Specialization Advantage

Multiple specialized agents outperform a single generalist agent for the same reason that a team of human specialists outperforms a single generalist: each specialist maintains deeper expertise in its domain, applies more focused attention to its tasks, and brings a distinct perspective that acts as a check on the others. The total capability of the team exceeds the sum of its parts because specialization enables depth and separation enables independence.

The Division of Labor Analogy

Consider how a professional software team operates. The product manager writes requirements. The architect designs the system. The developer implements it. The QA engineer tests it. The code reviewer checks quality. Each person is optimized for their role. The product manager does not write code. The developer does not write test plans. The reviewer does not implement features.

Multi-agent systems replicate this division of labor. Each agent is given a system prompt that focuses it on a specific role, a set of tools relevant to that role, and instructions that prevent it from overstepping its boundaries. The architect agent does not write implementation code. The coder agent does not make architectural decisions. The tester agent does not modify the code under test.

When Multiple Agents Make Sense

Not every task requires multiple agents. A simple function generation, a quick bug fix, or a small script is perfectly suited to a single agent. Multiple agents add coordination overhead, and that overhead is only justified when the benefits of specialization outweigh it.

Multiple agents make sense when:

  • The task spans multiple development phases (design, implementation, testing, review)
  • The codebase is large enough that no single context window can hold all relevant information
  • The task requires adversarial perspectives (someone to write the code and someone else to try to break it)
  • You need parallel execution to reduce total completion time
  • The task involves cross-cutting concerns (security, performance, accessibility) that benefit from specialist attention
  • You want reproducible workflows that can be automated and run repeatedly

When to Use It

Start with a single agent. When you notice that the agent is losing context, producing inconsistent results, or failing to catch its own mistakes, that is your signal to consider splitting the work across multiple agents. The goal is not to use as many agents as possible -- it is to use the minimum number of agents needed to produce reliable results.

Quantifying the Improvement

In practice, multi-agent systems show measurable improvements over single-agent approaches in several areas:

  • Defect detection rate: A separate tester agent catches 40-60% more bugs than an agent that tests its own code, because it approaches the code without knowledge of implementation shortcuts.
  • Architectural consistency: An architect agent that reviews implementation against the original design catches drift that an implementing agent would not notice.
  • Code review quality: A reviewer agent with a focused review prompt identifies more style violations, security issues, and maintainability concerns than an agent that reviews its own work.
  • Total throughput: Parallel execution of independent tasks (such as writing tests and documentation simultaneously) can reduce total pipeline time by 30-50%.

These numbers vary by task complexity, agent configuration, and model capability, but the general pattern holds: specialization and separation improve quality.


38.2 Agent Role Design: Architect, Coder, Tester, Reviewer

The Core Four Roles

The most common multi-agent development team consists of four roles that mirror a traditional software team. Each role has a distinct purpose, a specific perspective on the codebase, and a unique set of tools.

The Architect Agent designs the system. It receives a high-level requirement and produces a technical design document that specifies the components, their interfaces, the data flow, and the key design decisions. The architect thinks about scalability, maintainability, and consistency with existing patterns in the codebase.

System Prompt (Architect Agent):
You are a senior software architect. Your role is to design systems,
not to implement them. Given a feature request, you produce:
1. A component breakdown with responsibilities
2. Interface definitions (function signatures, class APIs)
3. Data flow diagrams (described textually)
4. Key design decisions with rationale
5. Constraints and requirements for implementation

You do NOT write implementation code. You do NOT write tests.
You focus on structure, interfaces, and design trade-offs.
When you see an existing codebase, analyze its patterns and ensure
your design is consistent with established conventions.

The Coder Agent implements the design. It receives the architect's design document and produces working code that matches the specified interfaces and follows the design decisions. The coder focuses on clean, efficient, correct implementation. It follows the project's coding standards and uses established patterns.

System Prompt (Coder Agent):
You are a senior software developer. Your role is to implement
designs, not to create them. Given an architectural design document:
1. Implement each component exactly as specified
2. Follow the interfaces defined by the architect
3. Write clean, well-documented code with type hints
4. Follow PEP 8 and the project's existing conventions
5. Handle edge cases and errors gracefully

You do NOT redesign the architecture. If you believe the design
has a flaw, document it as a comment but implement as specified.
You do NOT write tests. You focus purely on implementation quality.

The Tester Agent writes and runs tests. It receives the architect's design and the coder's implementation and produces comprehensive tests that verify correctness, edge cases, and error handling. The tester approaches the code as an adversary -- its job is to find ways the code can fail.

System Prompt (Tester Agent):
You are a senior QA engineer. Your role is to test code, not to
write or design it. Given a design document and implementation:
1. Write unit tests for every public function and method
2. Write integration tests for component interactions
3. Test edge cases, boundary conditions, and error paths
4. Test with invalid, empty, null, and unexpected inputs
5. Verify the implementation matches the design specification

You do NOT modify the implementation code. If tests fail, report
the failures with details. Your goal is to find bugs, not fix them.
Think adversarially: how can this code break?

The Reviewer Agent checks quality. It receives the complete package -- design, implementation, and test results -- and performs a thorough code review. The reviewer checks for style consistency, security vulnerabilities, performance issues, maintainability concerns, and adherence to best practices.

System Prompt (Reviewer Agent):
You are a senior code reviewer. Your role is to evaluate code
quality, not to write code. Given a design, implementation, and
test results, you review for:
1. Code quality: naming, structure, readability, DRY compliance
2. Security: injection risks, authentication gaps, data exposure
3. Performance: unnecessary complexity, N+1 queries, memory leaks
4. Maintainability: coupling, cohesion, documentation completeness
5. Standards compliance: PEP 8, type hints, docstring coverage

Provide specific, actionable feedback. Reference exact line numbers.
Categorize issues as: CRITICAL (must fix), WARNING (should fix),
or SUGGESTION (consider fixing). Do NOT rewrite the code yourself.

Designing Effective System Prompts

The quality of a multi-agent system depends heavily on the quality of each agent's system prompt. A well-designed system prompt does four things:

  1. Defines the role clearly. The agent must know exactly what it is responsible for and what it is not responsible for. Ambiguity in role definition leads to agents stepping on each other's work.

  2. Sets behavioral boundaries. Each agent must know what actions are within its scope and what actions are off-limits. The architect does not write code. The coder does not redesign the architecture. The tester does not fix bugs. These boundaries create the separation that makes multi-agent systems effective.

  3. Specifies output format. Each agent must produce output in a format that downstream agents can consume. The architect produces a design document with a specific structure. The coder produces code files. The tester produces test results. Standardizing these formats enables reliable handoffs.

  4. Establishes quality criteria. Each agent needs clear criteria for when its work is "done." The architect is done when all components, interfaces, and design decisions are documented. The coder is done when all components compile and pass basic sanity checks. The tester is done when all test cases are written and executed.

Practical Tip: The "Do NOT" Clause

Including explicit "Do NOT" instructions in system prompts is surprisingly effective at preventing role bleed. Without these negative constraints, agents will naturally try to be helpful by expanding their scope -- an architect that starts writing implementation code, a tester that starts fixing bugs it finds. Explicit prohibitions keep each agent focused on its assigned role.

Specialized Roles Beyond the Core Four

As systems grow more complex, you may need agents beyond the core four. Common additional roles include:

  • Security Agent: Focuses exclusively on security analysis -- OWASP Top 10 vulnerabilities, authentication flows, input validation, data encryption, and access control.
  • Documentation Agent: Produces user-facing documentation, API references, README files, and inline documentation based on the implemented code.
  • DevOps Agent: Handles deployment configuration, CI/CD pipeline setup, containerization, and infrastructure-as-code.
  • Performance Agent: Analyzes code for performance bottlenecks, suggests optimizations, and writes performance benchmarks.
  • Database Agent: Designs schemas, writes migrations, optimizes queries, and ensures data integrity constraints.

Each specialized role follows the same design principles: clear responsibility boundaries, specific tool access, focused system prompts, and standardized output formats.

Agent Capability Configuration

Beyond the system prompt, each agent needs appropriate tool access. The principle of least privilege applies: each agent should have access only to the tools it needs for its role.

Agent     | Tools Available                                        | Tools Denied
----------|--------------------------------------------------------|-------------------------------------------------
Architect | File reader, codebase search, design doc writer        | Code editor, test runner, deployment tools
Coder     | File reader, code editor, linter, formatter            | Test runner, deployment tools, design doc writer
Tester    | File reader, test runner, coverage tool                | Code editor, deployment tools
Reviewer  | File reader, linter, static analyzer, codebase search  | Code editor, test runner, deployment tools

This tool restriction is not just about preventing mistakes -- it reinforces role focus. An agent that cannot edit code will not be tempted to fix issues it finds during review; instead, it will produce clear, actionable feedback for the agent whose role is to make changes.
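One way to enforce this least-privilege policy is a simple allow-list consulted before every tool invocation. The role and tool names below are illustrative stand-ins, not a fixed API:

```python
# Illustrative allow-list: each role may invoke only its listed tools.
AGENT_TOOLS = {
    "architect": {"read_file", "search_codebase", "write_design_doc"},
    "coder":     {"read_file", "edit_code", "run_linter", "run_formatter"},
    "tester":    {"read_file", "run_tests", "measure_coverage"},
    "reviewer":  {"read_file", "run_linter", "static_analysis", "search_codebase"},
}

def authorize(role: str, tool: str) -> bool:
    """Return True only if the tool is in the role's allow-list."""
    return tool in AGENT_TOOLS.get(role, set())
```

An orchestrator would call `authorize` in its tool-dispatch path and reject (or log) any out-of-role request, turning the table above into an enforced policy rather than a convention.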


38.3 Orchestration Patterns

What Is an Orchestrator?

An orchestrator is the component that coordinates the work of multiple agents. It decides which agent runs next, what information each agent receives, how results flow between agents, and what happens when something goes wrong. The orchestrator itself may be an AI agent, a simple script, or a combination of both.

Think of the orchestrator as a project manager. It does not write code, design systems, or run tests. It assigns tasks, tracks progress, routes information, and makes decisions about workflow.

Pattern 1: Sequential Pipeline

The simplest orchestration pattern is a sequential pipeline where agents execute one after another, each building on the output of the previous agent.

Feature Request
    |
    v
[Architect Agent] --> Design Document
    |
    v
[Coder Agent] --> Implementation Code
    |
    v
[Tester Agent] --> Test Results
    |
    v
[Reviewer Agent] --> Review Report
    |
    v
Final Output (or loop back for fixes)

Strengths:

  • Simple to implement and reason about
  • Clear handoff points between agents
  • Easy to debug because the flow is linear
  • Each agent has the full output of all previous agents available

Weaknesses:

  • Total execution time is the sum of all agent execution times
  • A failure at any stage blocks the entire pipeline
  • No parallelism, even for independent tasks
  • Later agents wait idle while earlier agents work

When to use it: For straightforward features where each phase depends on the previous one and total execution time is not a concern. This is the best pattern to start with when building your first multi-agent system.
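A minimal sequential orchestrator is just a loop over stages, each stage reading the accumulated workspace and adding its own artifact. The lambdas below are stubs standing in for real LLM-backed agents:

```python
# Minimal sequential pipeline: each stage consumes the workspace
# built up so far and stores its output under its role name.
def run_pipeline(task: str, stages: list) -> dict:
    workspace = {"task": task}
    for name, agent in stages:
        workspace[name] = agent(workspace)  # stage output keyed by role
    return workspace

# Usage with stub agents (a real agent would call a model + tools):
stages = [
    ("design", lambda ws: f"design for {ws['task']}"),
    ("code",   lambda ws: f"code implementing {ws['design']}"),
    ("tests",  lambda ws: f"tests for {ws['code']}"),
    ("review", lambda ws: f"review of {ws['code']}"),
]
result = run_pipeline("add login", stages)
```

Because each stage sees everything produced before it, the loop directly realizes the "full output of all previous agents" property of this pattern.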

Pattern 2: Parallel Execution

When tasks are independent, agents can execute simultaneously. This is most useful when multiple agents need to analyze the same input without depending on each other's output.

Feature Request
    |
    v
[Architect Agent] --> Design Document
    |
    v
[Coder Agent] --> Implementation Code
    |
    +------------------+------------------+
    |                  |                  |
    v                  v                  v
[Tester Agent]  [Reviewer Agent]  [Security Agent]
    |                  |                  |
    v                  v                  v
Test Results     Review Report    Security Report
    |                  |                  |
    +------------------+------------------+
    |
    v
[Aggregator] --> Combined Feedback

In this pattern, the tester, reviewer, and security agent all work on the same implementation simultaneously, producing independent reports that are merged at the end.

Strengths:

  • Reduced total execution time (parallelized phases run concurrently)
  • Independent agents cannot interfere with each other
  • Naturally scales to additional parallel agents

Weaknesses:

  • More complex to implement (requires concurrent execution management)
  • Agents may produce conflicting recommendations that need resolution
  • Resource usage spikes during parallel phases
  • Aggregation of results adds complexity

When to use it: When you have multiple independent analysis tasks (testing, reviewing, security scanning) that can run against the same code simultaneously.
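The fan-out step can be sketched with Python's standard thread pool; the analysis agents here are stubs, since each would normally be a separate model call:

```python
# Sketch: run independent analysis agents concurrently and collect
# their reports for a downstream aggregator.
from concurrent.futures import ThreadPoolExecutor

def run_parallel(code: str, agents: dict) -> dict:
    """Submit every agent against the same input; return reports by role."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, code) for name, fn in agents.items()}
        return {name: f.result() for name, f in futures.items()}

reports = run_parallel("def handler(request): ...", {
    "tester":   lambda c: "test results",
    "reviewer": lambda c: "review report",
    "security": lambda c: "security report",
})
```

Threads are sufficient here because agent work is I/O-bound (waiting on API responses); the aggregator then merges the three independent reports.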

Pattern 3: Hierarchical Delegation

In hierarchical delegation, a lead agent decomposes a complex task into subtasks and delegates each subtask to a specialized worker agent. The lead agent then assembles the results.

Complex Feature Request
    |
    v
[Lead Architect Agent]
    |
    +------------+------------+------------+
    |            |            |            |
    v            v            v            v
[Backend     [Frontend   [Database    [API
 Agent]       Agent]      Agent]      Agent]
    |            |            |            |
    v            v            v            v
Backend      Frontend     Schema &     API Spec
Code         Code         Migration
    |            |            |            |
    +------------+------------+------------+
    |
    v
[Lead Architect Agent] --> Integration & Review

Strengths:

  • Handles complex tasks that span multiple domains
  • Each worker agent has a narrow, well-defined scope
  • The lead agent maintains the big picture while workers handle details
  • Naturally mirrors how human tech leads delegate work

Weaknesses:

  • The lead agent is a single point of failure
  • Decomposition quality determines overall quality
  • Integration of subtask results can be challenging
  • Communication overhead increases with the number of workers

When to use it: For large features that involve multiple components, domains, or technology stacks. The hierarchical pattern is essential when a task is too complex for any single agent to handle but can be decomposed into largely independent pieces.
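The decompose-delegate-assemble cycle can be sketched as follows; the decomposition is hard-coded for illustration, whereas a real lead agent would plan the subtasks with a model:

```python
# Sketch of hierarchical delegation: the lead decomposes a request
# into domain subtasks, dispatches each to a worker, and assembles
# the results in an integration step.
def lead_agent(request: str, workers: dict) -> dict:
    subtasks = {
        "backend":  f"backend services for: {request}",
        "frontend": f"user interface for: {request}",
        "database": f"schema and migrations for: {request}",
    }
    results = {name: workers[name](task) for name, task in subtasks.items()}
    # Integration step: the lead verifies every worker produced output
    results["integrated"] = all(results.values())
    return results

# Usage with stub workers that echo their assignment:
workers = {name: (lambda t: f"done: {t}")
           for name in ("backend", "frontend", "database")}
out = lead_agent("user profiles", workers)
```

Note that the lead appears twice in the flow, once to decompose and once to integrate, which is exactly why it is the pattern's single point of failure.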

Design Decision: Choosing an Orchestration Pattern

Start with a sequential pipeline. It is the simplest to build, debug, and reason about. Add parallel execution when you identify independent tasks that are bottlenecking your pipeline. Move to hierarchical delegation when tasks become too complex for a single design-implement-test-review cycle. Most real systems use a hybrid approach: a sequential backbone with parallel branches where appropriate.

Pattern 4: Event-Driven Orchestration

In event-driven orchestration, agents react to events rather than following a predetermined sequence. An event might be "new code committed," "test failed," "review comment posted," or "security vulnerability detected."

# Event-driven orchestration pseudocode
event_handlers = {
    "code_committed": [tester_agent, reviewer_agent],
    "test_failed": [coder_agent],
    "review_comment": [coder_agent],
    "all_tests_passed": [security_agent],
    "security_clear": [deployer_agent],
}

This pattern is most useful for continuous integration scenarios where agents monitor a repository and respond to changes automatically.

Implementing Retry and Fallback Logic

Regardless of which orchestration pattern you choose, your orchestrator needs to handle failures gracefully. Agents may produce invalid output, exceed time limits, or encounter errors in tool execution. A robust orchestrator implements:

  • Retry with feedback: If an agent's output fails validation, retry the agent with the validation errors included in its prompt. Most agents can self-correct when told what went wrong.
  • Retry with escalation: If retry with feedback fails after a set number of attempts, escalate to a more capable model or a human reviewer.
  • Fallback agents: If the primary agent for a role fails, switch to an alternative agent with a different model or prompt strategy.
  • Graceful degradation: If a non-critical agent fails (such as the documentation agent), continue the pipeline without it rather than blocking the entire workflow.
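The first of these, retry with feedback, reduces to a small loop that feeds the validator's error message back into the agent's prompt. The `agent` and `validate` callables are illustrative stand-ins:

```python
# Sketch of retry-with-feedback: rerun the agent with the validation
# error appended to its prompt, escalating after max_attempts.
def run_with_retry(agent, prompt, validate, max_attempts=3):
    feedback = ""
    for _ in range(max_attempts):
        output = agent(prompt + feedback)
        error = validate(output)  # returns None on success
        if error is None:
            return output
        feedback = f"\nPrevious attempt failed validation: {error}"
    raise RuntimeError("retries exhausted: escalate to a stronger model or a human")

# Stub agent that succeeds on its second attempt:
calls = []
def flaky_agent(p):
    calls.append(p)
    return "good" if len(calls) > 1 else "bad"

result = run_with_retry(flaky_agent, "implement the feature",
                        lambda out: None if out == "good" else "output was bad")
```

The `RuntimeError` branch is where retry-with-escalation would take over, swapping in a more capable model or routing the task to a human reviewer.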

38.4 Inter-Agent Communication

The Communication Challenge

Agents cannot share thoughts the way human team members can. Each agent runs independently, often in a separate process or API call, with its own context window and conversation history. Communication between agents must be explicit and structured. The three primary communication mechanisms are shared context, message passing, and artifact exchange.

Shared Context

In the shared context model, all agents read from and write to a shared data store -- a project workspace, a shared document, or a database. Each agent reads the current state of the workspace, performs its work, writes its results back, and the next agent sees the updated state.

# Shared context example
workspace = {
    "task": "Add user authentication to the web app",
    "design_document": None,    # Written by architect
    "source_files": {},         # Written by coder
    "test_files": {},           # Written by tester
    "test_results": None,       # Written by tester
    "review_report": None,      # Written by reviewer
    "status": "pending",
}

# Each agent reads the workspace and updates its section
architect_agent.run(workspace)  # Updates design_document
coder_agent.run(workspace)      # Updates source_files
tester_agent.run(workspace)     # Updates test_files, test_results
reviewer_agent.run(workspace)   # Updates review_report

Advantages: Simple to implement, all agents have access to full project state, natural audit trail.

Disadvantages: Risk of agents overwriting each other's work, no fine-grained access control, workspace can become very large.

Message Passing

In the message passing model, agents communicate by sending structured messages to each other. Each message has a sender, a recipient, a type, and a payload. The orchestrator routes messages between agents.

# Message passing example
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AgentMessage:
    sender: str          # "architect", "coder", "tester", "reviewer"
    recipient: str       # Target agent role
    message_type: str    # "design", "implementation", "test_results", etc.
    payload: dict        # Structured data
    timestamp: datetime

# Architect sends design to coder
message = AgentMessage(
    sender="architect",
    recipient="coder",
    message_type="design_document",
    payload={
        "components": [...],
        "interfaces": [...],
        "constraints": [...]
    },
    timestamp=datetime.now()
)
orchestrator.route(message)

Advantages: Fine-grained control over what information each agent receives, clear audit trail, supports asynchronous communication.

Disadvantages: More complex to implement, requires message format standardization, can become verbose for large payloads.

Artifact Exchange

In the artifact exchange model, agents communicate by producing and consuming artifacts -- files, documents, reports, and other tangible outputs. This is the most natural model for software development, where the primary artifacts are source code files, test files, design documents, and review reports.

# Artifact exchange example -- ArtifactStore stands in for any
# artifact storage API (filesystem, database, or object store)
artifacts = ArtifactStore()

# Architect produces a design artifact
design = artifacts.create(
    type="design_document",
    author="architect",
    content=design_doc_content,
    metadata={"version": 1, "status": "approved"}
)

# Coder consumes the design and produces code artifacts
for component in design.components:
    code = artifacts.create(
        type="source_file",
        author="coder",
        path=f"src/{component.name}.py",
        content=generated_code,
        references=[design.id]
    )

Advantages: Maps naturally to software development workflows, artifacts are independently verifiable, supports versioning and traceability.

Disadvantages: Requires artifact storage infrastructure, agents need to know how to find relevant artifacts, large artifacts consume context window space.

Practical Tip: Start with Artifacts

For software development multi-agent systems, artifact exchange is usually the best starting point. Software development is already organized around artifacts -- source files, test files, configuration files, documentation. Agents that produce and consume files integrate naturally with existing development tools like Git, CI/CD systems, and IDEs. You can always add message passing for coordination metadata on top of an artifact-based system.

Context Summarization

One of the biggest practical challenges in inter-agent communication is context window management. When the architect produces a 5,000-word design document, the coder needs to consume it -- but it may take up a significant portion of the coder's context window, leaving less room for the actual code being generated.

Context summarization addresses this by creating concise summaries of agent outputs for downstream consumption:

def summarize_for_downstream(full_output: str, target_role: str) -> str:
    """Create a role-appropriate summary of an agent's output.

    The architect's full design document might be 5000 words, but
    the coder only needs the interface definitions and constraints.
    The tester only needs the expected behaviors and edge cases.
    """
    summary_prompts = {
        "coder": "Extract only the interface definitions, data structures, "
                 "and implementation constraints from this design document.",
        "tester": "Extract only the expected behaviors, edge cases, "
                  "acceptance criteria, and testable requirements.",
        "reviewer": "Extract the design principles, quality requirements, "
                    "and standards that the implementation should follow.",
    }
    # summarize() stands in for an LLM call that applies the instruction
    return summarize(full_output, summary_prompts[target_role])

This technique keeps downstream agents focused on the information most relevant to their role while staying within context window limits.


38.5 Conflict Resolution Between Agents

Why Conflicts Happen

When multiple agents analyze the same code or design, they will sometimes disagree. The architect may specify an interface that the coder finds impractical to implement. The tester may flag a behavior as a bug that the coder considers a feature. The reviewer may recommend a refactoring that contradicts the architect's design. These conflicts are not bugs in the system -- they are a feature. Productive disagreement catches problems that consensus would miss.

However, conflicts must be resolved for the pipeline to produce a coherent result. Unresolved conflicts lead to inconsistent code, incomplete implementations, and frustrated human operators.

Conflict Types

Design-Implementation Conflicts: The architect specifies something that the coder cannot implement cleanly. For example, the architect designs an interface with five methods, but the coder discovers that two of them require access to data that is not available at that layer of the system.

Implementation-Test Conflicts: The tester finds that the code does not match the specification. This might be a genuine bug, or it might be an ambiguity in the specification that the coder interpreted differently than the tester.

Review-Implementation Conflicts: The reviewer recommends changes that the coder believes would introduce other problems. For example, the reviewer suggests extracting a method for readability, but the coder knows that the extracted method would need to take eight parameters, making it less readable rather than more.

Cross-Agent Priority Conflicts: The security agent says a feature should be restricted, while the architect designed it to be open. The performance agent says a query should be denormalized, while the database agent designed a normalized schema.

Resolution Strategies

Strategy 1: Priority Hierarchy

Establish a clear priority order among agents. When conflicts arise, the higher-priority agent's recommendation wins.

# Priority hierarchy (highest to lowest)
PRIORITY_ORDER = [
    "security",     # Security concerns always win
    "architect",    # Design decisions take precedence
    "reviewer",     # Quality standards over implementation convenience
    "tester",       # Test findings inform implementation changes
    "coder",        # Implementation concerns are last resort
]

This is the simplest approach but can be too rigid. A blanket rule that the architect always overrides the coder ignores cases where the coder has discovered a genuine design flaw.
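Mechanically, resolution under a priority hierarchy is just an index comparison. A minimal sketch, using the `PRIORITY_ORDER` list above as the rank table (`resolve_by_priority` is an illustrative helper, not part of any framework):

```python
# Lower index in the list means higher priority.
PRIORITY_ORDER = [
    "security",     # Security concerns always win
    "architect",    # Design decisions take precedence
    "reviewer",     # Quality standards over implementation convenience
    "tester",       # Test findings inform implementation changes
    "coder",        # Implementation concerns are last resort
]

def resolve_by_priority(role_a: str, role_b: str) -> str:
    """Return the role whose recommendation wins under the hierarchy."""
    return min(role_a, role_b, key=PRIORITY_ORDER.index)
```

For example, `resolve_by_priority("coder", "architect")` returns `"architect"`.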

Strategy 2: Evidence-Based Resolution

Require agents to provide evidence for their positions. The resolution favors the agent with stronger evidence.

from dataclasses import dataclass

class ConflictEscalation(Exception):
    """Raised when a conflict cannot be resolved automatically."""
    def __init__(self, rec_a, rec_b):
        super().__init__("Conflict requires human review")
        self.recommendations = (rec_a, rec_b)

@dataclass
class AgentRecommendation:
    agent_role: str
    recommendation: str
    evidence: list[str]         # Supporting reasons
    severity: str               # "critical", "warning", "suggestion"
    references: list[str]       # Links to code, docs, or standards

def resolve_conflict(rec_a: AgentRecommendation,
                     rec_b: AgentRecommendation) -> AgentRecommendation:
    """Resolve a conflict between two agent recommendations.

    Higher severity wins. If equal severity, more evidence wins.
    If still tied, escalate to human review.
    """
    severity_rank = {"critical": 3, "warning": 2, "suggestion": 1}

    if severity_rank[rec_a.severity] != severity_rank[rec_b.severity]:
        return (rec_a if severity_rank[rec_a.severity] >
                severity_rank[rec_b.severity] else rec_b)

    if len(rec_a.evidence) != len(rec_b.evidence):
        return rec_a if len(rec_a.evidence) > len(rec_b.evidence) else rec_b

    # Tie: escalate to human
    raise ConflictEscalation(rec_a, rec_b)

Strategy 3: Mediator Agent

Introduce a dedicated mediator agent that receives conflicting recommendations and produces a resolution. The mediator has access to the full context and can weigh both sides impartially.

System Prompt (Mediator Agent):
You are a technical mediator. When two agents disagree, you receive
both positions with their evidence. Your job is to:
1. Understand each agent's position and reasoning
2. Identify the underlying concern behind each position
3. Determine if there is a solution that addresses both concerns
4. If not, decide which concern takes priority and explain why
5. Produce a clear resolution with rationale

You must be impartial. Do not default to either agent's position.
Consider the evidence, the project context, and best practices.
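Invoking the mediator is then a matter of packaging both positions, with their evidence, into a single request. A sketch, assuming an agent object with an async `execute` method as in the chapter's other examples (the `Recommendation` dataclass and prompt wording here are illustrative):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Recommendation:
    agent_role: str
    recommendation: str
    evidence: list

async def mediate(mediator, rec_a: Recommendation,
                  rec_b: Recommendation) -> str:
    """Package both positions and their evidence into one mediator call."""
    prompt = (
        "Two agents disagree. Resolve the conflict.\n\n"
        f"Position A ({rec_a.agent_role}): {rec_a.recommendation}\n"
        f"Evidence A: {'; '.join(rec_a.evidence)}\n\n"
        f"Position B ({rec_b.agent_role}): {rec_b.recommendation}\n"
        f"Evidence B: {'; '.join(rec_b.evidence)}"
    )
    return await mediator.execute(prompt)
```

The key design point is that the mediator sees both positions in one context window, so it can weigh them against each other rather than judging each in isolation.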

Strategy 4: Human Escalation

Some conflicts cannot and should not be resolved automatically. When agents disagree on a fundamental design question, when the stakes are high (security, data integrity), or when the evidence is genuinely balanced, the right answer is to escalate to a human developer.

def should_escalate(conflict: Conflict) -> bool:
    """Determine if a conflict should be escalated to human review."""
    return (
        conflict.involves_security or
        conflict.involves_data_integrity or
        conflict.severity == "critical" or
        conflict.auto_resolution_attempts >= 3 or
        conflict.estimated_impact == "high"
    )

Key Insight: Conflicts Are Information

Do not treat agent conflicts as failures in your system. Treat them as valuable information. When the architect and the coder disagree, it often reveals an ambiguity in the requirements, a gap in the design, or a constraint that was not initially considered. A multi-agent system that never produces conflicts is probably not getting enough diverse perspectives. The goal is not to eliminate conflicts but to resolve them efficiently and learn from them.


38.6 Workflow Automation with Multiple Agents

From Issue to Pull Request

The most compelling application of multi-agent development systems is end-to-end workflow automation: taking a feature request or bug report and producing a complete, tested, reviewed pull request with minimal human intervention.

Here is what a fully automated multi-agent workflow looks like:

1. Issue Created (Human writes a feature request)
        |
        v
2. [Planner Agent] Analyzes the issue, identifies affected files,
   and creates a task breakdown
        |
        v
3. [Architect Agent] Designs the solution, specifying components,
   interfaces, and constraints
        |
        v
4. [Coder Agent] Implements the design across all affected files
        |
        v
5. [Tester Agent] Writes tests and runs them against the implementation
        |
        +-- Tests fail? --> [Coder Agent] fixes and re-runs
        |
        v
6. [Reviewer Agent] Reviews the complete changeset
        |
        +-- Critical issues? --> [Coder Agent] addresses feedback
        |
        v
7. [Documentation Agent] Updates docs and changelogs
        |
        v
8. Pull Request Created (Human reviews and merges)

Implementing the Workflow

A practical workflow implementation combines orchestration with state management:

class DevelopmentWorkflow:
    """Orchestrates a multi-agent development workflow."""

    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents
        self.state = WorkflowState()

    async def run(self, issue: Issue) -> PullRequest:
        """Execute the complete workflow from issue to PR."""

        # Phase 1: Planning
        plan = await self.agents["planner"].execute(
            task="Analyze this issue and create a task breakdown",
            context={"issue": issue, "codebase": self.codebase_summary}
        )
        self.state.update("planning", plan)

        # Phase 2: Design
        design = await self.agents["architect"].execute(
            task="Design a solution for this plan",
            context={"plan": plan, "existing_architecture": self.arch_docs}
        )
        self.state.update("design", design)

        # Phase 3: Implementation
        code = await self.agents["coder"].execute(
            task="Implement this design",
            context={"design": design, "code_style": self.style_guide}
        )
        self.state.update("implementation", code)

        # Phase 4: Testing (with bounded retry loop)
        for attempt in range(3):
            test_results = await self.agents["tester"].execute(
                task="Write and run tests for this implementation",
                context={"design": design, "code": code}
            )
            if test_results.all_passed:
                break
            code = await self.agents["coder"].execute(
                task="Fix the failing tests",
                context={"code": code, "failures": test_results.failures}
            )
        else:
            # Three strikes without a pass: escalate instead of looping
            raise RuntimeError("Tests still failing after 3 attempts")
        self.state.update("testing", test_results)

        # Phase 5: Review (with fix loop)
        review = await self.agents["reviewer"].execute(
            task="Review this changeset",
            context={"design": design, "code": code, "tests": test_results}
        )
        if review.has_critical_issues:
            code = await self.agents["coder"].execute(
                task="Address these review comments",
                context={"code": code, "review": review}
            )
        self.state.update("review", review)

        # Phase 6: Create PR
        return self.create_pull_request(code, design, test_results, review)

Feedback Loops

The most important feature of an automated workflow is feedback loops. When a test fails, the failing test and its error message are fed back to the coder agent for correction. When a reviewer flags an issue, the feedback is routed back to the appropriate agent. These loops are what make the system self-correcting.

However, feedback loops must be bounded. Without a maximum iteration count, a failing test could cause an infinite loop of code changes that never resolve the issue. Best practice is to allow 2-3 iterations of each feedback loop before escalating to a human.

Practical Tip: The Three-Strike Rule

Give each agent three attempts to resolve feedback. If the coder cannot fix a failing test after three attempts, or the implementation cannot pass review after three rounds of feedback, escalate to a human. This prevents infinite loops while still giving agents a fair chance to self-correct.
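The three-strike rule generalizes to any produce-and-check cycle. A minimal sketch of a reusable bounded loop (the `produce`/`check` callable shapes are an assumed convention, not a library API):

```python
import asyncio

async def run_with_feedback(produce, check, max_attempts: int = 3):
    """Run produce(feedback), check the result, and feed failure
    details back in, escalating after max_attempts (three strikes).

    produce: async callable taking the previous feedback (or None).
    check:   callable returning (passed, feedback) for a result.
    """
    feedback = None
    for _ in range(max_attempts):
        result = await produce(feedback)
        passed, feedback = check(result)
        if passed:
            return result
    raise RuntimeError(f"Escalating to human after {max_attempts} attempts")
```

Both the test-fix loop and the review-fix loop can be expressed this way, which keeps the escalation policy in one place.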

Checkpoint and Resume

Long-running workflows should support checkpointing -- saving the state of the workflow at each phase boundary so it can be resumed if interrupted. This is especially important when using API-based agents that may encounter rate limits, timeouts, or transient errors.

import json
from datetime import datetime
from pathlib import Path

class CheckpointedWorkflow:
    """A workflow that saves state at each phase boundary."""

    CHECKPOINT_DIR = Path(".workflow")

    def save_checkpoint(self, phase: str, data: dict) -> None:
        checkpoint = {
            "phase": phase,
            "data": data,
            "timestamp": datetime.now().isoformat(),
            "agent_versions": self.get_agent_versions(),
        }
        self.CHECKPOINT_DIR.mkdir(exist_ok=True)
        (self.CHECKPOINT_DIR / f"checkpoint_{phase}.json").write_text(
            json.dumps(checkpoint, indent=2)
        )

    def resume_from_checkpoint(self) -> str:
        """Find the most recent checkpoint and resume from there."""
        checkpoints = sorted(
            self.CHECKPOINT_DIR.glob("checkpoint_*.json"),
            key=lambda p: p.stat().st_mtime,  # newest last, not alphabetical
        )
        if not checkpoints:
            return "start"
        latest = json.loads(checkpoints[-1].read_text())
        self.state.restore(latest["data"])
        return latest["phase"]

38.7 Quality Assurance in Multi-Agent Systems

Cross-Agent Verification

The most powerful quality assurance technique in multi-agent systems is cross-agent verification: having one agent check another agent's work. This works because each agent brings a different perspective and a different set of biases.

Design-Implementation Verification: After the coder produces the implementation, a verification step checks that every component, interface, and constraint in the architect's design is correctly implemented.

async def verify_implementation(design: DesignDoc,
                                code: dict[str, str]) -> VerificationReport:
    """Verify that the implementation matches the design specification."""
    verifier = Agent(
        role="verifier",
        system_prompt="""Compare the design specification against the
        implementation. For each element in the design, verify:
        1. The component exists in the code
        2. The interface matches (function signatures, parameter types)
        3. The documented constraints are respected
        4. The data flow matches the design
        Report any discrepancies with specific details."""
    )
    return await verifier.execute(
        task="Verify this implementation against its design",
        context={"design": design, "implementation": code}
    )

Test Coverage Verification: After the tester produces tests, a verification step checks that the tests actually cover the requirements and edge cases specified in the design.

Review Consistency Verification: After the reviewer produces feedback, a verification step checks that the feedback is consistent with the project's coding standards and does not contradict previous review decisions.
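Before spending an agent call on judging test quality, a cheap structural pass can flag requirements that have no matching test at all. A deterministic sketch, assuming the (hypothetical) convention that a requirement ID like "REQ-3" appears as "req_3" in test names:

```python
def uncovered_requirements(requirement_ids: list[str],
                           test_names: list[str]) -> list[str]:
    """Return requirement IDs that no test name mentions."""
    def mentions(req: str, name: str) -> bool:
        # Assumed naming convention: "REQ-3" appears as "req_3"
        return req.lower().replace("-", "_") in name.lower()
    return [req for req in requirement_ids
            if not any(mentions(req, name) for name in test_names)]
```

Anything this check flags is an unambiguous gap; the agent-based verification then focuses on the harder question of whether the tests that do exist actually exercise the requirement.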

The Adversarial Testing Pattern

A particularly effective quality assurance pattern is adversarial testing, where one agent is specifically designed to find flaws in another agent's work.

System Prompt (Adversarial Tester):
You are a hostile QA engineer. Your job is to break the code.
For each function and method:
1. What happens with None/null inputs?
2. What happens with empty strings, empty lists, zero values?
3. What happens with extremely large inputs?
4. What happens with Unicode characters, special characters?
5. What happens with concurrent access?
6. What happens if external services are unavailable?
7. What happens if the database is down?
8. What happens if the file system is full?

Write tests that exercise these failure modes. Your success is
measured by bugs found, not by tests passed.

The adversarial tester finds bugs that a standard tester misses because its incentive structure is inverted: it succeeds when code fails.

Multi-Layer Review

Instead of a single review pass, implement multiple review layers, each focused on a different aspect of quality:

review_layers = [
    ReviewLayer(
        name="correctness",
        focus="Does the code do what the design says it should do?",
        agent=correctness_reviewer
    ),
    ReviewLayer(
        name="security",
        focus="Does the code have security vulnerabilities?",
        agent=security_reviewer
    ),
    ReviewLayer(
        name="performance",
        focus="Does the code have performance issues?",
        agent=performance_reviewer
    ),
    ReviewLayer(
        name="maintainability",
        focus="Is the code maintainable and well-structured?",
        agent=maintainability_reviewer
    ),
]

# Run all review layers in parallel
results = await asyncio.gather(*[
    layer.agent.review(code) for layer in review_layers
])

Key Concept: Defense in Depth

Just as security systems use defense in depth -- multiple layers of protection so that if one fails, others catch the threat -- multi-agent quality assurance uses multiple verification layers so that if one agent misses an issue, another catches it. No single agent is responsible for all quality dimensions. The system's reliability comes from the combination of independent checks.

Measuring Quality Across the Pipeline

To know whether your multi-agent system is actually producing better results than a single agent, you need metrics:

  • Defect escape rate: How many bugs make it through the entire pipeline to production?
  • First-pass success rate: What percentage of implementations pass tests and review on the first attempt?
  • Rework rate: How many feedback loop iterations are needed on average?
  • Coverage completeness: What percentage of requirements have corresponding tests?
  • Review finding rate: How many issues does the reviewer catch per review?
  • Time to completion: How long does the entire pipeline take from issue to PR?

Track these metrics over time to identify where your pipeline is strong and where it needs improvement.
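Several of these metrics fall out of simple per-run records. A sketch, where the `RunRecord` schema is an assumption about what your pipeline logs:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    fix_cycles: int        # feedback loop iterations used
    escaped_defects: int   # bugs found after the PR merged

def pipeline_metrics(runs: list[RunRecord]) -> dict:
    """Compute first-pass success rate, average rework, and
    defect escape rate from per-run records."""
    n = len(runs)
    return {
        "first_pass_rate": sum(r.fix_cycles == 0 for r in runs) / n,
        "avg_rework": sum(r.fix_cycles for r in runs) / n,
        "defect_escape_rate": sum(r.escaped_defects > 0 for r in runs) / n,
    }
```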


38.8 Scaling Agent Teams

The Coordination Tax

Adding more agents to a team does not automatically improve results. Each additional agent adds coordination overhead: more messages to route, more potential conflicts to resolve, more context to share, and more failure points to monitor. This coordination overhead is the "tax" you pay for the benefits of specialization.

The relationship between team size and productivity follows a pattern familiar from human teams: small increases in team size produce large improvements, but beyond a certain point, the coordination tax exceeds the benefit of the additional agent.

Productivity
    ^
    |         .----.
    |        /      \
    |       /        \
    |      /          \
    |     /            \
    |    /              \
    |   /                \
    |  /                  \
    | /                    \
    +-------------------------> Team Size
    1  2  3  4  5  6  7  8  9

For most development tasks, the sweet spot is 3-5 agents. Beyond that, coordination costs begin to outweigh specialization benefits unless the task is complex enough to justify the overhead.
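One way to see why the tax grows so fast: if every agent can message every other agent, the number of communication channels grows quadratically, following the standard pairwise formula n(n-1)/2:

```python
def communication_channels(n_agents: int) -> int:
    """Pairwise channels in a fully connected team: n * (n - 1) / 2."""
    return n_agents * (n_agents - 1) // 2
```

Three agents need only 3 channels; five need 10; nine need 36. Each channel is context that must be shared and a potential conflict that must be resolved, which is why the scaling strategies below all work by cutting channels rather than adding agents.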

Strategies for Scaling

Strategy 1: Hierarchical Teams

Instead of one orchestrator managing ten agents, create hierarchies where lead agents manage small sub-teams.

[Lead Orchestrator]
    |
    +-- [Frontend Lead] --> [UI Agent], [Style Agent], [A11y Agent]
    |
    +-- [Backend Lead] --> [API Agent], [DB Agent], [Auth Agent]
    |
    +-- [QA Lead] --> [Unit Test Agent], [Integration Test Agent], [Perf Agent]

Each lead manages 2-4 workers, keeping the span of control manageable. The lead orchestrator only communicates with the leads, not with every individual agent.
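The hierarchy above can be sketched as two levels of fan-out, where the orchestrator awaits only the leads. This is a minimal illustration, assuming workers are async callables; a real system would pass richer task objects:

```python
import asyncio

async def run_team(lead_name: str, workers, task: str) -> dict:
    """A lead fans the task out to its own workers and collects results."""
    results = await asyncio.gather(*(worker(task) for worker in workers))
    return {lead_name: list(results)}

async def run_hierarchy(teams: dict, task: str) -> dict:
    """The lead orchestrator talks only to team leads, never to workers."""
    merged: dict = {}
    for team_result in await asyncio.gather(
        *(run_team(name, workers, task) for name, workers in teams.items())
    ):
        merged.update(team_result)
    return merged
```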

Strategy 2: Domain-Based Partitioning

Divide agents by domain rather than by role. Each domain gets its own mini-team with its own architect, coder, and tester.

[Frontend Team]                [Backend Team]
  - Frontend Architect           - Backend Architect
  - Frontend Coder               - Backend Coder
  - Frontend Tester              - Backend Tester

[Shared Services]
  - Security Reviewer (reviews both teams' output)
  - Integration Tester (tests cross-domain interactions)

This reduces inter-team communication because most interactions happen within a domain. Cross-domain communication only happens when components need to integrate.

Strategy 3: Dynamic Team Composition

Not every task needs every agent. A dynamic system selects agents based on task requirements.

def assemble_team(task: Task) -> list[Agent]:
    """Select agents based on task requirements."""
    team = [planner_agent]  # Always needed

    if task.requires_design:
        team.append(architect_agent)

    team.append(coder_agent)  # Always needed

    if task.has_tests or task.requires_tests:
        team.append(tester_agent)

    if task.touches_security_sensitive_code:
        team.append(security_agent)

    if task.modifies_database:
        team.append(database_agent)

    if task.changes_api:
        team.append(api_compatibility_agent)

    team.append(reviewer_agent)  # Always needed

    return team

This approach keeps team size small for simple tasks while scaling up for complex ones.

Resource Management

Scaling agent teams means scaling API usage. Each agent consumes tokens (and therefore cost), and parallel execution multiplies resource usage. Practical resource management includes:

  • Token budgets per agent: Set maximum token limits for each agent's input and output to prevent runaway costs.
  • Concurrency limits: Cap the number of agents running simultaneously to stay within API rate limits.
  • Model tiering: Use less expensive models for routine tasks (linting, formatting) and more capable models for complex tasks (architecture, security review).
  • Caching: Cache agent outputs for tasks that are repeated (such as reviewing the same file against the same standards).

class ResourceManager:
    """Manages resource allocation for agent teams."""

    def __init__(self, max_concurrent: int = 5,
                 budget_per_run: float = 10.0):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.budget = budget_per_run
        self.spent = 0.0

    async def execute_agent(self, agent: Agent, task: str) -> AgentResult:
        if self.spent >= self.budget:
            raise BudgetExhausted(f"Spent ${self.spent:.2f} "
                                  f"of ${self.budget:.2f} budget")
        async with self.semaphore:
            result = await agent.execute(task)
            self.spent += result.cost
            return result

Warning: Cost Awareness

A multi-agent pipeline that runs architect, coder, tester, reviewer, and security agents -- each making multiple API calls -- can easily cost 10-50x more per task than a single agent. Always implement budget controls, monitor costs per run, and evaluate whether the quality improvement justifies the cost increase. For many tasks, a single well-prompted agent remains the most cost-effective approach.


38.9 Monitoring and Observability

Why Monitoring Matters

A multi-agent system is significantly more complex than a single agent. When something goes wrong -- and it will -- you need visibility into what each agent did, what it produced, how long it took, and where the failure occurred. Without monitoring, debugging a multi-agent pipeline is like debugging a distributed system with no logs.

What to Monitor

Agent-Level Metrics:

  • Execution time per agent
  • Token usage (input and output) per agent
  • Success/failure rate per agent
  • Number of retry attempts per agent
  • Quality score of agent output (if measurable)

Pipeline-Level Metrics:

  • Total pipeline execution time
  • End-to-end success rate
  • Number of feedback loop iterations
  • Cost per pipeline run
  • Stage where failures most commonly occur

Communication Metrics:

  • Message volume between agents
  • Context size passed to each agent
  • Number of conflicts generated
  • Conflict resolution success rate

Implementing Observability

A practical monitoring implementation wraps each agent with instrumentation:

class MonitoredAgent:
    """Wraps an agent with monitoring instrumentation."""

    def __init__(self, agent: Agent, metrics: MetricsCollector):
        self.agent = agent
        self.metrics = metrics

    async def execute(self, task: str, context: dict) -> AgentResult:
        start_time = time.time()
        run_id = str(uuid.uuid4())

        self.metrics.record_start(
            agent=self.agent.role,
            run_id=run_id,
            context_size=len(str(context)),
        )

        try:
            result = await self.agent.execute(task, context)
            elapsed = time.time() - start_time

            self.metrics.record_success(
                agent=self.agent.role,
                run_id=run_id,
                elapsed_seconds=elapsed,
                input_tokens=result.input_tokens,
                output_tokens=result.output_tokens,
                cost=result.cost,
            )
            return result

        except Exception as e:
            elapsed = time.time() - start_time
            self.metrics.record_failure(
                agent=self.agent.role,
                run_id=run_id,
                elapsed_seconds=elapsed,
                error=str(e),
            )
            raise

Structured Logging

Every agent action should produce a structured log entry that can be searched and analyzed:

import structlog

logger = structlog.get_logger()

class AgentLogger:
    """Provides structured logging for agent actions."""

    def log_agent_start(self, agent_role: str, task: str,
                        run_id: str) -> None:
        logger.info(
            "agent.started",
            agent_role=agent_role,
            task_summary=task[:200],
            run_id=run_id,
        )

    def log_agent_output(self, agent_role: str, output_type: str,
                         output_size: int, run_id: str) -> None:
        logger.info(
            "agent.output",
            agent_role=agent_role,
            output_type=output_type,
            output_size_chars=output_size,
            run_id=run_id,
        )

    def log_conflict(self, agent_a: str, agent_b: str,
                     conflict_type: str, resolution: str,
                     run_id: str) -> None:
        logger.warning(
            "agent.conflict",
            agent_a=agent_a,
            agent_b=agent_b,
            conflict_type=conflict_type,
            resolution=resolution,
            run_id=run_id,
        )

Trace Visualization

For debugging complex multi-agent interactions, a trace visualization shows the sequence of agent actions, their inputs and outputs, and the data flow between them:

Pipeline Run #42 - Feature: Add user authentication
================================================================
[12:00:01] PLANNER    started  | Input: 1,200 tokens
[12:00:08] PLANNER    complete | Output: 800 tokens | 7.2s | $0.02
[12:00:08] ARCHITECT  started  | Input: 2,000 tokens
[12:00:22] ARCHITECT  complete | Output: 3,500 tokens | 14.1s | $0.08
[12:00:22] CODER      started  | Input: 4,200 tokens
[12:00:45] CODER      complete | Output: 5,800 tokens | 23.4s | $0.12
[12:00:45] TESTER     started  | Input: 6,000 tokens (parallel)
[12:00:45] REVIEWER   started  | Input: 5,500 tokens (parallel)
[12:01:02] TESTER     complete | 3/15 tests FAILED | 17.1s | $0.06
[12:01:05] REVIEWER   complete | 2 critical issues | 20.3s | $0.07
[12:01:05] CODER      started  | Input: 7,200 tokens (fix cycle 1)
[12:01:28] CODER      complete | Output: 4,100 tokens | 23.0s | $0.10
[12:01:28] TESTER     started  | Input: 5,800 tokens (retest)
[12:01:40] TESTER     complete | 15/15 tests PASSED | 12.2s | $0.05
================================================================
Total: 117.3s | $0.50 | 1 fix cycle | Status: SUCCESS

Practical Tip: Log Everything, Display Selectively

Log every detail of every agent interaction to persistent storage. But do not dump all of that information on the screen during normal operation. Instead, provide a concise summary with the ability to drill down into details when needed. The trace visualization above is a summary view; the full logs behind it contain the complete input and output of every agent call.

Alerting and Anomaly Detection

For multi-agent systems running in production or semi-production environments, set up alerts for anomalous behavior:

  • Agent execution time exceeds 2x the historical average
  • Token usage per run exceeds budget
  • Failure rate for any agent exceeds a threshold
  • Number of feedback loop iterations exceeds the maximum
  • Conflict resolution fails (requires human escalation)
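A sketch of such an anomaly check over the metrics listed above; the flat dict schema for `run` and `history` is an assumption for illustration:

```python
def check_anomalies(run: dict, history: dict,
                    budget: float, max_iterations: int) -> list[str]:
    """Return alert messages for anomalous pipeline behavior."""
    alerts = []
    # Execution time more than 2x the historical average for that agent
    for agent, elapsed in run["elapsed"].items():
        if elapsed > 2 * history["avg_elapsed"].get(agent, float("inf")):
            alerts.append(f"{agent}: execution time over 2x average")
    # Cost exceeds the per-run budget
    if run["cost"] > budget:
        alerts.append(f"cost ${run['cost']:.2f} exceeds "
                      f"budget ${budget:.2f}")
    # Feedback loops ran past the configured maximum
    if run["feedback_iterations"] > max_iterations:
        alerts.append("feedback loop iterations exceeded maximum")
    return alerts
```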

38.10 Building a Multi-Agent Development Pipeline

Putting It All Together

This section walks through building a complete multi-agent development pipeline from scratch. The pipeline takes a GitHub issue as input and produces a pull request as output, using the concepts from every previous section in this chapter.

Step 1: Define the Agent Team

Start by defining your agents with clear roles, system prompts, and tool access. For this pipeline, we use the core four agents plus a planner:

from agents import Agent, Tool

planner = Agent(
    role="planner",
    model="claude-sonnet-4-20250514",
    system_prompt=PLANNER_PROMPT,
    tools=[Tool.read_file, Tool.search_codebase, Tool.read_issue],
    max_tokens=4096,
)

architect = Agent(
    role="architect",
    model="claude-sonnet-4-20250514",
    system_prompt=ARCHITECT_PROMPT,
    tools=[Tool.read_file, Tool.search_codebase, Tool.write_design_doc],
    max_tokens=8192,
)

coder = Agent(
    role="coder",
    model="claude-sonnet-4-20250514",
    system_prompt=CODER_PROMPT,
    tools=[Tool.read_file, Tool.write_file, Tool.run_linter],
    max_tokens=8192,
)

tester = Agent(
    role="tester",
    model="claude-sonnet-4-20250514",
    system_prompt=TESTER_PROMPT,
    tools=[Tool.read_file, Tool.write_file, Tool.run_tests],
    max_tokens=8192,
)

reviewer = Agent(
    role="reviewer",
    model="claude-sonnet-4-20250514",
    system_prompt=REVIEWER_PROMPT,
    tools=[Tool.read_file, Tool.search_codebase, Tool.run_linter],
    max_tokens=4096,
)

Step 2: Design the Orchestration Flow

Use a sequential pipeline with parallel branches for independent analysis:

async def run_pipeline(issue: Issue) -> PullRequest:
    """Execute the full multi-agent development pipeline."""

    # Sequential: Planning
    plan = await planner.execute(f"Analyze issue #{issue.number}: "
                                 f"{issue.title}\n{issue.body}")

    # Sequential: Architecture
    design = await architect.execute(
        f"Design a solution based on this plan:\n{plan.output}"
    )

    # Sequential: Implementation
    implementation = await coder.execute(
        f"Implement this design:\n{design.output}"
    )

    # Parallel: Testing and Review
    test_results, review = await asyncio.gather(
        tester.execute(f"Test this implementation:\n"
                       f"{implementation.output}"),
        reviewer.execute(f"Review this implementation:\n"
                         f"{implementation.output}")
    )

    # Fix loop if needed
    if not test_results.all_passed or review.has_critical_issues:
        feedback = compile_feedback(test_results, review)
        implementation = await coder.execute(
            f"Address this feedback:\n{feedback}\n\n"
            f"Current code:\n{implementation.output}"
        )

    return create_pr(issue, design, implementation, test_results, review)

Step 3: Implement Communication

Set up the artifact store and message routing:

class PipelineArtifacts:
    """Manages artifacts produced by the pipeline."""

    def __init__(self, workspace_dir: Path):
        self.workspace = workspace_dir
        self.workspace.mkdir(parents=True, exist_ok=True)
        self.manifest: list[Artifact] = []

    def store(self, artifact_type: str, content: str,
              author: str, metadata: dict | None = None) -> Artifact:
        """Store an artifact and return its reference."""
        artifact = Artifact(
            id=str(uuid.uuid4()),
            type=artifact_type,
            author=author,
            content=content,
            metadata=metadata or {},
            created_at=datetime.now(),
        )
        self.manifest.append(artifact)

        # Write to disk for persistence
        artifact_path = self.workspace / f"{artifact.id}.json"
        artifact_path.write_text(json.dumps(artifact.to_dict(), indent=2))

        return artifact

    def get_by_type(self, artifact_type: str) -> list[Artifact]:
        """Retrieve all artifacts of a given type."""
        return [a for a in self.manifest if a.type == artifact_type]

Step 4: Add Conflict Resolution

Implement a conflict resolver that handles disagreements between the tester and reviewer:

class ConflictResolver:
    """Resolves conflicts between agent recommendations."""

    def __init__(self, mediator: Agent):
        self.mediator = mediator
        self.resolution_log: list[Resolution] = []

    async def resolve(self, conflict: Conflict) -> Resolution:
        """Resolve a conflict between two agents."""

        # Try automatic resolution first
        if conflict.severity_a != conflict.severity_b:
            resolution = self.resolve_by_severity(conflict)
        elif conflict.type == "style":
            resolution = self.resolve_by_standard(conflict)
        else:
            # Use mediator agent for complex conflicts
            resolution = await self.mediator.execute(
                f"Resolve this conflict:\n"
                f"Agent A ({conflict.agent_a}): {conflict.position_a}\n"
                f"Agent B ({conflict.agent_b}): {conflict.position_b}\n"
                f"Context: {conflict.context}"
            )

        self.resolution_log.append(resolution)
        return resolution

Step 5: Add Monitoring

Wrap the pipeline with monitoring instrumentation:

class PipelineMonitor:
    """Monitors pipeline execution and produces reports."""

    def __init__(self):
        self.events: list[PipelineEvent] = []
        self.start_time: float | None = None

    def start_pipeline(self, issue_id: str) -> None:
        self.start_time = time.time()
        self.events.append(PipelineEvent(
            type="pipeline_start",
            timestamp=time.time(),
            data={"issue_id": issue_id},
        ))

    def record_agent_result(self, agent_role: str,
                            success: bool, elapsed: float,
                            tokens_used: int, cost: float) -> None:
        self.events.append(PipelineEvent(
            type="agent_complete",
            timestamp=time.time(),
            data={
                "agent": agent_role,
                "success": success,
                "elapsed_seconds": elapsed,
                "tokens": tokens_used,
                "cost": cost,
            },
        ))

    def generate_report(self) -> PipelineReport:
        total_time = time.time() - self.start_time
        total_cost = sum(
            e.data["cost"] for e in self.events
            if e.type == "agent_complete"
        )
        return PipelineReport(
            total_time=total_time,
            total_cost=total_cost,
            agent_count=len(set(
                e.data["agent"] for e in self.events
                if e.type == "agent_complete"
            )),
            events=self.events,
        )

Step 6: End-to-End Integration

Finally, integrate everything into a single entry point:

async def main():
    """Run the complete multi-agent development pipeline."""

    # Initialize components
    artifacts = PipelineArtifacts(Path(".pipeline/artifacts"))
    monitor = PipelineMonitor()
    resolver = ConflictResolver(mediator=mediator_agent)

    # Get the issue
    issue = await github.get_issue(repo="myorg/myrepo", number=42)

    # Run the pipeline
    monitor.start_pipeline(issue.id)

    try:
        pr = await run_pipeline(issue)
        monitor.complete_pipeline(success=True)
        print(f"Pull request created: {pr.url}")
    except Exception as e:
        monitor.complete_pipeline(success=False, error=str(e))
        print(f"Pipeline failed: {e}")
    finally:
        report = monitor.generate_report()
        report.save(".pipeline/reports/")
        print(report.summary())

Real-World Considerations

Building a multi-agent pipeline for production use requires addressing several practical concerns:

Idempotency. If the pipeline is interrupted and restarted, it should not duplicate work. Use checkpoints and artifact deduplication to ensure each step is executed at most once.
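
One way to get idempotency is to key each step's result on a hash of its inputs and skip the step when a stored result already exists. The `checkpointed` helper below is a hypothetical sketch of that pattern, not part of the pipeline API above:

```python
import hashlib
import json
from pathlib import Path


def checkpointed(step_name: str, inputs: dict, checkpoint_dir: Path, run):
    """Run a pipeline step at most once per unique input.

    The checkpoint key hashes the step name plus its inputs, so a
    restarted pipeline reuses the stored result instead of re-running
    the agent. `run` is a zero-argument callable producing a
    JSON-serializable result.
    """
    key = hashlib.sha256(
        json.dumps({"step": step_name, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()
    path = checkpoint_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # already done: reuse
    result = run()
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Because the key covers the inputs, changing the issue or the prompt naturally invalidates the checkpoint, while a plain restart hits the cache.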

Determinism. AI agents are inherently non-deterministic. The same input may produce different outputs on different runs. For reproducibility, log the complete input and output of every agent call, including the model version, temperature, and seed (if available).

Human oversight. Even the most sophisticated multi-agent pipeline should include human checkpoints for high-stakes decisions. The pipeline should pause for human approval before merging a PR, modifying critical infrastructure, or making changes to security-sensitive code.
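
Such a checkpoint can start as a simple predicate that flags risky changes for human review before the pipeline proceeds. The `requires_approval` function and its trigger conditions below are a hypothetical sketch; tune them to your own risk policy:

```python
def requires_approval(change: dict) -> bool:
    """Decide whether a proposed change needs a human sign-off.

    The trigger conditions here are illustrative: merging a PR always
    requires approval, as does touching security-sensitive paths.
    """
    sensitive_paths = ("infra/", "deploy/", "auth/", "secrets/")
    if change.get("target") == "merge_pr":
        return True
    if any(str(p).startswith(sensitive_paths)
           for p in change.get("files", [])):
        return True
    return False
```

In a real pipeline this predicate would gate a pause step: when it returns True, the run blocks until a human approves or rejects the change.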

Cost control. Set per-run and per-agent budgets. Monitor costs in real time and abort runs that exceed thresholds. Use cheaper models for routine tasks and reserve expensive models for complex reasoning.
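
A minimal budget guard can enforce both limits by raising as soon as either threshold is crossed. The `CostGuard` class below is a sketch with illustrative dollar amounts, not a production accounting system:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run or an individual agent goes over budget."""


class CostGuard:
    """Track spend per agent and per run, aborting over budget."""

    def __init__(self, run_budget: float = 5.00,
                 agent_budget: float = 2.00):
        self.run_budget = run_budget
        self.agent_budget = agent_budget
        self.run_spend = 0.0
        self.agent_spend: dict[str, float] = {}

    def charge(self, agent: str, cost: float) -> None:
        """Record the cost of one agent call; raise if over budget."""
        self.run_spend += cost
        self.agent_spend[agent] = self.agent_spend.get(agent, 0.0) + cost
        if self.agent_spend[agent] > self.agent_budget:
            raise BudgetExceeded(f"{agent} exceeded ${self.agent_budget}")
        if self.run_spend > self.run_budget:
            raise BudgetExceeded(f"run exceeded ${self.run_budget}")
```

Calling `guard.charge(role, cost)` after every agent call, with the cost values the monitor already records, turns the budget into a hard stop rather than a report-time surprise.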

Error recovery. When an agent fails, the pipeline should attempt recovery before giving up. Common recovery strategies include retrying with a different prompt, retrying with a more capable model, and falling back to a simpler approach.
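
These strategies compose naturally as an ordered list of fallbacks, tried until one succeeds. The `run_with_recovery` helper below is a sketch of that escalation ladder; the strategy callables it expects are hypothetical stand-ins for "retry with a different prompt", "retry with a stronger model", and so on:

```python
import asyncio


async def run_with_recovery(task, strategies):
    """Try each recovery strategy in order until one succeeds.

    `strategies` is a list of async callables taking the task. If every
    strategy fails, the accumulated errors are surfaced together so the
    failure is debuggable.
    """
    errors = []
    for attempt, strategy in enumerate(strategies, start=1):
        try:
            return await strategy(task)
        except Exception as e:  # collect and escalate to the next strategy
            errors.append(f"attempt {attempt} ({strategy.__name__}): {e}")
    raise RuntimeError("all recovery strategies failed:\n" + "\n".join(errors))
```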

Chapter Connection: From Single Agent to Multi-Agent

In Chapter 36, you learned to build a single agent that reads code, makes changes, and runs tests. In Chapter 37, you learned to give agents custom tools and connect them to external services through MCP. This chapter has shown you how to compose multiple agents into a coordinated team that mirrors a professional development workflow. The progression from a single agent (Chapter 36), to an agent with custom tools (Chapter 37), to a multi-agent system (this chapter) represents a trajectory toward increasingly autonomous software development. Chapter 39 will take the next step: building applications that themselves use AI as a core capability.


Summary

Multi-agent development systems bring the power of team-based software development to AI-assisted coding. By splitting work across specialized agents -- architect, coder, tester, reviewer -- you achieve better results than any single agent can produce alone.

The key principles are:

  1. Specialize agents with focused system prompts, role-specific tools, and clear behavioral boundaries.
  2. Choose the right orchestration pattern for your task: sequential for simple flows, parallel for independent analysis, hierarchical for complex decomposition.
  3. Communicate through artifacts that map naturally to software development outputs.
  4. Embrace conflicts as valuable information and resolve them through priority hierarchies, evidence-based evaluation, or mediator agents.
  5. Automate end-to-end from issue to pull request, with bounded feedback loops for self-correction.
  6. Verify across agents using adversarial testing and multi-layer review.
  7. Scale carefully by managing coordination overhead, partitioning by domain, and assembling dynamic teams.
  8. Monitor everything with structured logging, metrics, and trace visualization.

The multi-agent development pipeline is not a replacement for human developers -- it is a force multiplier. It handles the routine, repetitive aspects of software development at speed and scale, freeing human developers to focus on the creative, strategic, and judgment-intensive work that remains beyond AI's reach.


Looking Ahead

In Chapter 39, you will learn how to build applications that use AI as a core feature -- not just as a development tool, but as a runtime component of the software you deliver to users. The multi-agent patterns you learned here will reappear in that context, as user-facing AI applications often use multiple specialized models working together to deliver intelligent behavior.