Chapter 38: Key Takeaways

Multi-Agent Development Systems


  1. Multiple specialized agents outperform a single generalist agent for complex tasks. A single agent faces context window saturation, attention diffusion, and role confusion when juggling design, implementation, testing, and review simultaneously. Splitting work across focused agents with distinct system prompts, tool sets, and behavioral boundaries produces higher-quality results -- just as a team of human specialists outperforms a lone generalist.

  2. The four core agent roles mirror a professional software team. The Architect designs systems and defines interfaces. The Coder implements designs exactly as specified. The Tester writes and runs tests adversarially to find bugs. The Reviewer evaluates code quality, security, performance, and maintainability. Each role has explicit "Do NOT" constraints that prevent it from overstepping its boundaries, and this separation of concerns is what makes the system effective.
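The role boundaries above can be sketched as configuration data. The four role names come from the chapter; the specific prompts and "Do NOT" lists here are illustrative, not the book's exact wording:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    """A specialist role: its mandate plus explicit 'Do NOT' boundaries."""
    name: str
    system_prompt: str
    forbidden: tuple  # actions this role must never take

# Illustrative prompts and constraints (assumed, not quoted from the chapter).
ROLES = {
    "architect": AgentRole("architect",
        "Design the system and define its interfaces.",
        ("write implementation code", "write tests")),
    "coder": AgentRole("coder",
        "Implement the design exactly as specified.",
        ("change the architecture", "skip parts of the spec")),
    "tester": AgentRole("tester",
        "Write and run adversarial tests to find bugs.",
        ("fix production code", "relax failing assertions")),
    "reviewer": AgentRole("reviewer",
        "Evaluate quality, security, performance, maintainability.",
        ("rewrite the code under review",)),
}
```

Keeping the constraints in data rather than prose makes it easy to inject the same boundaries into every prompt and to audit them in one place.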

  3. Orchestration pattern selection depends on task structure. Sequential pipelines are simplest and best for getting started. Parallel execution reduces total time when agents can analyze the same input independently (testing, review, and security scanning run simultaneously). Hierarchical delegation handles complex tasks by having a lead agent decompose work into sub-tasks for specialist workers. Event-driven orchestration suits CI/CD scenarios where agents react to repository events. Most production systems use a hybrid approach.
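The sequential and parallel patterns can be sketched in a few lines; the stand-in "agents" below are plain functions used only for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sequential(stages, artifact):
    """Sequential pipeline: each stage consumes the previous stage's output."""
    for stage in stages:
        artifact = stage(artifact)
    return artifact

def run_parallel(analyzers, artifact):
    """Parallel fan-out: independent agents analyze the same input at once."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(a, artifact) for a in analyzers]
        return [f.result() for f in futures]

# Stand-in agents for illustration (real agents would be LLM calls).
def design(task):
    return task + " -> designed"

def implement(spec):
    return spec + " -> coded"

artifact = run_sequential([design, implement], "feature X")
reviews = run_parallel([lambda c: "tests ok", lambda c: "review ok"], artifact)
```

A hybrid system composes these: a sequential backbone (design, then code) with a parallel fan-out for the independent analysis steps.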

  4. Artifact exchange is the most natural communication mechanism for software development. While shared context and message passing have their uses, software development is already organized around files -- source code, tests, configuration, documentation. Agents that produce and consume file-based artifacts integrate naturally with Git, CI/CD, and IDEs. Layer message passing on top for coordination metadata.
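A minimal sketch of file-based artifact exchange with coordination metadata layered on top (file and field names are assumptions for illustration):

```python
import json
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())  # stands in for a repo working tree

def publish(agent, name, content):
    """Write an artifact file plus a small metadata message for coordination."""
    (workdir / name).write_text(content)
    (workdir / (name + ".meta.json")).write_text(
        json.dumps({"producer": agent, "artifact": name}))

def consume(name):
    """A downstream agent reads the artifact and its coordination metadata."""
    meta = json.loads((workdir / (name + ".meta.json")).read_text())
    return meta, (workdir / name).read_text()

publish("architect", "design.md", "## Interface: parse(text) -> AST")
meta, body = consume("design.md")
```

Because the artifacts are ordinary files, the same outputs can be committed to Git, diffed in a PR, or picked up by CI without any adapter layer.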

  5. Context summarization keeps downstream agents focused and efficient. A 5,000-word design document can be summarized differently for the coder (interface definitions and constraints), the tester (expected behaviors and edge cases), and the reviewer (design principles and quality standards). Role-appropriate summaries prevent context window waste and keep each agent focused on what matters for its role.
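Role-appropriate summarization can be sketched as selecting only the sections each role needs. The section tags and role mapping below are illustrative; in practice the summaries would come from an LLM call, not a dictionary lookup:

```python
# A design document split into tagged sections (content is illustrative).
DESIGN_DOC = {
    "interfaces": "parse(text) -> AST; render(ast) -> str",
    "constraints": "no third-party parser libraries",
    "behaviors": "empty input returns an empty AST",
    "edge_cases": "unterminated strings, deeply nested blocks",
    "principles": "small composable passes",
    "standards": "every public function documented",
}

# Which sections matter to which role.
ROLE_SECTIONS = {
    "coder": ("interfaces", "constraints"),
    "tester": ("behaviors", "edge_cases"),
    "reviewer": ("principles", "standards"),
}

def summarize_for(role):
    """Return only the design-doc sections relevant to the given role."""
    return {k: DESIGN_DOC[k] for k in ROLE_SECTIONS[role]}
```

Each agent receives a fraction of the full document, so its context window holds only what its role actually needs.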

  6. Conflicts between agents are valuable information, not system failures. When the architect and coder disagree, it often reveals a design ambiguity. When the tester and reviewer conflict, it may expose an unconsidered requirement. Resolve conflicts through priority hierarchies (security always wins), evidence-based evaluation (stronger evidence prevails), mediator agents (an impartial third party weighs both sides), or human escalation (for high-stakes or genuinely balanced disagreements).
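A priority-hierarchy resolver with an evidence tiebreaker can be sketched as below. The numeric priorities and evidence scores are illustrative assumptions; only the rule that security always wins comes from the chapter:

```python
# Higher number wins; security outranks everything else.
PRIORITY = {"security": 3, "correctness": 2, "performance": 1, "style": 0}

def resolve(claims):
    """Pick the winning claim by concern priority, then evidence strength."""
    return max(claims, key=lambda c: (PRIORITY[c["concern"]], c["evidence"]))

claims = [
    {"agent": "reviewer", "concern": "performance", "evidence": 0.9,
     "verdict": "remove the bounds check for speed"},
    {"agent": "tester", "concern": "security", "evidence": 0.6,
     "verdict": "keep the bounds check"},
]
winner = resolve(claims)  # security wins despite weaker evidence
```

Mediator agents and human escalation slot in as fallbacks when priorities tie and the evidence is genuinely balanced.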

  7. Feedback loops must be bounded to prevent infinite cycling. The Three-Strike Rule gives each agent three attempts to resolve issues (failing tests, review feedback) before escalating to a human. Without a maximum iteration count, a stubborn bug could cause an infinite loop of attempted fixes. Bounded loops balance self-correction capability with practical time limits.
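The Three-Strike Rule reduces to a bounded loop with an escalation path. The function names here are assumptions standing in for real agent calls:

```python
MAX_ATTEMPTS = 3  # the Three-Strike Rule

def fix_until_green(run_tests, attempt_fix):
    """Give the agent MAX_ATTEMPTS tries, then escalate to a human."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if run_tests():
            return f"passed on attempt {attempt}"
        attempt_fix()  # the coder agent tries another fix
    return "escalated to human"

# Simulated stubborn bug: no fix attempt ever makes the tests pass.
state = {"fixed": False}
result = fix_until_green(lambda: state["fixed"], lambda: None)
```

Without the `range` bound, the same loop would spin forever on a bug the agent cannot solve; with it, a human sees the problem after three failed attempts.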

  8. Cross-agent verification and adversarial testing catch errors that no single agent would find. Having one agent check another's work leverages their different perspectives and biases. Adversarial testers, prompted to break code rather than validate it, find edge-case bugs that standard testers miss. Multi-layer review (correctness, security, performance, maintainability) provides defense in depth where multiple independent checks compensate for any single check's blind spots.
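Multi-layer review can be sketched as independent checks where any failure blocks the change. The layer predicates below are toy heuristics for illustration, not real analyzers:

```python
def layered_review(code, layers):
    """Run independent review layers; any failing layer blocks the merge."""
    findings = {name: check(code) for name, check in layers.items()}
    return all(findings.values()), findings

# Toy stand-ins for real correctness/security/maintainability reviewers.
layers = {
    "correctness": lambda c: "return" in c,
    "security": lambda c: "eval(" not in c,
    "maintainability": lambda c: len(c) < 500,
}

ok, findings = layered_review("def f(x):\n    return eval(x)", layers)
# The security layer catches what the correctness layer happily passes.
```

The value is in the independence: each layer has different blind spots, so a defect only ships if every layer misses it.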

  9. The coordination tax limits optimal team size to 3-5 agents for most tasks. Each additional agent adds overhead for message routing, conflict resolution, context sharing, and failure monitoring. Beyond the sweet spot, this overhead exceeds the benefit of further specialization. Scale beyond five agents using hierarchical teams (lead agents managing small sub-teams), domain-based partitioning (each domain gets its own mini-team), or dynamic team composition (assembling only the agents each task requires).
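Dynamic team composition with a size cap can be sketched as follows. The agent pool, task categories, and the cap of five are illustrative (the cap mirrors the 3-5 agent sweet spot above):

```python
AGENT_POOL = {"architect", "coder", "tester", "reviewer", "security", "docs"}

# Which specialists each kind of task actually requires (illustrative).
TASK_NEEDS = {
    "quick-fix": {"coder", "tester"},
    "new-feature": {"architect", "coder", "tester", "reviewer"},
    "auth-change": {"architect", "coder", "tester", "reviewer", "security"},
}

def compose_team(task_kind, cap=5):
    """Assemble only the agents this task needs, capped to bound the
    coordination tax; larger needs signal a hierarchical split."""
    team = TASK_NEEDS[task_kind] & AGENT_POOL
    if len(team) > cap:
        raise ValueError("team too large: split into hierarchical sub-teams")
    return team
```

A quick fix gets two agents; a security-sensitive change gets five; anything needing more is a signal to partition the work rather than grow the team.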

  10. Resource management is essential for cost control. Multi-agent pipelines can cost 10-50x more per task than single-agent approaches. Implement token budgets per agent, concurrency limits to stay within API rate limits, model tiering (cheaper models for routine tasks, capable models for complex reasoning), and caching of repeated analysis. Always set per-run budget caps with hard enforcement.
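A hard per-run budget cap and model tiering can be sketched like this. The cap, token counts, and model names are illustrative assumptions:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push the run past its hard cap."""

class RunBudget:
    """Per-run token budget shared by all agents, with hard enforcement."""
    def __init__(self, cap_tokens):
        self.cap = cap_tokens
        self.used = 0

    def charge(self, agent, tokens):
        if self.used + tokens > self.cap:
            raise BudgetExceeded(f"{agent} would exceed the run cap")
        self.used += tokens

def pick_model(task_complexity):
    """Model tiering: cheap model for routine work, capable model otherwise.
    Model names are placeholders, not real model identifiers."""
    return "small-fast-model" if task_complexity == "routine" else "large-model"

budget = RunBudget(cap_tokens=10_000)
budget.charge("coder", 6_000)       # within budget
# budget.charge("tester", 5_000)    # would raise BudgetExceeded
```

Raising rather than warning is the point: a hard cap turns a silent 50x cost overrun into an immediate, visible failure.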

  11. Comprehensive monitoring and observability are non-negotiable. Track agent-level metrics (execution time, token usage, success rate), pipeline-level metrics (total time, cost, feedback loop count), and communication metrics (message volume, conflict rates). Use structured logging with correlation IDs for debugging. Build trace visualizations that show the sequence of agent actions with timing and cost. Set alerts for anomalous behavior like budget overruns, excessive retries, or unusual failure rates.
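Structured logging with a correlation ID can be sketched in a few lines; the event and field names are illustrative:

```python
import json
import time
import uuid

def make_logger(sink):
    """Structured logger: every event carries this run's correlation ID."""
    run_id = str(uuid.uuid4())
    def log(agent, event, **fields):
        sink.append(json.dumps({"run_id": run_id, "agent": agent,
                                "event": event, "ts": time.time(), **fields}))
    return run_id, log

events = []
run_id, log = make_logger(events)
log("coder", "start")
log("coder", "finish", tokens=1200, seconds=4.2)
```

Because every event for one pipeline run shares the same `run_id`, a trace viewer can filter on it and reconstruct the full sequence of agent actions with timing and cost.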

  12. Checkpointing enables resilience in long-running pipelines. Save pipeline state at each phase boundary so interrupted runs can resume from the last successful phase rather than starting over. This is critical when using API-based agents that may encounter rate limits, timeouts, or transient errors. Combine checkpointing with idempotency to ensure each step executes at most once.
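Phase-boundary checkpointing with idempotent resume can be sketched as below; the checkpoint file format is an illustrative assumption:

```python
import json
import tempfile
from pathlib import Path

ckpt = Path(tempfile.mkdtemp()) / "pipeline.ckpt.json"

def run_pipeline(phases):
    """Resume from the last completed phase; each phase runs at most once."""
    done = json.loads(ckpt.read_text()) if ckpt.exists() else []
    for name, fn in phases:
        if name in done:
            continue  # idempotency: never re-execute a completed phase
        fn()
        done.append(name)
        ckpt.write_text(json.dumps(done))  # checkpoint at the phase boundary
    return done

calls = []
phases = [("design", lambda: calls.append("design")),
          ("code", lambda: calls.append("code"))]
run_pipeline(phases)
run_pipeline(phases)  # a retried run skips both phases; no duplicate work
```

If a rate limit kills the run between "design" and "code", the next invocation reads the checkpoint and starts directly at "code".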

  13. Human oversight remains essential even in highly automated pipelines. The pipeline should pause for human approval before merging pull requests, modifying critical infrastructure, or changing security-sensitive code. The appropriate level of human involvement depends on the project's risk profile: financial systems need more checkpoints than internal tools. Start with higher oversight and reduce it as the system proves reliable.
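An approval gate for sensitive actions can be sketched like this; the action names and the return convention are illustrative assumptions:

```python
# Actions that always require a human sign-off (illustrative list).
SENSITIVE = ("merge_pr", "modify_infra", "change_auth")

def execute(action, approved_by=None):
    """Pause the pipeline for human approval before sensitive actions."""
    if action in SENSITIVE and approved_by is None:
        return ("paused", action)  # blocks until a human approves
    return ("done", action)
```

Routine actions flow through automatically; `execute("merge_pr")` pauses until someone supplies `approved_by`. Tightening or relaxing oversight is then just editing the `SENSITIVE` list as the system proves itself.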

  14. Not every task needs multiple agents. Simple function generation, quick bug fixes, and small scripts are perfectly suited to a single agent. Multi-agent systems add coordination overhead that is only justified when the benefits of specialization -- better defect detection, architectural consistency, review quality, and throughput -- outweigh it. Start with a single agent and split to multiple agents only when you observe context loss, inconsistent results, or missed errors.