Case Study 1: The Four-Agent Development Team

Background

Kai Nakamura is a senior engineer at Streamline Analytics, a mid-stage startup that provides real-time data dashboards for e-commerce companies. The engineering team of twelve developers maintains a Python backend built on FastAPI with a PostgreSQL database and a React frontend. The codebase has grown to roughly 80,000 lines of Python and 60,000 lines of TypeScript over three years.

Streamline's customers have been requesting a feature for months: scheduled report generation. Customers want to configure recurring reports -- daily sales summaries, weekly inventory snapshots, monthly revenue breakdowns -- that are automatically generated and emailed as PDF attachments. The feature touches several areas of the codebase: the scheduling system, the report generation engine, the PDF rendering pipeline, the email delivery service, and the user-facing configuration UI.

Kai has been experimenting with AI coding assistants for six months and has noticed a recurring pattern. For simple tasks -- fixing a bug in a single file, adding a new endpoint, writing tests for an existing function -- a single AI agent works well. But for features that span multiple components, the agent's quality degrades. It loses track of design decisions made earlier in the conversation. It writes code that contradicts its own architectural recommendations. It produces tests that verify implementation details rather than requirements.

After reading about multi-agent development systems, Kai decides to try a four-agent approach for the scheduled reports feature. He sets up an Architect, a Coder, a Tester, and a Reviewer, each with its own system prompt, tool access, and focused responsibility.

Setting Up the Agent Team

The Architect Agent

Kai begins by configuring the Architect agent. He gives it read-only access to the codebase so it can understand existing patterns, plus the ability to write design documents. The system prompt is specific to Streamline's conventions:

You are a senior software architect at Streamline Analytics. Our stack
is FastAPI + PostgreSQL + Celery for background tasks. All new features
must follow our existing patterns:
- Service layer classes in src/services/
- Pydantic models in src/models/
- API routes in src/routes/
- Celery tasks in src/tasks/
- Database models in src/db/models/

Design the scheduled report generation feature. Produce:
1. Component breakdown with clear single responsibilities
2. Interface definitions with full type signatures
3. Data flow from user configuration through to email delivery
4. Database schema changes needed
5. Key design decisions with trade-off analysis

Do NOT write implementation code. Do NOT write tests.
Focus on structure, interfaces, and integration with existing systems.

Kai provides the Architect with the feature requirements from the product manager, the existing codebase structure, and the current database schema. After approximately 45 seconds of processing, the Architect produces a detailed design document.

What the Architect Produces

The Architect's output is a structured design document with five sections. For the component breakdown, it identifies six components:

  1. ReportScheduleService -- manages CRUD operations for report schedules, including validation that the requested report type exists and the cron expression is valid.
  2. ReportGenerationService -- orchestrates the report generation process by assembling data from the appropriate data sources, applying formatting, and producing the raw report content.
  3. ReportRendererService -- converts raw report content into PDF format using the existing WeasyPrint dependency that Streamline already uses for invoice generation.
  4. ReportDeliveryService -- sends generated reports via email using the existing EmailService, with retry logic for transient failures.
  5. ScheduledReportTask -- a Celery task that the scheduler invokes at the configured time, coordinating the generation, rendering, and delivery pipeline.
  6. ReportScheduleModel -- a new database model storing schedule configurations including report type, cron expression, recipient list, and status.
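The actual request/response models are Pydantic and the ReportScheduleModel is a SQLAlchemy table, per Streamline's conventions; as a stack-agnostic illustration, the schedule's shape can be sketched with stdlib dataclasses (the field names here are assumptions inferred from the component breakdown above, not the real schema):

```python
from dataclasses import dataclass, field
from enum import Enum


class ScheduleStatus(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"  # soft-deleted schedules keep their row


@dataclass
class ReportSchedule:
    """In-memory sketch of the ReportScheduleModel fields.

    The real model is a SQLAlchemy table; this stdlib version only
    shows the shape: report type, cron expression, recipients, status.
    """
    schedule_id: int
    user_id: int
    report_type: str          # e.g. "daily_sales_summary"
    cron_expression: str      # standard five-field cron string
    recipients: list[str] = field(default_factory=list)
    status: ScheduleStatus = ScheduleStatus.ACTIVE


sched = ReportSchedule(1, 42, "daily_sales_summary", "0 9 * * *",
                       ["ops@example.com"])
```

Soft deletion (design decision below) then amounts to flipping status to INACTIVE rather than removing the row.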

The interface definitions are precise. For example:

class ReportScheduleService:
    async def create_schedule(
        self, user_id: int, config: ReportScheduleCreate
    ) -> ReportSchedule:
        """Create a new report schedule.

        Raises:
            InvalidReportTypeError: If report_type is not recognized.
            InvalidCronExpressionError: If cron_expression is malformed.
            ScheduleLimitExceededError: If user exceeds max schedules (10).
        """

    async def update_schedule(
        self, schedule_id: int, user_id: int, updates: ReportScheduleUpdate
    ) -> ReportSchedule:
        """Update an existing schedule. Only the owner can modify."""

    async def delete_schedule(
        self, schedule_id: int, user_id: int
    ) -> None:
        """Soft-delete a schedule by setting status to 'inactive'."""

    async def list_schedules(
        self, user_id: int, active_only: bool = True
    ) -> list[ReportSchedule]:
        """List all schedules for a user."""

The Architect makes two notable design decisions. First, it recommends soft-deleting schedules rather than hard-deleting them, reasoning that customers may want to restore accidentally deleted schedules and that the audit trail is valuable. Second, it recommends a per-user limit of 10 active schedules to prevent resource exhaustion from a single customer configuring hundreds of reports.

The data flow section traces the path from "user creates schedule via API" through "Celery beat picks up scheduled task" to "report generated, rendered to PDF, emailed to recipients." The Architect explicitly notes that the Celery beat scheduler should use the database as its schedule source rather than the default file-based configuration, because Streamline's schedules are dynamic and user-configured.
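The generate, render, deliver pipeline that the ScheduledReportTask coordinates can be sketched as plain functions (function bodies here are placeholders standing in for the real services, not Streamline's code):

```python
def generate_report(report_type: str) -> str:
    """ReportGenerationService: assemble raw report content."""
    return f"<h1>{report_type}</h1><p>...data...</p>"


def render_pdf(html: str) -> bytes:
    """ReportRendererService: convert content to PDF bytes.

    Placeholder: the real service renders via WeasyPrint.
    """
    return html.encode("utf-8")


def deliver_report(pdf: bytes, recipients: list[str]) -> dict:
    """ReportDeliveryService: email the PDF; here, just record the send."""
    return {"recipients": recipients, "size": len(pdf)}


def run_scheduled_report(report_type: str, recipients: list[str]) -> dict:
    """ScheduledReportTask: coordinate the full pipeline.

    The real version is a Celery task that the beat scheduler invokes
    at the time given by the schedule's cron expression.
    """
    content = generate_report(report_type)
    pdf = render_pdf(content)
    return deliver_report(pdf, recipients)


result = run_scheduled_report("daily_sales_summary", ["ops@example.com"])
```

Keeping each stage a separate service is what lets the Tester and Reviewer later examine generation, rendering, and delivery independently.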

The Coder Agent

Kai feeds the Architect's design document to the Coder agent. The Coder's system prompt is:

You are a senior Python developer at Streamline Analytics. Implement
the design document provided by the architect. Follow these rules:
- Use the existing project structure (services in src/services/, etc.)
- Follow our FastAPI conventions for route definitions
- Use SQLAlchemy 2.0 async patterns matching existing models
- Include complete type hints and docstrings
- Handle all error cases defined in the interface specifications
- Follow PEP 8 strictly

Do NOT redesign the architecture. If you spot a potential design issue,
add a # DESIGN NOTE comment but implement as specified.
Do NOT write tests.

The Coder has access to the existing codebase for reference, plus write access to create new files. It produces seven files totaling approximately 650 lines of Python:


  • src/db/models/report_schedule.py -- the SQLAlchemy model
  • src/models/report_schedule.py -- the Pydantic request and response models
  • src/services/report_schedule.py -- the schedule CRUD service
  • src/services/report_generation.py -- the generation and rendering services
  • src/services/report_delivery.py -- the email delivery service
  • src/tasks/scheduled_reports.py -- the Celery task
  • src/routes/report_schedules.py -- the FastAPI routes

The Coder follows the Architect's design precisely. It implements soft deletion as specified, includes the per-user schedule limit with an appropriate error, and uses the database-backed Celery beat scheduler. Notably, the Coder adds a # DESIGN NOTE comment on the PDF rendering section:

# DESIGN NOTE: The design specifies using WeasyPrint for PDF rendering,
# but WeasyPrint requires system-level dependencies (cairo, pango) that
# may complicate containerized deployment. Consider fpdf2 as a lighter
# alternative if deployment issues arise. Implementing with WeasyPrint
# as specified.

This is exactly the kind of feedback that a multi-agent system should produce -- the Coder respects the design boundary but documents a potential issue for the team to consider.

The Tester Agent

The Tester agent receives both the design document and the implementation code. Its system prompt focuses it on adversarial thinking:

You are a senior QA engineer at Streamline Analytics. Write comprehensive
tests for the implementation provided. Our testing conventions:
- Use pytest with async support (pytest-asyncio)
- Use factory_boy for test data factories
- Test files go in tests/ mirroring the src/ structure
- Fixtures in tests/conftest.py

Focus on:
1. Happy path for every public method
2. Every error case in the interface specification
3. Edge cases: empty inputs, boundary values, concurrent access
4. Integration between components (schedule -> generation -> delivery)

Think adversarially: how can this code break?
Do NOT modify the implementation code.

The Tester produces approximately 450 lines of test code across four test files. Some of the tests are routine -- verifying that creating a schedule returns the correct fields, that listing schedules filters by user. But the adversarial prompt pushes the Tester to write several tests that catch real issues:

Test: Schedule Limit Boundary. The Tester creates exactly 10 schedules for a user, verifies the 11th is rejected, then soft-deletes one schedule and verifies that the limit now allows a new creation (since soft-deleted schedules should not count against the limit). This test reveals that the Coder's implementation counts all schedules, including soft-deleted ones. The limit check queries WHERE user_id = ? instead of WHERE user_id = ? AND status = 'active'.
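The bug is a one-clause difference in the count. In simplified in-memory form (the real check is a SQL query against the schedules table):

```python
LIMIT = 10

# A user with 9 active schedules and 1 soft-deleted one.
schedules = [
    {"user_id": 1, "status": "active"} for _ in range(9)
] + [{"user_id": 1, "status": "inactive"}]


def count_all(user_id: int) -> int:
    # Buggy version: WHERE user_id = ? -- counts soft-deleted rows too.
    return sum(1 for s in schedules if s["user_id"] == user_id)


def count_active(user_id: int) -> int:
    # Fixed version: WHERE user_id = ? AND status = 'active'.
    return sum(1 for s in schedules
               if s["user_id"] == user_id and s["status"] == "active")


# With 9 active + 1 soft-deleted, the buggy check wrongly blocks a new
# schedule; the fixed check correctly allows it.
buggy_blocks = count_all(1) >= LIMIT
fixed_blocks = count_active(1) >= LIMIT
```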

Test: Concurrent Schedule Creation. The Tester writes a test that attempts to create two schedules simultaneously for a user already at 9 schedules. Without proper locking, both requests could see 9 existing schedules and both succeed, violating the 10-schedule limit. This test exposes a race condition that the Coder did not handle.
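The race is a classic check-then-act problem. A minimal in-memory demonstration of the fix serializes the check and the insert behind a lock (in the real service the equivalent is database-level locking, e.g. SELECT ... FOR UPDATE on the user's schedule rows):

```python
import threading

LIMIT = 10
schedules = ["schedule"] * 9     # user already has 9 active schedules
lock = threading.Lock()
results = []


def create_schedule() -> None:
    # Holding the lock makes the count-check and insert atomic, so two
    # concurrent requests cannot both observe 9 schedules and both insert.
    with lock:
        if len(schedules) < LIMIT:
            schedules.append("schedule")
            results.append("created")
        else:
            results.append("rejected")


threads = [threading.Thread(target=create_schedule) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one creation succeeds; the 10-schedule limit holds.
```

Without the lock, two threads can both pass the `len(schedules) < LIMIT` check before either appends, which is exactly the violation the Tester's concurrent test provokes.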

Test: Invalid Cron Expressions. The Tester tries expressions like * * * * * * (six fields instead of five), 60 * * * * (minute value out of range), and an empty string. Two of these three cases are not handled by the current implementation, which passes the cron string directly to the Celery scheduler without pre-validation.
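A pre-validator for five-field cron strings needs only the stdlib. This sketch checks field count and numeric ranges; it deliberately omits lists, ranges, and step syntax (1,15, 1-5, */5), which a production validator would also need to handle:

```python
# Allowed value ranges for the five cron fields:
# minute, hour, day-of-month, month, day-of-week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def validate_cron(expression: str) -> bool:
    """Return True for a well-formed five-field cron expression.

    Simplified sketch: supports "*" and plain integers only.
    """
    fields = expression.split()
    if len(fields) != 5:
        return False          # catches "" and six-field expressions
    for value, (lo, hi) in zip(fields, FIELD_RANGES):
        if value == "*":
            continue
        if not value.isdigit() or not lo <= int(value) <= hi:
            return False      # catches out-of-range values like minute 60
    return True
```

All three of the Tester's malformed inputs fail this check before anything reaches the Celery scheduler.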

Test: Email Delivery Retry. The Tester simulates a transient SMTP failure on the first attempt and verifies that the delivery service retries. It then simulates three consecutive failures and verifies that the service records a permanent failure without silently dropping the report.
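The behavior this test demands can be sketched as a retry loop that records a permanent failure after the final attempt instead of silently dropping the report (function and exception names here are illustrative, not Streamline's):

```python
class TransientSMTPError(Exception):
    """Stand-in for a transient email delivery failure."""


def deliver_with_retry(send, max_attempts: int = 3) -> dict:
    """Call send() up to max_attempts times and record the outcome."""
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return {"status": "delivered", "attempts": attempt}
        except TransientSMTPError:
            continue
    # All attempts failed: record it rather than dropping the report.
    return {"status": "permanent_failure", "attempts": max_attempts}


calls = {"n": 0}

def flaky_send():
    """Fails once, then succeeds -- a transient SMTP hiccup."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientSMTPError


def always_fail():
    raise TransientSMTPError


ok = deliver_with_retry(flaky_send)        # delivered on the retry
failed = deliver_with_retry(always_fail)   # permanent failure recorded
```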

In total, the Tester identifies four test failures that correspond to three genuine bugs and one implementation gap. This is precisely the value of the adversarial tester role -- these issues would likely not be caught by an agent that wrote both the code and the tests.

The Reviewer Agent

The Reviewer receives the design, implementation, test results (including the four failures), and produces a structured review. Its system prompt:

You are a senior code reviewer at Streamline Analytics. Review the
implementation for quality, security, performance, and maintainability.
Reference specific files and line numbers. Categorize findings as:
- CRITICAL: Must fix before merge
- WARNING: Should fix before merge
- SUGGESTION: Consider for future improvement

Do NOT rewrite the code. Provide actionable feedback.

The Reviewer produces a report with 14 findings:

CRITICAL (2 findings):

  1. The email delivery service includes recipient email addresses in log messages at INFO level. In production, this exposes PII (personally identifiable information) in logs. Recommendation: log only a masked version or use a separate PII-safe audit log.
  2. The scheduled report task does not validate that the user's account is still active before generating a report. A churned customer's schedules would continue generating reports indefinitely, consuming resources.
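The fix for the PII finding, masking recipient addresses before they reach the logs, is small. A minimal sketch (the exact mask format is an assumption, not the Reviewer's recommendation):

```python
def mask_email(address: str) -> str:
    """Keep the first character of the local part; mask the rest.

    Example policy only -- any scheme that keeps raw addresses out of
    INFO-level log messages satisfies the finding.
    """
    local, _, domain = address.partition("@")
    if not local or not domain:
        return "***"          # malformed address: mask everything
    return f"{local[0]}***@{domain}"
```

A log line then reads "report delivered to k***@streamline.example" instead of exposing the full address.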

WARNING (5 findings):

  1. The report generation service makes separate database queries for each data point in the report. For a daily sales summary with 30 product categories, this produces 30+ queries. Recommendation: batch the queries or use a single aggregated query.
  2. No rate limiting on the schedule creation endpoint. A malicious user could rapidly create and delete schedules, causing unnecessary database churn.
  3. The Celery task does not set a hard time limit. A report generation that hangs (e.g., due to a slow database query) would occupy a worker indefinitely. Recommendation: add time_limit=300 (5 minutes) and soft_time_limit=240.
  4. The PDF renderer loads WeasyPrint on every invocation. WeasyPrint initialization is expensive. Consider a singleton or pooled instance.
  5. The cron expression validator accepts @daily and @weekly shortcuts, but the API documentation does not mention these. Either remove support or document them.

SUGGESTION (7 findings): Formatting improvements, docstring completeness, import ordering, and two opportunities to use existing utility functions from the codebase instead of re-implementing similar logic.

The Feedback Loop

With the Tester's four failures and the Reviewer's findings in hand, Kai routes the feedback back to the Coder agent for a second implementation pass. The Coder receives a focused prompt:

Address these issues in the implementation:
1. Fix the schedule limit check to exclude soft-deleted schedules
2. Add database-level locking for concurrent schedule creation
3. Add cron expression validation before passing to Celery
4. Add email retry failure recording
5. Mask email addresses in log messages
6. Add active-account check before report generation

The Coder produces an updated implementation that addresses all six issues. Kai then runs the Tester again against the updated code. This time, all tests pass. The Tester also writes additional tests for the new behaviors (masked logging, account status check). The Reviewer does a lighter second-pass review and approves with two remaining suggestions for future improvement.

Results and Analysis

What the Multi-Agent System Caught

The four-agent approach identified issues across four categories that a single agent would almost certainly have missed:

  • Logic bugs (3): The schedule limit counting bug, the race condition, and the missing cron validation. A single agent that wrote the code would have had the same blind spots when testing it.
  • Security issues (1): PII exposure in logs. The dedicated Reviewer, prompted to think about security, caught this because it examined the code with a security-focused lens rather than the implementation-focused lens of the Coder.
  • Operational issues (2): The missing Celery task timeout and the missing account-status check. These are issues that tend to surface only in production and are easy to overlook when focused on correctness.
  • Performance issues (1): The N+1 query pattern. The Reviewer, examining the code from a performance perspective, noticed a query pattern the Coder had written for clarity rather than efficiency.

Quantitative Comparison

Kai ran a controlled experiment. He asked a single agent (same model, same context about the codebase) to design, implement, test, and review the same feature in a single session. The comparison:

  Metric                     Single Agent                                    Four-Agent Team
  Design document quality    Adequate; missed soft-delete consideration      Comprehensive with trade-off analysis
  Implementation bugs found  1 (cron validation only)                        3 logic bugs + 1 security issue
  Review findings            4 suggestions, 0 critical                       2 critical, 5 warnings, 7 suggestions
  Total pipeline time        3 minutes                                       8 minutes
  Estimated API cost         $0.35                                           $1.80
  Fix cycles needed          1                                               1
  Final test pass rate       89% (missed edge cases)                         100%

The four-agent system cost approximately 5x more and took about 2.7x longer, but it produced a significantly more robust implementation. The two critical security and operational issues it caught would have eventually surfaced as production incidents -- which would have cost far more than the $1.45 difference in API fees to diagnose and fix.

What Kai Learned

Kai documents several lessons from the experiment:

Lesson 1: Role separation creates productive tension. The Architect's design decisions constrained the Coder in useful ways. The soft-delete requirement, the schedule limit, and the database-backed scheduler were all decisions the Coder might not have made independently. Conversely, the Coder's # DESIGN NOTE about WeasyPrint provided feedback the Architect needed.

Lesson 2: Adversarial testing requires explicit prompting. The Tester only wrote the race condition and boundary tests because its system prompt specifically instructed it to "think adversarially" and test concurrent access. Without that explicit instruction, the Tester would have produced more conventional tests.

Lesson 3: The Reviewer catches different things than the Tester. The Tester found logic bugs through execution. The Reviewer found security, performance, and operational issues through analysis. These are complementary perspectives that rarely coexist in a single agent's focus.

Lesson 4: The feedback loop is where quality happens. The first pass produced code with real bugs. The Tester and Reviewer surfaced those bugs. The second Coder pass fixed them. Without the feedback loop, the pipeline would have produced the same buggy code that a single agent would have produced. The quality improvement comes not from any individual agent being better, but from the iterative correction process.

Lesson 5: Cost is justified for high-value features. For a simple bug fix, the four-agent approach would be overkill. But for a customer-facing feature that touches scheduling, email delivery, and billing-adjacent functionality, the additional $1.45 in API costs prevented issues that could have caused customer complaints, data exposure, and engineering time spent on incident response.

Long-Term Impact

Three months after deploying the scheduled reports feature, Kai reviews the production metrics. The feature has zero critical incidents. The one minor issue -- a customer entering a cron expression that generates reports every minute -- was caught by the rate limiting that the Reviewer recommended and the Coder implemented in the second pass. Without the multi-agent review process, that safeguard would not have existed.

Kai adopts the four-agent pattern as the standard approach for all features that span multiple components or touch security-sensitive code. For simpler tasks, the team continues to use single-agent workflows. The key insight: multi-agent systems are not a replacement for single agents but a complement, used when the task's complexity and risk justify the additional coordination overhead.

The team creates a shared configuration for their four-agent setup, including the system prompts, tool access matrices, and feedback loop parameters. New team members can use the same multi-agent pipeline with no additional configuration, making the quality improvements systematic rather than dependent on individual developer skill.