Case Study 2: Multi-Agent System for Data Analysis
Chapter 32: AI Agents and Tool Use
Overview
Organization: Meridian Retail Group, an e-commerce company with 50 million monthly transactions across 12 regional markets.
Challenge: The business intelligence (BI) team receives approximately 80 ad-hoc data analysis requests per week from executives and product managers. Each request takes 2-8 hours to fulfill, creating a persistent 2-week backlog. Non-technical stakeholders cannot self-serve because the data is spread across multiple systems with complex schemas.
Goal: Build a multi-agent data analysis system that allows non-technical users to ask questions in natural language and receive accurate, well-visualized answers within minutes.
Problem Analysis
Meridian's data challenges were multifaceted:
- Data fragmentation: Transaction data lived in PostgreSQL, customer data in a data warehouse (BigQuery), marketing data in Google Analytics, and inventory data in a custom API.
- Schema complexity: The transaction database alone had 47 tables with non-obvious naming conventions (e.g., txn_li_adj for "transaction line item adjustments").
- Analysis ambiguity: Requests like "How are we performing in Q3?" required interpreting which metrics, which comparison period, and which markets the requester meant.
- Visualization needs: Executives expected charts and dashboards, not raw numbers.
- Accuracy requirements: Incorrect numbers could lead to bad business decisions. The BI team reported that 8% of their analyses required corrections after initial delivery.
A survey of stakeholders revealed:
- 72% of requests were variations of previously answered questions with different filters.
- 45% of BI team time was spent on query translation (understanding what the requester actually wanted).
- Average time from request to delivery: 4.3 days.
System Architecture
Why Multi-Agent?
The team initially attempted a single-agent approach but encountered three problems:
- Context overflow: A single agent needed to hold schema definitions for all four data sources, which exceeded manageable prompt sizes.
- Skill specialization: SQL generation, data interpretation, and visualization creation required different expertise.
- Error isolation: When the single agent made a SQL error, it often corrupted its reasoning about the visualization too.
A multi-agent architecture addressed all three issues through specialization, isolated contexts, and modular error handling.
Agent Roles
The system consisted of four agents coordinated by a manager:
┌────────────────────┐
│ Manager Agent │
│ │
│ - Interprets user │
│ request │
│ - Creates plan │
│ - Delegates tasks │
│ - Synthesizes │
│ results │
└────────┬────────────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────┐ ┌───────────────┐
│ SQL Agent │ │ Analysis │ │ Visualization │
│ │ │ Agent │ │ Agent │
│ - Schema access │ │ - Statistics │ │ - Chart │
│ - Query gen │ │ - Trends │ │ generation │
│ - Query exec │ │ - Insights │ │ - Formatting │
│ - Error fixing │ │ - Anomalies │ │ - Dashboards │
└──────────────────┘ └──────────────┘ └───────────────┘
│
▼
┌──────────────────┐
│ Clarification │
│ Agent │
│ │
│ - Disambiguate │
│ - Suggest metrics│
│ - Confirm scope │
└──────────────────┘
Manager Agent: Receives the user's natural language question, determines whether clarification is needed, creates a plan, delegates to specialized agents, and synthesizes the final response. Uses a high-capability model (GPT-4 or Claude).
Clarification Agent: When the user's request is ambiguous, generates clarifying questions. For example, "How are sales doing?" triggers: "Which metric would you like? (a) Total revenue, (b) Number of transactions, (c) Average order value. And for which time period and markets?"
SQL Agent: Generates and executes SQL queries against the appropriate data source. Has access to schema definitions, sample data, and a library of common query patterns. Uses an iterative approach: generate query, execute, check results, fix if needed.
Analysis Agent: Takes raw query results and performs statistical analysis: trend identification, year-over-year comparison, anomaly detection, segmentation analysis, and natural language interpretation of the numbers.
Visualization Agent: Takes analysis results and generates appropriate visualizations using Python (matplotlib/plotly). Selects chart types based on data characteristics (time series → line chart, comparison → bar chart, distribution → histogram).
Communication Protocol
Agents communicate through a shared state object:
from dataclasses import dataclass, field
@dataclass
class AnalysisState:
    """Shared state for the multi-agent analysis pipeline."""
    request_id: str
    user_query: str
    clarified_query: str | None = None
    plan: list[str] = field(default_factory=list)
    sql_queries: list[dict] = field(default_factory=list)
    query_results: list[dict] = field(default_factory=list)
    analysis: dict = field(default_factory=dict)
    visualizations: list[str] = field(default_factory=list)
    final_response: str | None = None
    errors: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
Each agent reads from and writes to this shared state. The manager agent controls the flow, deciding which agent to invoke next based on the current state.
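A minimal sketch of how the manager could drive that flow over the shared state; the agent functions, the needs_clarification check, plan_request, synthesize, and AgentError are illustrative names, not the production interfaces:
def run_pipeline(state: AnalysisState) -> AnalysisState:
    """Manager loop: clarify if needed, plan, delegate each step, then synthesize."""
    if needs_clarification(state.user_query):                 # hypothetical clarity check
        state.clarified_query = clarification_agent(state)    # hypothetical agent call
    state.plan = plan_request(state.clarified_query or state.user_query)  # hypothetical planner
    for step in state.plan:
        try:
            if step.startswith("sql"):
                state.sql_queries, state.query_results = sql_agent(state)
            elif step.startswith("analysis"):
                state.analysis = analysis_agent(state)
            elif step.startswith("visualization"):
                state.visualizations = visualization_agent(state)
        except AgentError as exc:                             # hypothetical error type
            state.errors.append({"step": step, "error": str(exc)})
    state.final_response = synthesize(state)                  # manager writes the final narrative
    return state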
Implementation Details
SQL Agent: Iterative Query Generation
The SQL Agent was the most complex component. It followed a four-step process:
Step 1: Schema Selection. Given the user's question, select the relevant tables from a schema catalog. The catalog stored table descriptions, column descriptions, and sample values. Rather than including all 47 tables in the prompt, the agent used embedding similarity to select the 5-10 most relevant tables.
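A minimal sketch of that selection step, assuming the catalog stores a precomputed embedding per table description and that an embed() call is available (both are assumptions, not the team's actual interface):
import numpy as np
def select_tables(question: str, catalog: dict[str, dict], k: int = 8) -> list[str]:
    """Rank catalog tables by cosine similarity between the question and each table description."""
    q_vec = np.asarray(embed(question))                       # hypothetical embedding call
    scores = {}
    for table, meta in catalog.items():
        t_vec = np.asarray(meta["embedding"])                 # precomputed description embedding
        scores[table] = float(np.dot(q_vec, t_vec) /
                              (np.linalg.norm(q_vec) * np.linalg.norm(t_vec)))
    return sorted(scores, key=scores.get, reverse=True)[:k]   # keep the k most relevant tables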
Step 2: Query Generation. Generate SQL using the selected schema context and the clarified user question. The prompt included 5 few-shot examples of similar queries.
Step 3: Query Validation. Before executing, validate the query (a code sketch of these checks follows the list):
- Syntax check using sqlparse.
- Table and column name verification against the schema.
- Safety check: reject queries with DROP, DELETE, UPDATE, INSERT, or other write operations.
- Cost estimation: reject queries estimated to scan more than 10GB.
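A minimal sketch of these checks, assuming the schema catalog maps table names to column sets; the estimate_scan_bytes helper is a stand-in for a dry-run cost estimate and is an assumption:
import re
import sqlparse
WRITE_KEYWORDS = re.compile(r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE)\b", re.IGNORECASE)
def validate_query(sql: str, schema: dict[str, set[str]], max_scan_bytes: int = 10 * 1024**3) -> list[str]:
    """Return a list of validation problems; an empty list means the query may be executed."""
    problems = []
    if not sqlparse.parse(sql):                               # sqlparse is non-validating; this catches empty input
        problems.append("empty or unparseable query")
    if WRITE_KEYWORDS.search(sql):                            # safety: read-only queries only
        problems.append("write operation detected")
    for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        if table not in schema:                               # verify table names against the catalog
            problems.append(f"unknown table: {table}")
    if estimate_scan_bytes(sql) > max_scan_bytes:             # hypothetical cost-estimation helper
        problems.append("estimated scan exceeds the 10GB limit")
    return problems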
Step 4: Execution and Error Recovery. Execute the query and handle errors iteratively:
Generate SQL → Validate → Execute → Success? → Return results
                             │
                           Error
                             │
                  Analyze error message
                             │
               Fix SQL (up to 3 attempts)
                             │
                        Re-execute
Common errors and fixes (a code sketch of the retry loop follows this list):
- column not found: Check for typos, suggest similar column names.
- ambiguous column reference: Add table aliases.
- timeout: Add LIMIT, simplify JOINs, or add date filters.
- division by zero: Add NULLIF or CASE WHEN guards.
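A minimal sketch of the execute-and-fix loop; run_query and fix_sql_with_llm are stand-ins for the database call and the model call that receives the failed query plus its error message (both names are assumptions):
def execute_with_recovery(sql: str, max_attempts: int = 3) -> tuple[list[dict], list[dict]]:
    """Execute a query, feeding the error message and failed SQL back to the model on each failure."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql), errors                     # hypothetical database call
        except Exception as exc:
            errors.append({"attempt": attempt, "sql": sql, "error": str(exc)})
            # The failed query and the raw error message are the key inputs for the repair step
            sql = fix_sql_with_llm(sql, error=str(exc))       # hypothetical LLM repair call
    raise RuntimeError(f"query failed after {max_attempts} attempts: {errors[-1]['error']}")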
Analysis Agent: Statistical Interpretation
The Analysis Agent received raw query results and produced structured insights:
analysis_prompt = """
Given the following data query results, provide a comprehensive analysis:
Data:
{query_results}
Context:
- User question: {user_query}
- Time period: {time_period}
- Comparison period: {comparison_period}
Provide:
1. Key metrics summary (headline numbers)
2. Trends (increasing, decreasing, stable, with magnitude)
3. Comparisons to previous period (absolute and percentage change)
4. Notable anomalies or outliers
5. Segment breakdowns (if applicable)
6. Recommended next analyses
Format your response as structured JSON.
"""
The Analysis Agent also performed basic statistical tests (the first two are sketched in code after this list):
- Trend detection: Linear regression on time series data to identify significant trends.
- Anomaly detection: Z-score analysis to flag data points more than 2 standard deviations from the mean.
- Segmentation: Breakdown by key dimensions (market, product category, customer segment).
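A minimal sketch of the trend and anomaly checks using NumPy; the thresholds and return shapes are illustrative, and a significance test on the slope would sit on top of this in practice:
import numpy as np
def detect_trend(values: list[float]) -> dict:
    """Fit a least-squares line to a time series and report direction and slope."""
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, 1)
    direction = "increasing" if slope > 0 else "decreasing" if slope < 0 else "stable"
    return {"direction": direction, "slope_per_period": float(slope)}
def flag_anomalies(values: list[float], z_threshold: float = 2.0) -> list[int]:
    """Return indices of points more than z_threshold standard deviations from the mean."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()
    if std == 0:                                              # constant series: nothing to flag
        return []
    z = (arr - arr.mean()) / std
    return [i for i, score in enumerate(z) if abs(score) > z_threshold]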
Visualization Agent: Automated Chart Selection
The Visualization Agent selected chart types based on data characteristics:
| Data Type | Recommended Chart | Conditions |
|---|---|---|
| Time series (single) | Line chart | Continuous time dimension |
| Time series (multiple) | Multi-line chart | 2-6 series |
| Comparison (few categories) | Bar chart | 2-15 categories |
| Comparison (many categories) | Horizontal bar | 15+ categories |
| Distribution | Histogram | Single continuous variable |
| Proportion | Pie / donut chart | 2-6 categories summing to 100% |
| Correlation | Scatter plot | Two continuous variables |
| Geographic | Choropleth map | Data with geographic dimension |
The agent generated Python code using plotly, executed it in a sandboxed environment, and returned the chart as a PNG image and an interactive HTML file.
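A minimal sketch of that routing logic; the profile dictionary fields are assumptions about how the agent summarizes a result set before choosing a chart:
def choose_chart(profile: dict) -> str:
    """Map a lightweight profile of the query results to one of the chart types in the table above."""
    if profile.get("has_geo_dimension"):
        return "choropleth"
    if profile.get("has_time_dimension"):
        return "line" if profile.get("num_series", 1) == 1 else "multi_line"
    if profile.get("is_distribution"):
        return "histogram"
    if profile.get("is_proportion") and profile.get("num_categories", 0) <= 6:
        return "pie"
    if profile.get("num_continuous_vars", 0) >= 2:
        return "scatter"
    # Default comparison case: vertical bars for few categories, horizontal for many
    return "bar" if profile.get("num_categories", 0) <= 15 else "horizontal_bar"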
Example Walkthrough
User query: "How did our top 5 product categories perform in Q3 compared to Q2, broken down by region?"
Manager Agent:
Plan:
1. [SQL Agent] Identify the top 5 product categories by Q3 revenue
2. [SQL Agent] Query Q2 and Q3 revenue for those categories, broken down by region
3. [Analysis Agent] Compute QoQ changes, identify trends by region
4. [Visualization Agent] Create comparison charts
5. [Manager] Synthesize narrative response
SQL Agent (Step 1 - Top categories):
SELECT
pc.category_name,
SUM(CASE WHEN t.transaction_date BETWEEN '2024-07-01' AND '2024-09-30'
THEN tli.line_total ELSE 0 END) AS q3_revenue,
SUM(CASE WHEN t.transaction_date BETWEEN '2024-04-01' AND '2024-06-30'
THEN tli.line_total ELSE 0 END) AS q2_revenue
FROM transactions t
JOIN transaction_line_items tli ON t.txn_id = tli.txn_id
JOIN products p ON tli.product_id = p.product_id
JOIN product_categories pc ON p.category_id = pc.category_id
WHERE t.transaction_date BETWEEN '2024-04-01' AND '2024-09-30'
GROUP BY pc.category_name
ORDER BY q3_revenue DESC
LIMIT 5;
SQL Agent (Step 2 - Regional breakdown):
SELECT
pc.category_name,
r.region_name,
SUM(CASE WHEN t.transaction_date BETWEEN '2024-07-01' AND '2024-09-30'
THEN tli.line_total ELSE 0 END) AS q3_revenue,
SUM(CASE WHEN t.transaction_date BETWEEN '2024-04-01' AND '2024-06-30'
THEN tli.line_total ELSE 0 END) AS q2_revenue
FROM transactions t
JOIN transaction_line_items tli ON t.txn_id = tli.txn_id
JOIN products p ON tli.product_id = p.product_id
JOIN product_categories pc ON p.category_id = pc.category_id
JOIN regions r ON t.region_id = r.region_id
WHERE pc.category_name IN ('Electronics', 'Home & Garden', 'Fashion', 'Health', 'Sports')
AND t.transaction_date BETWEEN '2024-04-01' AND '2024-09-30'
GROUP BY pc.category_name, r.region_name
ORDER BY pc.category_name, q3_revenue DESC;
Analysis Agent output:
{
"headline": "Top 5 categories generated $142M in Q3, up 8.3% from Q2",
"category_performance": [
{"name": "Electronics", "q3": "$52.1M", "q2": "$48.3M", "change": "+7.9%"},
{"name": "Home & Garden", "q3": "$31.4M", "q2": "$28.1M", "change": "+11.7%"},
{"name": "Fashion", "q3": "$28.7M", "q2": "$30.2M", "change": "-5.0%"},
{"name": "Health", "q3": "$16.8M", "q2": "$14.9M", "change": "+12.8%"},
{"name": "Sports", "q3": "$13.0M", "q2": "$9.6M", "change": "+35.4%"}
],
"regional_insights": [
"North America drove 62% of growth, led by Electronics (+12% QoQ)",
"Europe showed declining Fashion sales across all regions (-8% avg)",
"APAC Sports category grew 52% QoQ, the fastest growing segment"
],
"anomalies": [
"Fashion decline unusual for Q3 (typically a strong quarter)"
],
"recommended_next": [
"Investigate Fashion decline: pricing changes? competitor activity?",
"Deep dive into APAC Sports growth drivers"
]
}
Visualization Agent produced three charts:
1. Grouped bar chart: Q2 vs. Q3 revenue by category.
2. Heatmap: QoQ percentage change by category and region.
3. Stacked bar chart: Regional contribution to each category.
Manager Agent synthesized a final narrative response with charts embedded, key insights highlighted, and recommendations for further analysis.
Results
Quantitative Evaluation
The system was evaluated on 200 historical analysis requests:
| Metric | BI Team | Multi-Agent System | Improvement |
|---|---|---|---|
| Median response time | 4.3 days | 3.2 minutes | 99.9% |
| SQL accuracy (correct results) | 92% | 88% | -4% |
| SQL accuracy after error recovery | 92% | 95% | +3% |
| Analysis quality (1-5, expert rated) | 4.2 | 3.9 | -7% |
| Chart quality (1-5) | 4.0 | 4.1 | +2.5% |
| Cost per request | $320* | $3.50 | 99% |
* Based on BI analyst salary, overhead, and tooling costs.
Key observations:
- Initial SQL accuracy was lower than that of human analysts, but the iterative error recovery process brought accuracy above the human baseline.
- Analysis quality was slightly lower because the agent sometimes missed context that experienced analysts would know (e.g., a marketing campaign that explains a revenue spike).
- Chart quality was rated slightly higher because the agent consistently chose appropriate chart types, whereas human analysts sometimes defaulted to suboptimal choices.
Impact on Operations
After three months of deployment:
- The BI request backlog was eliminated within 2 weeks.
- 65% of requests were fully resolved by the agent without human review.
- 30% required minor corrections from a BI analyst (average review time: 15 minutes).
- 5% required substantial rework (complex multi-dataset joins, unusual metrics).
- The BI team refocused from ad-hoc requests to strategic dashboard development and data infrastructure improvements.
Technical Challenges and Solutions
Challenge 1: Schema Understanding
Problem: The SQL Agent frequently generated queries referencing non-existent columns or using incorrect join paths.
Solution: Built a "schema assistant" that preprocessed the schema into a richly annotated format:
- Added natural language descriptions to all tables and columns.
- Documented common join paths.
- Provided sample values for enumerated columns.
- Created a "query cookbook" with 50 example queries covering common patterns.
Embedding the enriched schema descriptions and using semantic search to select relevant tables reduced schema-related errors by 73%.
Challenge 2: Ambiguous Requests
Problem: Requests like "Show me customer trends" were too vague. The agent would guess (often incorrectly) what metrics and timeframes the user wanted.
Solution: Implemented the Clarification Agent as a mandatory first step for requests below a confidence threshold. The manager agent scored request clarity on a 1-5 scale; requests scoring below 3 triggered clarification. This added a round-trip to the conversation but dramatically improved accuracy.
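A minimal sketch of that gate; score_clarity, generate_clarifying_questions, and ask_user are hypothetical helpers standing in for the model calls and the user round-trip:
CLARITY_THRESHOLD = 3
def maybe_clarify(state: AnalysisState) -> AnalysisState:
    """Route ambiguous requests through the Clarification Agent before any planning happens."""
    clarity = score_clarity(state.user_query)                 # hypothetical 1-5 rating from the manager model
    state.metadata["clarity_score"] = clarity
    if clarity < CLARITY_THRESHOLD:
        questions = generate_clarifying_questions(state.user_query)  # hypothetical Clarification Agent call
        state.clarified_query = ask_user(questions)                  # hypothetical round-trip to the user
    else:
        state.clarified_query = state.user_query
    return state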
Challenge 3: Agent Coordination Overhead
Problem: The four-agent system sometimes spent more tokens on inter-agent communication than on the actual analysis.
Solution: Implemented a "fast path" for simple requests. If the manager determined the request was straightforward (single table, single metric, no ambiguity), it bypassed the multi-agent pipeline and used a single SQL+visualization call. This handled 40% of requests at one-quarter of the cost.
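A minimal sketch of the routing decision; is_simple and single_agent_answer are assumptions about how the shortcut was implemented, and run_pipeline refers to the manager loop sketched earlier:
def handle_request(state: AnalysisState) -> AnalysisState:
    """Send straightforward requests down a single-call path; everything else gets the full pipeline."""
    if is_simple(state.user_query):          # hypothetical check: single table, single metric, no ambiguity
        state.metadata["path"] = "fast"
        return single_agent_answer(state)    # one combined SQL + visualization call (hypothetical)
    state.metadata["path"] = "full"
    return run_pipeline(state)               # the multi-agent loop sketched in the Communication Protocol section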
Challenge 4: Result Consistency
Problem: Running the same query twice sometimes produced slightly different analyses because the Analysis Agent's interpretation was non-deterministic.
Solution: Set temperature to 0 for the SQL Agent (deterministic queries) and 0.3 for the Analysis Agent (some creativity in interpretation, but mostly consistent). Added a result caching layer so identical queries within a 1-hour window returned the same results.
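A minimal sketch of the result cache, keyed on normalized SQL text with a one-hour TTL; the in-memory dictionary and the run_query call are illustrative simplifications of whatever the team actually deployed:
import hashlib
import time
_CACHE: dict[str, tuple[float, list[dict]]] = {}
CACHE_TTL_SECONDS = 3600                                      # identical queries within one hour reuse results
def cached_run_query(sql: str) -> list[dict]:
    """Return cached results for an identical query seen within the TTL, otherwise execute and store."""
    key = hashlib.sha256(" ".join(sql.split()).lower().encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    results = run_query(sql)                                  # hypothetical database call
    _CACHE[key] = (time.time(), results)
    return results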
Lessons Learned
- Specialization through multi-agent design pays off for complex domains. Each agent could be optimized independently: the SQL Agent with schema-focused prompts, the Analysis Agent with statistical methods, and the Visualization Agent with chart design patterns.
- The "fast path" is essential for cost control. Not every request needs the full multi-agent pipeline. A simple router that identifies straightforward requests and handles them with a single agent reduced average cost by 40%.
- Schema quality is the foundation. Investing in rich schema annotations and a query cookbook had more impact on accuracy than any amount of prompt engineering.
- Clarification beats assumption. The Clarification Agent initially seemed like friction, but it improved accuracy from 78% to 95% on ambiguous requests. Users preferred being asked one clarifying question over receiving an incorrect analysis.
- Error recovery is the SQL Agent's superpower. The iterative generate-execute-fix loop allowed the agent to achieve higher accuracy than the initial generation alone. The key was providing the error message and the failed query to the model for analysis.
- Multi-agent systems need observability. With four agents, debugging failures required tracing through multiple agent calls. The team built a trace viewer that showed the full inter-agent communication for each request, which was essential for identifying and fixing issues.
Key Takeaways
- Multi-agent architectures excel when tasks require diverse expertise that benefits from isolation and specialization.
- Shared state (blackboard architecture) provides a clean coordination mechanism that avoids complex message passing.
- Schema quality and enrichment are the most impactful investments for data analysis agents.
- Fast-path routing for simple requests dramatically reduces average cost in multi-agent systems.
- The combination of clarification, iterative error recovery, and human review creates a robust production system.