Case Study 1: Building a Research Assistant Agent
Chapter 32: AI Agents and Tool Use
Overview
- Organization: NovaSci Analytics, a biotech consulting firm with 120 analysts.
- Challenge: Analysts spend 4-6 hours per research request, manually searching PubMed, parsing PDFs, cross-referencing findings, and writing summaries. The firm handles approximately 200 research requests per month, and the growing backlog threatens client satisfaction.
- Goal: Build an AI research assistant agent that reduces research time by 70% while maintaining the quality and accuracy standards required in biotech consulting.
Problem Analysis
NovaSci's research workflow involved several pain points:
- Manual search: Analysts manually constructed PubMed queries, often missing relevant papers due to suboptimal search terms.
- PDF processing: Downloaded papers had to be manually read and excerpted. A single request might involve 10-30 papers.
- Cross-referencing: Findings from different papers needed to be reconciled, with contradictions identified and resolved.
- Citation tracking: Every claim in the final report needed a proper citation, requiring meticulous manual tracking.
- Quality variance: Junior analysts sometimes missed key papers or misinterpreted findings, leading to inconsistent quality.
An internal audit revealed:
- Average time per research request: 5.2 hours.
- 15% of reports required revision due to missed papers or incorrect interpretations.
- Analysts spent 60% of their time on mechanical tasks (searching, reading, excerpting) rather than higher-order analysis.
System Architecture
Design Goals
The team established the following requirements:
| Requirement | Specification |
|---|---|
| Accuracy | All citations verifiable; zero hallucinated references |
| Coverage | Find at least 90% of papers a senior analyst would find |
| Speed | Complete initial research in under 30 minutes |
| Traceability | Every claim linked to a specific paper and passage |
| Human oversight | Analyst reviews and approves before client delivery |
Agent Architecture
The research assistant was designed as a ReAct agent with five tools:
┌─────────────────────────────────────────────────────┐
│ Research Assistant Agent │
│ │
│ System Prompt: You are a biotech research analyst. │
│ Follow the ReAct pattern to systematically │
│ research the given topic. │
│ │
│ Tools: │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ search_pubmed│ │ fetch_paper │ │ extract_data │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌───────────────┐ │
│ │ search_web │ │ write_section │ │
│ └──────────────┘ └───────────────┘ │
│ │
│ Memory: Working scratchpad + findings database │
└─────────────────────────────────────────────────────┘
Tool Descriptions:
- search_pubmed(query, max_results, date_range): Searches PubMed using the Entrez API. Returns paper titles, abstracts, PMIDs, and publication dates (a minimal wrapper sketch follows this list).
- fetch_paper(pmid): Retrieves the full text of a paper from PubMed Central (when available) or falls back to the abstract. Returns structured sections (introduction, methods, results, discussion).
- extract_data(text, extraction_prompt): Uses an LLM to extract specific information from a paper section (e.g., "Extract all reported efficacy percentages and their confidence intervals").
- search_web(query): Searches the web for supplementary information such as clinical trial registrations, FDA documents, or press releases.
- write_section(title, content, citations): Writes a section of the research report with proper citations.
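To make the first tool concrete, here is a minimal sketch of how search_pubmed might wrap the NCBI E-utilities esearch endpoint. The signature mirrors the description above, but the implementation is an assumption rather than NovaSci's production code; the real tool would additionally resolve titles, abstracts, and dates (for example via a follow-up esummary call).

# Hypothetical sketch of the search_pubmed tool as a thin wrapper around the
# public NCBI E-utilities "esearch" endpoint. Names and defaults are
# illustrative, not NovaSci's production implementation.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, max_results: int = 20,
                  date_range: str | None = None) -> list[str]:
    """Return PMIDs matching the query, optionally restricted to a year range."""
    params = {"db": "pubmed", "term": query,
              "retmax": max_results, "retmode": "json"}
    if date_range:                      # e.g. "2020-2024"
        start, end = date_range.split("-")
        params.update({"datetype": "pdat", "mindate": start, "maxdate": end})
    resp = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]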
Planning Strategy
The agent follows a structured research methodology:
Phase 1: Query Formulation
- Decompose the research question into sub-questions
- Generate multiple search queries per sub-question
- Identify key terms, synonyms, and MeSH headings
Phase 2: Literature Search
- Execute searches across PubMed and web sources
- Collect candidate papers (target: 30-50 per topic)
- Rank by relevance and recency
Phase 3: Deep Reading
- Fetch full text of top 15-20 papers
- Extract key findings, methods, and statistics
- Store in structured findings database
Phase 4: Synthesis
- Group findings by sub-question
- Identify consensus, contradictions, and gaps
- Draft report sections
Phase 5: Quality Check
- Verify all citations exist and are correctly attributed
- Check for logical consistency
- Flag areas of uncertainty for analyst review
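As an illustration of Phase 1, the sketch below shows one way the decomposition step could be implemented as a single structured LLM call. The llm callable, the prompt wording, and the JSON schema are assumptions for illustration, not the firm's actual prompt.

# Illustrative Phase 1 sketch: decompose the research question and generate
# diverse queries in one structured LLM call. `llm` is any text-in/text-out
# model client; the prompt and schema are assumptions.
import json

def formulate_queries(research_question: str, llm) -> dict:
    prompt = (
        "Decompose the research question below into 3-5 sub-questions. "
        "For each sub-question, propose several PubMed queries using synonyms "
        "and MeSH headings. Respond with JSON of the form "
        '{"sub_questions": [{"question": "...", "queries": ["..."]}]}.\n\n'
        f"Research question: {research_question}"
    )
    return json.loads(llm(prompt))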
Implementation Details
The ReAct Loop in Practice
Here is an abbreviated trace from an actual research request: "Summarize recent evidence on GLP-1 receptor agonists for non-alcoholic fatty liver disease (NAFLD)."
Thought 1: I need to research GLP-1 receptor agonists (GLP-1 RAs) for NAFLD/NASH.
Let me start by searching PubMed with multiple queries to ensure comprehensive coverage.
I'll begin with the most specific query and broaden if needed.
Action 1: search_pubmed(
query="GLP-1 receptor agonist NAFLD NASH clinical trial",
max_results=20,
date_range="2020-2024"
)
Observation 1: Found 18 results including:
- PMID:38234561 "Semaglutide for NASH: Phase 3 Results" (2024)
- PMID:37891234 "Liraglutide reduces hepatic steatosis..." (2023)
- PMID:37456789 "Meta-analysis of GLP-1 RAs in NAFLD" (2023)
... [15 more results]
Thought 2: Good initial results. I have a meta-analysis (PMID:37456789) which will
be very valuable. Let me also search with broader terms to catch papers that might
use different terminology (e.g., MASLD, the new nomenclature).
Action 2: search_pubmed(
query="GLP-1 agonist MASLD steatotic liver disease",
max_results=20,
date_range="2022-2024"
)
Observation 2: Found 12 results, 7 new (not in previous results):
- PMID:38567890 "MASLD treatment with incretin-based therapies" (2024)
- PMID:38123456 "Tirzepatide dual agonist liver outcomes" (2024)
... [5 more new results]
Thought 3: Good, I caught papers using the new MASLD nomenclature. Now let me
fetch the meta-analysis first as it will give me the best overview of the evidence.
Action 3: fetch_paper(pmid="37456789")
Observation 3: Full text retrieved. Key findings:
- 14 RCTs included (n=2,345 patients)
- GLP-1 RAs significantly reduced hepatic steatosis (MD: -3.92%, 95% CI: -5.21 to -2.63)
- Histological improvement in NASH: OR 3.45 (95% CI: 2.12-5.62)
- Significant reduction in ALT levels: MD -12.3 IU/L (95% CI: -16.8 to -7.8)
...
Thought 4: The meta-analysis provides strong quantitative evidence. Now I should
read the Phase 3 semaglutide trial for the most recent high-quality evidence,
and also the tirzepatide paper since dual agonists are emerging.
...
The full trace continued for 23 steps over approximately 18 minutes, ultimately producing a 12-page research report with 24 citations.
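The loop driving a trace like this can be sketched in a few lines. Everything here is illustrative: the action format (JSON arguments) is simplified relative to the keyword-style calls shown above, and the real implementation adds scratchpad updates, retries, and token accounting.

# Minimal ReAct driver sketch (not NovaSci's production code). `tools` maps
# tool names to Python callables; `llm` is a text-in/text-out model client.
import json
import re

SYSTEM_PROMPT = (
    "You are a biotech research analyst. Follow the ReAct pattern: emit a "
    "Thought, then either an Action as `Action: tool_name({...json args...})` "
    "or a final `Answer:` block.\n\n"
)
ACTION_RE = re.compile(r"Action:\s*(\w+)\((\{.*?\})\)", re.DOTALL)
MAX_STEPS = 25  # the step limit the team eventually settled on

def run_react(task: str, tools: dict, llm) -> str:
    transcript = f"Task: {task}\n"
    for step in range(1, MAX_STEPS + 1):
        response = llm(SYSTEM_PROMPT + transcript)   # next Thought + Action/Answer
        transcript += response
        match = ACTION_RE.search(response)
        if match is None:                            # no tool call => final answer
            return response.split("Answer:", 1)[-1].strip()
        name, args = match.group(1), json.loads(match.group(2))
        try:
            observation = tools[name](**args)        # execute the chosen tool
        except Exception as exc:                     # surface failures as observations
            observation = f"ERROR: {exc}"
        transcript += f"\nObservation {step}: {observation}\n"
    return "Step limit reached; returning partial findings for analyst review."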
Working Memory Design
The agent maintained a structured scratchpad throughout the research process:
{
"research_question": "GLP-1 RAs for NAFLD/NASH",
"sub_questions": [
"What is the clinical efficacy of GLP-1 RAs for NAFLD?",
"Which GLP-1 RAs have the strongest evidence?",
"What are the mechanisms of action in liver disease?",
"What are the safety considerations?"
],
"papers_reviewed": 19,
"key_findings": [
{
"finding": "GLP-1 RAs reduce hepatic steatosis by ~4%",
"source": "PMID:37456789",
"confidence": "high (meta-analysis of 14 RCTs)"
},
{
"finding": "Semaglutide 0.4mg achieved NASH resolution in 59% vs 17% placebo",
"source": "PMID:38234561",
"confidence": "high (Phase 3 RCT)"
}
],
"contradictions": [
"Fibrosis improvement: mixed results across studies"
],
"gaps": [
"Limited long-term data beyond 72 weeks",
"No head-to-head trials between different GLP-1 RAs"
]
}
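In code, the scratchpad can be a small typed structure that the agent updates after each observation and re-serializes into its context. The sketch below is one plausible shape, with fields mirroring the JSON above; the exact schema and helper methods are assumptions.

# Sketch of a typed scratchpad mirroring the JSON above; names and methods
# are illustrative, not the team's exact schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Finding:
    finding: str
    source: str        # PMID the claim came from
    confidence: str    # e.g. "high (meta-analysis of 14 RCTs)"

@dataclass
class Scratchpad:
    research_question: str
    sub_questions: list[str] = field(default_factory=list)
    papers_reviewed: int = 0
    key_findings: list[Finding] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)

    def add_finding(self, finding: str, source: str, confidence: str) -> None:
        self.key_findings.append(Finding(finding, source, confidence))

    def to_prompt(self) -> str:
        """Serialize for injection into the agent's context window."""
        return json.dumps(asdict(self), indent=2)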
Error Handling
The agent encountered several errors during development and testing:
Problem 1: Full text unavailable. Many papers are behind paywalls. When fetch_paper failed to retrieve full text, the agent fell back to analyzing the abstract and searching for the paper on preprint servers.
Problem 2: Contradictory findings. Different papers reported different efficacy numbers. The agent was instructed to prioritize meta-analyses over individual studies and to explicitly flag contradictions in the report.
Problem 3: Hallucinated citations. In early testing, the agent sometimes generated plausible-sounding but non-existent paper references. The solution was a citation verification step: every PMID cited in the report was verified against the actual search results and fetched papers stored in the scratchpad.
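The fix for Problem 3 reduces to a set comparison: every PMID cited in the draft must appear among the PMIDs actually returned by searches or fetched during the run. A minimal sketch, with an illustrative regex and function name:

# Citation verification sketch: cited PMIDs must be a subset of retrieved PMIDs.
import re

PMID_RE = re.compile(r"PMID:\s*(\d+)")

def verify_citations(report: str, retrieved_pmids: set[str]) -> list[str]:
    """Return cited PMIDs that were never actually retrieved (ideally empty)."""
    cited = set(PMID_RE.findall(report))
    return sorted(cited - retrieved_pmids)

# Any non-empty result blocks delivery: the draft is sent back to the agent
# (or flagged for the analyst) with the offending citations listed.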
Results
Quantitative Evaluation
The team evaluated the agent on 50 research requests previously completed by analysts:
| Metric | Human Analyst | Agent | Agent + Review |
|---|---|---|---|
| Average time | 5.2 hours | 22 min | 1.4 hours |
| Papers found (recall) | Baseline | 87% | 94% |
| Citation accuracy | 98% | 100% | 100% |
| Report quality (1-5) | 4.1 | 3.6 | 4.3 |
| Cost per request | $390* | $4.80 | $105** |
* Based on analyst salary and overhead. ** Includes 1 hour of analyst review time.
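The recall column treats the paper set found by the human analyst as the reference for each request (hence "Baseline"); a one-function sketch of that per-request computation, with illustrative names:

def paper_recall(agent_pmids: set[str], analyst_pmids: set[str]) -> float:
    """Fraction of the analyst's reference papers that the agent also found."""
    if not analyst_pmids:
        return 1.0
    return len(agent_pmids & analyst_pmids) / len(analyst_pmids)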
The "Agent + Review" workflow, where an analyst reviews and edits the agent's output, proved optimal: it was significantly faster and cheaper than pure human work, and the quality slightly exceeded the human-only baseline because the agent's comprehensive search uncovered papers that analysts occasionally missed.
Qualitative Findings
- Coverage advantage: The agent consistently searched with more query variations than human analysts, occasionally finding relevant papers that analysts missed.
- Consistency: Unlike human analysts whose quality varied with experience, the agent produced consistently structured reports.
- Limitation on nuance: The agent sometimes missed subtle methodological flaws that experienced analysts would catch. Human review remained essential.
- Speed-quality tradeoff: Allowing the agent more iterations (up to 40 steps) improved quality but increased cost and time. The team settled on a 25-step limit as the sweet spot.
Lessons Learned
- Citation verification is non-negotiable. The single most impactful design decision was requiring every citation to be verified against actually retrieved papers. This eliminated hallucinated references entirely.
- Structured scratchpads outperform free-form context. Switching from free-form context accumulation to a structured JSON scratchpad improved the agent's ability to synthesize across papers by 35%, as measured by analyst quality ratings.
- Query diversity matters more than query count. Five diverse search queries (different terminology, MeSH headings, author-based searches) achieved higher paper recall than ten similar queries.
- Human-in-the-loop is a feature, not a limitation. The agent-plus-analyst workflow produced better results than either alone: the agent handles the mechanical work, and the analyst provides domain judgment.
- Cost monitoring is essential from day one. Without monitoring, a single complex research request could trigger 50+ LLM calls costing $20 or more. Setting step limits and caching repeated searches kept costs predictable (see the sketch after this list).
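A sketch of the guardrails behind the last lesson: a hard step cap, a per-request cost budget, and a cache for repeated searches. The budget, price, and class names are illustrative assumptions, and cached_search reuses the search_pubmed sketch shown earlier.

# Illustrative cost guardrails: step cap, per-request budget, and search cache.
from functools import lru_cache

MAX_STEPS = 25          # hard cap on ReAct iterations per request
MAX_COST_USD = 10.00    # assumed per-request budget

class BudgetExceeded(RuntimeError):
    """Raised when a single research request blows past its cost budget."""

class CostTracker:
    def __init__(self, price_per_1k_tokens: float = 0.01):  # assumed blended price
        self.price_per_1k_tokens = price_per_1k_tokens
        self.spent = 0.0

    def record(self, tokens_used: int) -> None:
        self.spent += tokens_used / 1000 * self.price_per_1k_tokens
        if self.spent > MAX_COST_USD:
            raise BudgetExceeded(f"Request exceeded ${MAX_COST_USD:.2f} budget")

@lru_cache(maxsize=256)
def cached_search(query: str, max_results: int = 20) -> tuple[str, ...]:
    # Identical queries within a run are served from the cache instead of
    # triggering another PubMed call (search_pubmed is the earlier sketch).
    return tuple(search_pubmed(query, max_results))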
Architecture Diagram
User Request → Planner → Search Phase → Reading Phase → Synthesis Phase → Report
│ │ │ │
▼ ▼ ▼ ▼
Decompose PubMed API Fetch papers Write sections
into sub- Web search Extract data Cross-reference
questions Update Cite sources
scratchpad Quality check
│
▼
Analyst Review
│
▼
Final Report
Key Takeaways
- ReAct agents with structured tools can automate the mechanical aspects of literature research while maintaining rigorous citation standards.
- The combination of AI agent + human review produces higher quality than either alone.
- Structured working memory (scratchpad) is essential for multi-step research tasks that require synthesizing information from many sources.
- Citation hallucination is a critical risk that requires explicit verification mechanisms.
- Agent cost scales with task complexity; monitoring and limits are essential for production deployment.