Case Study 1: Building a Research Assistant Agent
Chapter 32: AI Agents and Tool Use
Overview
- Organization: NovaSci Analytics, a biotech consulting firm with 120 analysts.
- Challenge: Analysts spend 4-6 hours per research request, manually searching PubMed, parsing PDFs, cross-referencing findings, and writing summaries. The firm handles approximately 200 research requests per month, and the growing backlog threatens client satisfaction.
- Goal: Build an AI research assistant agent that reduces research time by 70% while maintaining the quality and accuracy standards required in biotech consulting.
Problem Analysis
NovaSci's research workflow involved several pain points:
- Manual search: Analysts manually constructed PubMed queries, often missing relevant papers due to suboptimal search terms.
- PDF processing: Downloaded papers had to be manually read and excerpted. A single request might involve 10-30 papers.
- Cross-referencing: Findings from different papers needed to be reconciled, with contradictions identified and resolved.
- Citation tracking: Every claim in the final report needed a proper citation, requiring meticulous manual tracking.
- Quality variance: Junior analysts sometimes missed key papers or misinterpreted findings, leading to inconsistent quality.
An internal audit revealed:
- Average time per research request: 5.2 hours.
- 15% of reports required revision due to missed papers or incorrect interpretations.
- Analysts spent 60% of their time on mechanical tasks (searching, reading, excerpting) rather than higher-order analysis.
System Architecture
Design Goals
The team established the following requirements:
| Requirement | Specification |
|---|---|
| Accuracy | All citations verifiable; zero hallucinated references |
| Coverage | Find at least 90% of papers a senior analyst would find |
| Speed | Complete initial research in under 30 minutes |
| Traceability | Every claim linked to a specific paper and passage |
| Human oversight | Analyst reviews and approves before client delivery |
Agent Architecture
The research assistant was designed as a ReAct agent with five tools:
┌─────────────────────────────────────────────────────┐
│ Research Assistant Agent │
│ │
│ System Prompt: You are a biotech research analyst. │
│ Follow the ReAct pattern to systematically │
│ research the given topic. │
│ │
│ Tools: │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ search_pubmed│ │ fetch_paper │ │ extract_data │ │
│ └─────────────┘ └─────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌───────────────┐ │
│ │ search_web │ │ write_section │ │
│ └──────────────┘ └───────────────┘ │
│ │
│ Memory: Working scratchpad + findings database │
└─────────────────────────────────────────────────────┘
Tool Descriptions:
- search_pubmed(query, max_results, date_range): Searches PubMed using the Entrez API. Returns paper titles, abstracts, PMIDs, and publication dates (a minimal wrapper sketch follows this list).
- fetch_paper(pmid): Retrieves the full text of a paper from PubMed Central (when available) or falls back to the abstract. Returns structured sections (introduction, methods, results, discussion).
- extract_data(text, extraction_prompt): Uses an LLM to extract specific information from a paper section (e.g., "Extract all reported efficacy percentages and their confidence intervals").
- search_web(query): Searches the web for supplementary information such as clinical trial registrations, FDA documents, or press releases.
- write_section(title, content, citations): Writes a section of the research report with proper citations.
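To make the first tool concrete, here is a minimal sketch of how search_pubmed might wrap the NCBI E-utilities esearch endpoint. The signature mirrors the description above, but the implementation is an assumption rather than NovaSci's production code; the real tool would additionally resolve titles, abstracts, and dates (for example via a follow-up esummary call).

# Hypothetical sketch of the search_pubmed tool as a thin wrapper around the
# public NCBI E-utilities "esearch" endpoint. Names and defaults are
# illustrative, not NovaSci's production implementation.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, max_results: int = 20,
                  date_range: str | None = None) -> list[str]:
    """Return PMIDs matching the query, optionally restricted to a year range."""
    params = {"db": "pubmed", "term": query,
              "retmax": max_results, "retmode": "json"}
    if date_range:                      # e.g. "2020-2024"
        start, end = date_range.split("-")
        params.update({"datetype": "pdat", "mindate": start, "maxdate": end})
    resp = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]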
Planning Strategy
The agent follows a structured research methodology:
Phase 1: Query Formulation
- Decompose the research question into sub-questions
- Generate multiple search queries per sub-question
- Identify key terms, synonyms, and MeSH headings
Phase 2: Literature Search
- Execute searches across PubMed and web sources
- Collect candidate papers (target: 30-50 per topic)
- Rank by relevance and recency
Phase 3: Deep Reading
- Fetch full text of top 15-20 papers
- Extract key findings, methods, and statistics
- Store in structured findings database
Phase 4: Synthesis
- Group findings by sub-question
- Identify consensus, contradictions, and gaps
- Draft report sections
Phase 5: Quality Check
- Verify all citations exist and are correctly attributed
- Check for logical consistency
- Flag areas of uncertainty for analyst review
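As an illustration of Phase 1, the sketch below shows one way the decomposition step could be implemented as a single structured LLM call. The llm callable, the prompt wording, and the JSON schema are assumptions for illustration, not the firm's actual prompt.

# Illustrative Phase 1 sketch: decompose the research question and generate
# diverse queries in one structured LLM call. `llm` is any text-in/text-out
# model client; the prompt and schema are assumptions.
import json

def formulate_queries(research_question: str, llm) -> dict:
    prompt = (
        "Decompose the research question below into 3-5 sub-questions. "
        "For each sub-question, propose several PubMed queries using synonyms "
        "and MeSH headings. Respond with JSON of the form "
        '{"sub_questions": [{"question": "...", "queries": ["..."]}]}.\n\n'
        f"Research question: {research_question}"
    )
    return json.loads(llm(prompt))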
Implementation Details
The ReAct Loop in Practice
Here is an abbreviated trace from an actual research request: "Summarize recent evidence on GLP-1 receptor agonists for non-alcoholic fatty liver disease (NAFLD)."
Thought 1: I need to research GLP-1 receptor agonists (GLP-1 RAs) for NAFLD/NASH.
Let me start by searching PubMed with multiple queries to ensure comprehensive coverage.
I'll begin with the most specific query and broaden if needed.
Action 1: search_pubmed(
query="GLP-1 receptor agonist NAFLD NASH clinical trial",
max_results=20,
date_range="2020-2024"
)
Observation 1: Found 18 results including:
- PMID:38234561 "Semaglutide for NASH: Phase 3 Results" (2024)
- PMID:37891234 "Liraglutide reduces hepatic steatosis..." (2023)
- PMID:37456789 "Meta-analysis of GLP-1 RAs in NAFLD" (2023)
... [15 more results]
Thought 2: Good initial results. I have a meta-analysis (PMID:37456789) which will
be very valuable. Let me also search with broader terms to catch papers that might
use different terminology (e.g., MASLD, the new nomenclature).
Action 2: search_pubmed(
query="GLP-1 agonist MASLD steatotic liver disease",
max_results=20,
date_range="2022-2024"
)
Observation 2: Found 12 results, 7 new (not in previous results):
- PMID:38567890 "MASLD treatment with incretin-based therapies" (2024)
- PMID:38123456 "Tirzepatide dual agonist liver outcomes" (2024)
... [5 more new results]
Thought 3: Good, I caught papers using the new MASLD nomenclature. Now let me
fetch the meta-analysis first as it will give me the best overview of the evidence.
Action 3: fetch_paper(pmid="37456789")
Observation 3: Full text retrieved. Key findings:
- 14 RCTs included (n=2,345 patients)
- GLP-1 RAs significantly reduced hepatic steatosis (MD: -3.92%, 95% CI: -5.21 to -2.63)
- Histological improvement in NASH: OR 3.45 (95% CI: 2.12-5.62)
- Significant reduction in ALT levels: MD -12.3 IU/L (95% CI: -16.8 to -7.8)
...
Thought 4: The meta-analysis provides strong quantitative evidence. Now I should
read the Phase 3 semaglutide trial for the most recent high-quality evidence,
and also the tirzepatide paper since dual agonists are emerging.
...
The full trace continued for 23 steps over approximately 18 minutes, ultimately producing a 12-page research report with 24 citations.
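The loop driving a trace like this can be sketched in a few lines. Everything here is illustrative: the action format (JSON arguments) is simplified relative to the keyword-style calls shown above, and the real implementation adds scratchpad updates, retries, and token accounting.

# Minimal ReAct driver sketch (not NovaSci's production code). `tools` maps
# tool names to Python callables; `llm` is a text-in/text-out model client.
import json
import re

SYSTEM_PROMPT = (
    "You are a biotech research analyst. Follow the ReAct pattern: emit a "
    "Thought, then either an Action as `Action: tool_name({...json args...})` "
    "or a final `Answer:` block.\n\n"
)
ACTION_RE = re.compile(r"Action:\s*(\w+)\((\{.*?\})\)", re.DOTALL)
MAX_STEPS = 25  # the step limit the team eventually settled on

def run_react(task: str, tools: dict, llm) -> str:
    transcript = f"Task: {task}\n"
    for step in range(1, MAX_STEPS + 1):
        response = llm(SYSTEM_PROMPT + transcript)   # next Thought + Action/Answer
        transcript += response
        match = ACTION_RE.search(response)
        if match is None:                            # no tool call => final answer
            return response.split("Answer:", 1)[-1].strip()
        name, args = match.group(1), json.loads(match.group(2))
        try:
            observation = tools[name](**args)        # execute the chosen tool
        except Exception as exc:                     # surface failures as observations
            observation = f"ERROR: {exc}"
        transcript += f"\nObservation {step}: {observation}\n"
    return "Step limit reached; returning partial findings for analyst review."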
Working Memory Design
The agent maintained a structured scratchpad throughout the research process:
{
"research_question": "GLP-1 RAs for NAFLD/NASH",
"sub_questions": [
"What is the clinical efficacy of GLP-1 RAs for NAFLD?",
"Which GLP-1 RAs have the strongest evidence?",
"What are the mechanisms of action in liver disease?",
"What are the safety considerations?"
],
"papers_reviewed": 19,
"key_findings": [
{
"finding": "GLP-1 RAs reduce hepatic steatosis by ~4%",
"source": "PMID:37456789",
"confidence": "high (meta-analysis of 14 RCTs)"
},
{
"finding": "Semaglutide 0.4mg achieved NASH resolution in 59% vs 17% placebo",
"source": "PMID:38234561",
"confidence": "high (Phase 3 RCT)"
}
],
"contradictions": [
"Fibrosis improvement: mixed results across studies"
],
"gaps": [
"Limited long-term data beyond 72 weeks",
"No head-to-head trials between different GLP-1 RAs"
]
}
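In code, the scratchpad can be a small typed structure that the agent updates after each observation and re-serializes into its context. The sketch below is one plausible shape, with fields mirroring the JSON above; the exact schema and helper methods are assumptions.

# Sketch of a typed scratchpad mirroring the JSON above; names and methods
# are illustrative, not the team's exact schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Finding:
    finding: str
    source: str        # PMID the claim came from
    confidence: str    # e.g. "high (meta-analysis of 14 RCTs)"

@dataclass
class Scratchpad:
    research_question: str
    sub_questions: list[str] = field(default_factory=list)
    papers_reviewed: int = 0
    key_findings: list[Finding] = field(default_factory=list)
    contradictions: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)

    def add_finding(self, finding: str, source: str, confidence: str) -> None:
        self.key_findings.append(Finding(finding, source, confidence))

    def to_prompt(self) -> str:
        """Serialize for injection into the agent's context window."""
        return json.dumps(asdict(self), indent=2)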
Error Handling
The agent encountered several errors during development and testing:
Problem 1: Full text unavailable. Many papers are behind paywalls. When fetch_paper failed to retrieve full text, the agent fell back to analyzing the abstract and searching for the paper on preprint servers.
Problem 2: Contradictory findings. Different papers reported different efficacy numbers. The agent was instructed to prioritize meta-analyses over individual studies and to explicitly flag contradictions in the report.
Problem 3: Hallucinated citations. In early testing, the agent sometimes generated plausible-sounding but non-existent paper references. The solution was a citation verification step: every PMID cited in the report was verified against the actual search results and fetched papers stored in the scratchpad.
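The fix for Problem 3 reduces to a set comparison: every PMID cited in the draft must appear among the PMIDs actually returned by searches or fetched during the run. A minimal sketch, with an illustrative regex and function name:

# Citation verification sketch: cited PMIDs must be a subset of retrieved PMIDs.
import re

PMID_RE = re.compile(r"PMID:\s*(\d+)")

def verify_citations(report: str, retrieved_pmids: set[str]) -> list[str]:
    """Return cited PMIDs that were never actually retrieved (ideally empty)."""
    cited = set(PMID_RE.findall(report))
    return sorted(cited - retrieved_pmids)

# Any non-empty result blocks delivery: the draft is sent back to the agent
# (or flagged for the analyst) with the offending citations listed.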
Results
Quantitative Evaluation
The team evaluated the agent on 50 research requests previously completed by analysts:
| Metric | Human Analyst | Agent | Agent + Review |
|---|---|---|---|
| Average time | 5.2 hours | 22 min | 1.4 hours |
| Papers found (recall) | Baseline | 87% | 94% |
| Citation accuracy | 98% | 100% | 100% |
| Report quality (1-5) | 4.1 | 3.6 | 4.3 |
| Cost per request | $390* | $4.80 | $105** |
* Based on analyst salary and overhead. ** Includes 1 hour of analyst review time.
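The recall column treats the paper set found by the human analyst as the reference for each request (hence "Baseline"); a one-function sketch of that per-request computation, with illustrative names:

def paper_recall(agent_pmids: set[str], analyst_pmids: set[str]) -> float:
    """Fraction of the analyst's reference papers that the agent also found."""
    if not analyst_pmids:
        return 1.0
    return len(agent_pmids & analyst_pmids) / len(analyst_pmids)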
The "Agent + Review" workflow, where an analyst reviews and edits the agent's output, proved optimal: it was significantly faster and cheaper than pure human work, and the quality slightly exceeded the human-only baseline because the agent's comprehensive search uncovered papers that analysts occasionally missed.
Qualitative Findings
- Coverage advantage: The agent consistently searched with more query variations than human analysts, occasionally finding relevant papers that analysts missed.
- Consistency: Unlike human analysts whose quality varied with experience, the agent produced consistently structured reports.
- Limitation on nuance: The agent sometimes missed subtle methodological flaws that experienced analysts would catch. Human review remained essential.
- Speed-quality tradeoff: Allowing the agent more iterations (up to 40 steps) improved quality but increased cost and time. The team settled on a 25-step limit as the sweet spot.
Lessons Learned
- Citation verification is non-negotiable. The single most impactful design decision was requiring every citation to be verified against actually retrieved papers. This eliminated hallucinated references entirely.
- Structured scratchpads outperform free-form context. Switching from free-form context accumulation to a structured JSON scratchpad improved the agent's ability to synthesize across papers by 35%, as measured by analyst quality ratings.
- Query diversity matters more than query count. Five diverse search queries (different terminology, MeSH headings, author-based searches) achieved higher paper recall than ten similar queries.
- Human-in-the-loop is a feature, not a limitation. The agent-plus-analyst workflow produced better results than either alone: the agent handles the mechanical work, and the analyst provides domain judgment.
- Cost monitoring is essential from day one. Without monitoring, a single complex research request could trigger 50+ LLM calls costing $20 or more. Setting step limits and caching repeated searches kept costs predictable (see the sketch after this list).
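A sketch of the guardrails behind the last lesson: a hard step cap, a per-request cost budget, and a cache for repeated searches. The budget, price, and class names are illustrative assumptions, and cached_search reuses the search_pubmed sketch shown earlier.

# Illustrative cost guardrails: step cap, per-request budget, and search cache.
from functools import lru_cache

MAX_STEPS = 25          # hard cap on ReAct iterations per request
MAX_COST_USD = 10.00    # assumed per-request budget

class BudgetExceeded(RuntimeError):
    """Raised when a single research request blows past its cost budget."""

class CostTracker:
    def __init__(self, price_per_1k_tokens: float = 0.01):  # assumed blended price
        self.price_per_1k_tokens = price_per_1k_tokens
        self.spent = 0.0

    def record(self, tokens_used: int) -> None:
        self.spent += tokens_used / 1000 * self.price_per_1k_tokens
        if self.spent > MAX_COST_USD:
            raise BudgetExceeded(f"Request exceeded ${MAX_COST_USD:.2f} budget")

@lru_cache(maxsize=256)
def cached_search(query: str, max_results: int = 20) -> tuple[str, ...]:
    # Identical queries within a run are served from the cache instead of
    # triggering another PubMed call (search_pubmed is the earlier sketch).
    return tuple(search_pubmed(query, max_results))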
Architecture Diagram
User Request → Planner → Search Phase → Reading Phase → Synthesis Phase → Report
│ │ │ │
▼ ▼ ▼ ▼
Decompose PubMed API Fetch papers Write sections
into sub- Web search Extract data Cross-reference
questions Update Cite sources
scratchpad Quality check
│
▼
Analyst Review
│
▼
Final Report
Key Takeaways
- ReAct agents with structured tools can automate the mechanical aspects of literature research while maintaining rigorous citation standards.
- The combination of AI agent + human review produces higher quality than either alone.
- Structured working memory (scratchpad) is essential for multi-step research tasks that require synthesizing information from many sources.
- Citation hallucination is a critical risk that requires explicit verification mechanisms.
- Agent cost scales with task complexity; monitoring and limits are essential for production deployment.