Case Study 2: Evaluating and Improving RAG Quality
Chapter 31: Retrieval-Augmented Generation (RAG)
Overview
- Company: MedAssist Health, a digital health startup providing AI-powered clinical decision support to 200 physicians.
- Challenge: The RAG system powering their clinical reference tool exhibited inconsistent answer quality: high accuracy for drug interaction queries but poor performance for complex diagnostic reasoning and treatment guideline synthesis.
- Goal: Systematically evaluate RAG quality, identify failure modes, and implement targeted improvements to achieve 90%+ faithfulness and 85%+ answer relevance across all query types.
Problem Analysis
MedAssist's initial RAG system was built rapidly during a prototype phase. While physicians found it useful for simple lookups, several quality issues emerged after three months of production use:
- Hallucinated dosages: In 4% of responses, the system generated medication dosages not present in retrieved documents.
- Incomplete answers: Complex queries requiring synthesis across multiple guidelines received partial answers 23% of the time.
- Irrelevant retrieval: 18% of retrieved chunks were topically related but did not contain the information needed to answer the query.
- Contradictory sources: When retrieved chunks contained conflicting information (e.g., different editions of guidelines), the system did not flag the contradiction.
A manual audit of 500 queries revealed the following quality distribution:
| Quality Level | Definition | Percentage |
|---|---|---|
| Excellent | Correct, complete, well-cited | 52% |
| Acceptable | Correct but incomplete or poorly cited | 26% |
| Poor | Partially incorrect or missing key information | 15% |
| Dangerous | Contains hallucinated clinical information | 7% |
The 7% dangerous response rate was unacceptable for a clinical tool.
Evaluation Framework
RAGAS-Inspired Metrics
The team implemented a comprehensive evaluation framework measuring four dimensions (a computational sketch follows the definitions):
- Faithfulness: Does the answer contain only information supported by retrieved context?
$$\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total number of claims in answer}}$$
- Answer Relevance: Does the answer address the question asked?
$$\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \cos\big(\text{emb}(q), \text{emb}(q_i)\big)$$
where $q_i$ are $N$ synthetic questions generated from the answer and $q$ is the original query.
- Context Precision: Are the retrieved chunks relevant to the query?
$$\text{Context Precision} = \frac{1}{K} \sum_{k=1}^{K} \frac{\text{relevant chunks at rank} \leq k}{k} \cdot \mathbb{1}[\text{chunk}_k \text{ is relevant}]$$
- Context Recall: Does the retrieved context contain all information needed to answer?
$$\text{Context Recall} = \frac{\text{Number of ground truth claims covered by context}}{\text{Total ground truth claims}}$$
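A minimal sketch of how three of these metrics can be scored, assuming hypothetical helpers `extract_claims`, `is_supported`, `generate_questions`, and `embed` (e.g., LLM- or NLI-backed); these names are illustrative and not from the MedAssist codebase:

```python
import numpy as np

def faithfulness(answer: str, contexts: list[str], extract_claims, is_supported) -> float:
    """Fraction of answer claims supported by the retrieved context."""
    claims = extract_claims(answer)            # e.g., an LLM prompt that lists atomic claims
    if not claims:
        return 1.0
    context = "\n".join(contexts)
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)

def answer_relevance(query: str, answer: str, generate_questions, embed, n: int = 3) -> float:
    """Mean cosine similarity between the query and questions generated from the answer."""
    synthetic = generate_questions(answer, n=n)  # questions the answer would respond to
    q_vec = embed(query)
    sims = []
    for q_i in synthetic:
        v = embed(q_i)
        sims.append(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v)))
    return float(np.mean(sims))

def context_precision(relevance_flags: list[bool]) -> float:
    """Rank-weighted precision over the top-K retrieved chunks (ranks are 1-indexed)."""
    k_total = len(relevance_flags)
    score = 0.0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            score += sum(relevance_flags[:k]) / k
    return score / k_total if k_total else 0.0
```

Context recall follows the same claim-counting pattern as faithfulness, with ground-truth claims checked against the retrieved context instead of answer claims.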
Evaluation Dataset
The team curated 300 query-answer pairs across three difficulty tiers (an illustrative record format follows the table):
| Tier | Description | Count | Example |
|---|---|---|---|
| Simple | Single-document lookup | 120 | "What is the recommended dose of metformin for type 2 diabetes?" |
| Moderate | Multi-document synthesis | 120 | "Compare first-line treatments for hypertension in patients with diabetes." |
| Complex | Reasoning over guidelines | 60 | "A 65-year-old patient with CKD stage 3 and atrial fibrillation needs anticoagulation. What are the options and contraindications?" |
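One plausible record format for such an evaluation set; field names and values are illustrative, not taken from the MedAssist dataset:

```python
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    query: str                      # physician-style question
    reference_answer: str           # ground-truth answer written by clinical reviewers
    tier: str                       # "simple" | "moderate" | "complex"
    source_docs: list[str] = field(default_factory=list)  # guideline IDs expected in retrieval

record = EvalRecord(
    query="What is the recommended dose of metformin for type 2 diabetes?",
    reference_answer="<reviewer-written ground truth>",
    tier="simple",
    source_docs=["<guideline-id>"],
)
```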
Baseline Results
| Metric | Simple | Moderate | Complex | Overall |
|---|---|---|---|---|
| Faithfulness | 0.91 | 0.78 | 0.62 | 0.80 |
| Answer Relevance | 0.88 | 0.74 | 0.61 | 0.77 |
| Context Precision | 0.82 | 0.65 | 0.51 | 0.69 |
| Context Recall | 0.85 | 0.68 | 0.48 | 0.70 |
The results confirmed the pattern: good performance on simple queries but significant degradation on moderate and complex queries.
Improvement Iterations
Iteration 1: Improved Chunking
Problem identified: The original chunking strategy used fixed 512-token chunks, which frequently split clinical guidelines mid-paragraph, separating dosage information from its associated conditions.
Solution: Recursive chunking with section-aware splitting. Chunks respect heading boundaries and include the section hierarchy as metadata.
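A simplified sketch of section-aware splitting, assuming guidelines are stored as markdown and using a crude whitespace token count; a production implementation would use the model's tokenizer and richer document structure:

```python
import re

def count_tokens(text: str) -> int:
    # Crude whitespace token count; a real system would use the model's tokenizer.
    return len(text.split())

def section_aware_chunks(markdown_doc: str, max_tokens: int = 512) -> list[dict]:
    """Split a guideline at heading boundaries, then split oversized sections at
    paragraph breaks, keeping the section path as metadata so dosage text stays
    attached to its clinical context."""
    # Split into alternating (heading, body) segments; text before the first heading is dropped.
    parts = re.split(r"(?m)^(#{1,4} .+)$", markdown_doc)
    chunks, path = [], []
    for i in range(1, len(parts), 2):
        heading, body = parts[i].strip(), parts[i + 1].strip()
        level = heading.count("#", 0, heading.find(" "))
        path = path[: level - 1] + [heading.lstrip("# ").strip()]
        pieces, current = [], ""
        for para in body.split("\n\n"):
            if current and count_tokens(current + para) > max_tokens:
                pieces.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            pieces.append(current.strip())
        for text in pieces:
            chunks.append({"text": text, "section_path": " > ".join(path)})
    return chunks
```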
| Metric | Before | After | Delta |
|---|---|---|---|
| Context Precision | 0.69 | 0.76 | +0.07 |
| Context Recall | 0.70 | 0.77 | +0.07 |
Iteration 2: Hybrid Search with Reranking
Problem identified: Dense retrieval missed queries containing specific drug names, ICD codes, and medical abbreviations that required exact lexical matching.
Solution: Hybrid search combining dense retrieval (weight 0.7) with BM25 keyword search (weight 0.3), followed by a cross-encoder reranker.
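A sketch of the fusion-plus-rerank step, assuming per-chunk `dense_scores` and `bm25_scores` have already been computed and that `cross_encoder` exposes a `predict` method over (query, chunk) pairs, as a sentence-transformers CrossEncoder does; the function names are illustrative:

```python
import numpy as np

def minmax(scores) -> np.ndarray:
    """Scale scores to [0, 1] so dense and BM25 scores are comparable before blending."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def hybrid_search(query, chunks, dense_scores, bm25_scores, cross_encoder,
                  alpha: float = 0.7, top_k: int = 20, final_k: int = 5):
    """Blend dense and BM25 scores (0.7 / 0.3 by default), then rerank the
    top candidates with a cross-encoder that scores (query, chunk) pairs jointly."""
    fused = alpha * minmax(dense_scores) + (1 - alpha) * minmax(bm25_scores)
    candidates = np.argsort(fused)[::-1][:top_k]
    pairs = [(query, chunks[i]) for i in candidates]
    rerank_scores = cross_encoder.predict(pairs)
    order = np.argsort(rerank_scores)[::-1][:final_k]
    return [chunks[candidates[i]] for i in order]
```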
| Metric | Before | After | Delta |
|---|---|---|---|
| Context Precision | 0.76 | 0.84 | +0.08 |
| Context Recall | 0.77 | 0.83 | +0.06 |
Iteration 3: Query Decomposition
Problem identified: Complex queries required information from multiple clinical topics (e.g., drug interactions + renal dosing + contraindications). A single embedding could not capture all facets.
Solution: For queries classified as complex, decompose into sub-queries using an LLM, retrieve for each sub-query independently, and merge results with deduplication.
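A sketch of the decomposition flow, where `classify`, `decompose`, and `retrieve` stand in for the query classifier, the LLM decomposition prompt, and the hybrid retriever from Iteration 2; all are assumed interfaces, not the production code:

```python
def retrieve_for_query(query: str, classify, decompose, retrieve, dedupe_key=None):
    """Decompose complex queries into sub-queries, retrieve for each sub-query
    independently, and merge the results with deduplication."""
    if classify(query) != "complex":
        return retrieve(query)
    sub_queries = decompose(query)       # e.g., an LLM prompt returning 2-4 focused sub-queries
    seen, merged = set(), []
    for sub_query in sub_queries:
        for chunk in retrieve(sub_query):
            # Chunks are assumed to be dicts with an "id" field unless a dedupe_key is supplied.
            key = dedupe_key(chunk) if dedupe_key else chunk["id"]
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged
```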
| Metric | Simple | Moderate | Complex | Overall |
|---|---|---|---|---|
| Context Recall | 0.86 | 0.84 | 0.72 | 0.82 |
Iteration 4: Faithfulness Guardrails
Problem identified: Even with better retrieval, the LLM occasionally generated unsupported claims, especially about dosages and contraindications.
Solution: A post-generation verification step that extracts factual claims from the response and checks each against the retrieved context using NLI (natural language inference). Unsupported claims are flagged or removed.
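A sketch of the guardrail, assuming an `extract_claims` helper (e.g., an LLM prompt that lists atomic factual claims) and an `nli_entails` function returning the entailment probability from an off-the-shelf NLI model; both are assumptions, not MedAssist's actual interfaces:

```python
def verify_response(answer: str, contexts: list[str],
                    extract_claims, nli_entails, threshold: float = 0.8):
    """Post-generation guardrail: check each extracted claim against the retrieved
    context with an NLI model and separate supported from unsupported claims."""
    premise = "\n".join(contexts)
    supported, flagged = [], []
    for claim in extract_claims(answer):
        # nli_entails returns P(entailment) for the (premise, hypothesis) pair.
        if nli_entails(premise, claim) >= threshold:
            supported.append(claim)
        else:
            flagged.append(claim)
    return supported, flagged
```

Flagged claims can then be removed from the response or surfaced with an explicit warning, matching the flag-or-remove behavior described above.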
| Metric | Before | After | Delta |
|---|---|---|---|
| Faithfulness | 0.80 | 0.93 | +0.13 |
| Dangerous responses | 7% | 0.7% | -6.3pp |
Final Results
| Metric | Baseline | Final | Target | Met? |
|---|---|---|---|---|
| Faithfulness | 0.80 | 0.93 | 0.90 | Yes |
| Answer Relevance | 0.77 | 0.88 | 0.85 | Yes |
| Context Precision | 0.69 | 0.84 | 0.80 | Yes |
| Context Recall | 0.70 | 0.82 | 0.80 | Yes |
| Dangerous responses | 7.0% | 0.7% | < 1% | Yes |
Latency Impact
| Component | Latency (p50) | Latency (p95) |
|---|---|---|
| Query decomposition | 180 ms | 350 ms |
| Hybrid retrieval | 45 ms | 120 ms |
| Reranking | 85 ms | 160 ms |
| LLM generation | 1,200 ms | 2,800 ms |
| Faithfulness check | 250 ms | 500 ms |
| Total | 1,760 ms | 3,930 ms |
The additional latency from query decomposition and faithfulness checking was acceptable for the clinical use case, where correctness outweighs speed.
Key Lessons
- Evaluation before optimization. Without a structured evaluation framework, the team had been making changes based on anecdotal feedback. The RAGAS-inspired metrics provided objective guidance on where to focus effort.
- Chunking matters more than model choice. Switching from fixed to section-aware chunking produced a larger quality improvement than upgrading the embedding model from MiniLM to BGE-large.
- Hybrid search is essential for technical domains. Medical terminology, drug names, and codes require exact lexical matching that pure dense retrieval misses. The 70/30 dense/BM25 split was found through grid search.
- Query decomposition unlocks complex reasoning. The single largest improvement on complex queries came from breaking them into sub-queries. This is equivalent to giving the retriever multiple "searches" for a single user question.
- Post-generation verification is a safety requirement. For high-stakes domains, relying solely on the LLM to be faithful is insufficient. An independent verification step using NLI reduced dangerous responses by 90%.
- Evaluation is continuous, not one-time. The team established a weekly evaluation pipeline on a rotating subset of queries, catching regressions early as the document corpus evolved.
Code Reference
The complete implementation is available in code/case-study-code.py.