Case Study 2: Evaluating and Improving RAG Quality

Chapter 31: Retrieval-Augmented Generation (RAG)


Overview

Company: MedAssist Health, a digital health startup providing AI-powered clinical decision support to 200 physicians.

Challenge: The RAG system powering their clinical reference tool exhibited inconsistent answer quality: high accuracy for drug interaction queries but poor performance for complex diagnostic reasoning and treatment guideline synthesis.

Goal: Systematically evaluate RAG quality, identify failure modes, and implement targeted improvements to achieve 90%+ faithfulness and 85%+ answer relevance across all query types.


Problem Analysis

MedAssist's initial RAG system was built rapidly during a prototype phase. While physicians found it useful for simple lookups, several quality issues emerged after three months of production use:

  1. Hallucinated dosages: In 4% of responses, the system generated medication dosages not present in retrieved documents.
  2. Incomplete answers: Complex queries requiring synthesis across multiple guidelines received partial answers 23% of the time.
  3. Irrelevant retrieval: 18% of retrieved chunks were topically related but did not contain the information needed to answer the query.
  4. Contradictory sources: When retrieved chunks contained conflicting information (e.g., different editions of guidelines), the system did not flag the contradiction.

A manual audit of 500 queries revealed the following quality distribution:

| Quality Level | Definition | Percentage |
| --- | --- | --- |
| Excellent | Correct, complete, well-cited | 52% |
| Acceptable | Correct but incomplete or poorly cited | 26% |
| Poor | Partially incorrect or missing key information | 15% |
| Dangerous | Contains hallucinated clinical information | 7% |

The 7% dangerous response rate was unacceptable for a clinical tool.


Evaluation Framework

RAGAS-Inspired Metrics

The team implemented a comprehensive evaluation framework measuring four dimensions (a scoring sketch follows the definitions below):

  1. Faithfulness: Does the answer contain only information supported by retrieved context?

$$\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total number of claims in answer}}$$

  2. Answer Relevance: Does the answer address the question asked?

$$\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \cos\big(\text{emb}(q), \text{emb}(q_i)\big)$$

where $q_1, \dots, q_N$ are synthetic questions generated from the answer and $q$ is the original query.

  3. Context Precision: Are the retrieved chunks relevant to the query?

$$\text{Context Precision} = \frac{1}{K} \sum_{k=1}^{K} \frac{\text{relevant chunks at rank} \leq k}{k} \cdot \mathbb{1}[\text{chunk}_k \text{ is relevant}]$$

  4. Context Recall: Does the retrieved context contain all information needed to answer?

$$\text{Context Recall} = \frac{\text{Number of ground truth claims covered by context}}{\text{Total ground truth claims}}$$
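
A minimal scoring sketch for these four metrics, assuming that claim-level support judgments, chunk relevance labels, and question embeddings have already been produced upstream (by an LLM judge, annotators, or an embedding model). The function names are illustrative; this is not the RAGAS library API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def answer_relevance(query_emb, synthetic_question_embs) -> float:
    """Mean cosine similarity between the query and questions generated from the answer."""
    sims = [cosine(query_emb, e) for e in synthetic_question_embs]
    return sum(sims) / len(sims) if sims else 0.0

def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Rank-weighted precision over the K retrieved chunks, as defined above."""
    k_total = len(chunk_is_relevant)
    score, relevant_so_far = 0.0, 0
    for k, rel in enumerate(chunk_is_relevant, start=1):
        if rel:
            relevant_so_far += 1
            score += relevant_so_far / k
    return score / k_total if k_total else 0.0

def context_recall(ground_truth_claim_covered: list[bool]) -> float:
    """Fraction of ground-truth claims that appear in the retrieved context."""
    covered = ground_truth_claim_covered
    return sum(covered) / len(covered) if covered else 0.0

# Hand-labeled judgments for a single query:
print(faithfulness([True, True, False]))          # 0.67
print(context_precision([True, False, True]))     # (1/1 + 2/3) / 3 ≈ 0.56
print(context_recall([True, True, True, False]))  # 0.75
```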

Evaluation Dataset

The team curated 300 query-answer pairs across three difficulty tiers:

| Tier | Description | Count | Example |
| --- | --- | --- | --- |
| Simple | Single-document lookup | 120 | "What is the recommended dose of metformin for type 2 diabetes?" |
| Moderate | Multi-document synthesis | 120 | "Compare first-line treatments for hypertension in patients with diabetes." |
| Complex | Reasoning over guidelines | 60 | "A 65-year-old patient with CKD stage 3 and atrial fibrillation needs anticoagulation. What are the options and contraindications?" |
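
One plausible record layout for these curated pairs, shown as a sketch; the field names and example values are illustrative rather than MedAssist's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    """One curated query-answer pair in the evaluation set (illustrative schema)."""
    query: str
    ground_truth_answer: str
    ground_truth_claims: list[str]          # atomic claims used for context recall
    tier: str                               # "simple" | "moderate" | "complex"
    source_doc_ids: list[str] = field(default_factory=list)

record = EvalRecord(
    query="What is the recommended dose of metformin for type 2 diabetes?",
    ground_truth_answer="<answer text taken from the source guideline>",
    ground_truth_claims=["<claim 1>", "<claim 2>"],
    tier="simple",
    source_doc_ids=["<guideline document id>"],
)
```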

Baseline Results

| Metric | Simple | Moderate | Complex | Overall |
| --- | --- | --- | --- | --- |
| Faithfulness | 0.91 | 0.78 | 0.62 | 0.80 |
| Answer Relevance | 0.88 | 0.74 | 0.61 | 0.77 |
| Context Precision | 0.82 | 0.65 | 0.51 | 0.69 |
| Context Recall | 0.85 | 0.68 | 0.48 | 0.70 |

The results confirmed the pattern: good performance on simple queries but significant degradation on moderate and complex queries.


Improvement Iterations

Iteration 1: Improved Chunking

Problem identified: The original chunking strategy used fixed 512-token chunks, which frequently split clinical guidelines mid-paragraph, separating dosage information from its associated conditions.

Solution: Recursive chunking with section-aware splitting. Chunks respect heading boundaries and include the section hierarchy as metadata.

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Context Precision | 0.69 | 0.76 | +0.07 |
| Context Recall | 0.70 | 0.77 | +0.07 |
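
A minimal sketch of the section-aware chunking idea, assuming markdown-style headings in the source documents; the whitespace token approximation and metadata fields are illustrative, not the team's actual splitter.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")  # markdown-style headings (assumption)

def section_aware_chunks(text: str, max_tokens: int = 512):
    """Split on heading boundaries, then pack lines into chunks that never
    cross a section boundary; each chunk carries its section hierarchy."""
    chunks, hierarchy, buffer = [], [], []

    def flush():
        if buffer:
            chunks.append({"section": " > ".join(hierarchy), "text": "\n".join(buffer)})
            buffer.clear()

    for line in text.splitlines():
        m = HEADING.match(line)
        if m:                                   # new section: close the current chunk
            flush()
            level = len(m.group(1))
            hierarchy[:] = hierarchy[: level - 1] + [m.group(2).strip()]
        else:
            if len(" ".join(buffer + [line]).split()) > max_tokens:
                flush()                         # chunk is full; start a new one
            buffer.append(line)
    flush()
    return chunks
```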

Iteration 2: Hybrid Search with Reranking

Problem identified: Dense retrieval missed queries containing specific drug names, ICD codes, and medical abbreviations that required exact lexical matching.

Solution: Hybrid search combining dense retrieval (weight 0.7) with BM25 keyword search (weight 0.3), followed by a cross-encoder reranker.

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Context Precision | 0.76 | 0.84 | +0.08 |
| Context Recall | 0.77 | 0.83 | +0.06 |
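
A sketch of the fusion and reranking step with the 0.7/0.3 weighting above. It assumes per-chunk dense and BM25 scores are already computed and takes the reranker as a hypothetical `cross_encoder_score(query, chunk)` callable.

```python
def min_max(scores):
    """Normalize scores to [0, 1] so dense and BM25 scales are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_rerank(query, chunks, dense_scores, bm25_scores,
                  cross_encoder_score, w_dense=0.7, w_bm25=0.3, top_k=5):
    """Fuse dense and lexical scores, shortlist the best candidates, then
    rerank the survivors with a cross-encoder."""
    d, b = min_max(dense_scores), min_max(bm25_scores)
    fused = sorted(
        zip(chunks, (w_dense * x + w_bm25 * y for x, y in zip(d, b))),
        key=lambda pair: pair[1], reverse=True,
    )[: top_k * 4]                              # shortlist before the expensive reranker
    reranked = sorted(
        ((chunk, cross_encoder_score(query, chunk)) for chunk, _ in fused),
        key=lambda pair: pair[1], reverse=True,
    )
    return [chunk for chunk, _ in reranked[:top_k]]
```

In practice the lexical scores could come from a library such as rank_bm25 or the search backend itself, and a sentence-transformers cross-encoder could fill the `cross_encoder_score` role.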

Iteration 3: Query Decomposition

Problem identified: Complex queries required information from multiple clinical topics (e.g., drug interactions + renal dosing + contraindications). A single embedding could not capture all facets.

Solution: For queries classified as complex, decompose into sub-queries using an LLM, retrieve for each sub-query independently, and merge results with deduplication.

| Metric | Simple | Moderate | Complex | Overall |
| --- | --- | --- | --- | --- |
| Context Recall | 0.86 | 0.84 | 0.72 | 0.82 |
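
A sketch of the decompose-retrieve-merge flow; `llm_complete` and `retrieve` are hypothetical callables standing in for the LLM client and the hybrid retriever, and the prompt wording is illustrative.

```python
DECOMPOSE_PROMPT = (
    "Break the following clinical question into 2-4 independent search queries, "
    "one per line, each covering a single topic:\n\n{question}"
)

def decompose_and_retrieve(question, llm_complete, retrieve, per_query_k=5):
    """Generate sub-queries with an LLM, retrieve for each one independently,
    and merge the results, de-duplicating by chunk id and keeping the best score."""
    raw = llm_complete(DECOMPOSE_PROMPT.format(question=question))
    sub_queries = [line.strip() for line in raw.splitlines() if line.strip()]

    merged = {}                               # chunk_id -> (chunk, best score)
    for sq in sub_queries or [question]:      # fall back to the original query
        for chunk_id, chunk, score in retrieve(sq, k=per_query_k):
            if chunk_id not in merged or score > merged[chunk_id][1]:
                merged[chunk_id] = (chunk, score)

    ranked = sorted(merged.values(), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked]
```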

Iteration 4: Faithfulness Guardrails

Problem identified: Even with better retrieval, the LLM occasionally generated unsupported claims, especially about dosages and contraindications.

Solution: A post-generation verification step that extracts factual claims from the response and checks each against the retrieved context using NLI (natural language inference). Unsupported claims are flagged or removed.

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Faithfulness | 0.80 | 0.93 | +0.13 |
| Dangerous responses | 7% | 0.7% | -6.3 pp |
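
A sketch of the post-generation guardrail, with a naive sentence-level claim splitter and a hypothetical `entailment_prob(premise, hypothesis)` NLI scorer (an off-the-shelf NLI model could fill that role); the 0.8 threshold is illustrative.

```python
import re

def split_claims(answer: str):
    """Naive claim extraction: one claim per sentence. A production system
    would use an LLM or a dedicated claim-extraction model."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def verify_answer(answer: str, context: str, entailment_prob, threshold=0.8):
    """Check each claim against the retrieved context with an NLI model;
    claims below the entailment threshold are flagged for removal or review."""
    supported, flagged = [], []
    for claim in split_claims(answer):
        if entailment_prob(context, claim) >= threshold:
            supported.append(claim)
        else:
            flagged.append(claim)
    total = len(supported) + len(flagged)
    return {
        "supported": supported,
        "flagged": flagged,
        "faithfulness": len(supported) / total if total else 1.0,
    }
```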

Final Results

| Metric | Baseline | Final | Target | Met? |
| --- | --- | --- | --- | --- |
| Faithfulness | 0.80 | 0.93 | 0.90 | Yes |
| Answer Relevance | 0.77 | 0.88 | 0.85 | Yes |
| Context Precision | 0.69 | 0.84 | 0.80 | Yes |
| Context Recall | 0.70 | 0.82 | 0.80 | Yes |
| Dangerous responses | 7.0% | 0.7% | < 1% | Yes |

Latency Impact

| Component | Latency (p50) | Latency (p95) |
| --- | --- | --- |
| Query decomposition | 180 ms | 350 ms |
| Hybrid retrieval | 45 ms | 120 ms |
| Reranking | 85 ms | 160 ms |
| LLM generation | 1,200 ms | 2,800 ms |
| Faithfulness check | 250 ms | 500 ms |
| Total | 1,760 ms | 3,930 ms |

The additional latency from query decomposition and faithfulness checking was acceptable for the clinical use case, where correctness outweighs speed.


Key Lessons

  1. Evaluation before optimization. Without a structured evaluation framework, the team had been making changes based on anecdotal feedback. The RAGAS-inspired metrics provided objective guidance on where to focus effort.

  2. Chunking matters more than model choice. Switching from fixed to section-aware chunking produced a larger quality improvement than upgrading the embedding model from MiniLM to BGE-large.

  3. Hybrid search is essential for technical domains. Medical terminology, drug names, and codes require exact lexical matching that pure dense retrieval misses. The 70/30 dense/BM25 split was found through grid search.

  4. Query decomposition unlocks complex reasoning. The single largest improvement on complex queries came from breaking them into sub-queries. This is equivalent to giving the retriever multiple "searches" for a single user question.

  5. Post-generation verification is a safety requirement. For high-stakes domains, relying solely on the LLM to be faithful is insufficient. An independent verification step using NLI reduced dangerous responses by 90%.

  6. Evaluation is continuous, not one-time. The team established a weekly evaluation pipeline on a rotating subset of queries, catching regressions early as the document corpus evolved.


Code Reference

The complete implementation is available in code/case-study-code.py.