Chapter 31: Retrieval-Augmented Generation (RAG)
Case Study 1: Building a RAG System for Technical Documentation
Overview
- Company: CloudScale Solutions, a mid-sized SaaS company with 500 engineers.
- Challenge: Engineers spend an average of 45 minutes per day searching internal documentation across Confluence, GitHub wikis, Notion pages, and README files. The information is scattered across 15,000+ documents totaling approximately 50 million tokens.
- Goal: Build a RAG-powered documentation assistant that reduces search time by 70% and provides accurate, cited answers.
Problem Analysis
CloudScale's documentation challenges were typical of growing engineering organizations:
- Fragmentation: Documentation lived in 4 different platforms with no unified search.
- Staleness: 30% of documents were outdated, leading to incorrect guidance.
- Discoverability: Engineers often did not know documentation existed for their question.
- Context loss: Search results returned entire pages; engineers had to read lengthy documents to find the relevant paragraph.
An initial survey revealed:
- 78% of engineers reported difficulty finding relevant documentation.
- 45% had followed outdated documentation that led to production issues.
- Average time from question to answer: 45 minutes (including asking colleagues).
System Architecture
Design Decisions
The team evaluated several approaches before settling on their architecture:
| Decision | Options Considered | Choice | Rationale |
|---|---|---|---|
| Embedding model | MiniLM, BGE, E5 | BAAI/bge-base-en-v1.5 | Best MTEB score in size class |
| Vector database | FAISS, ChromaDB, Qdrant | ChromaDB | Simpler ops, metadata filtering |
| Chunking | Fixed, recursive, semantic | Recursive + headers | Documents have clear structure |
| Search | Dense only, hybrid | Hybrid (dense + BM25) | Technical terms need keyword match |
| Reranking | None, cross-encoder | bge-reranker-base | Significant precision improvement |
| LLM | GPT-4, Claude, Llama | GPT-4 Turbo | Best instruction following |
Pipeline Architecture
Document Sources (Confluence, GitHub, Notion, READMEs)
│
▼
Ingestion Service (Python + Celery)
│
├── Document Fetcher (platform-specific adapters)
├── Parser (Markdown, HTML, PDF)
├── Chunker (recursive + header-aware)
├── Embedder (bge-base-en-v1.5)
└── Indexer (ChromaDB + BM25)
│
▼
Query Service (FastAPI)
│
├── Query Processor (expansion, embedding)
├── Hybrid Retriever (dense + sparse + RRF)
├── Reranker (bge-reranker-base)
├── Prompt Builder (with citations)
└── Generator (GPT-4 Turbo)
│
▼
User Interface (Slack bot + Web UI)
Implementation Details
Phase 1: Document Ingestion
The team built platform-specific adapters for each documentation source:
"""Document ingestion pipeline for CloudScale RAG system."""
import hashlib
from dataclasses import dataclass, field
from typing import Optional

import torch
from sentence_transformers import SentenceTransformer

torch.manual_seed(42)


@dataclass
class DocumentChunk:
    """Represents a chunk of a document with metadata.

    Attributes:
        text: The chunk text content.
        source: Origin platform and document path.
        section: Section header for the chunk.
        chunk_id: Unique identifier for deduplication.
        metadata: Additional metadata for filtering.
    """

    text: str
    source: str
    section: str
    chunk_id: str = ""
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        """Generate chunk_id from content hash if not provided."""
        if not self.chunk_id:
            self.chunk_id = hashlib.md5(
                f"{self.source}:{self.text[:100]}".encode()
            ).hexdigest()
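For illustration, the content-hash identifier is deterministic, so re-ingesting the same text from the same source produces the same chunk_id, which is what the deduplication step relies on. A quick usage example (the values are made up):

chunk = DocumentChunk(
    text="To rotate an API key, open the service settings page and...",
    source="confluence/platform-docs/api-key-rotation",
    section="Rotating API Keys",
)
print(chunk.chunk_id)  # stable MD5 hex digest of source + first 100 chars of text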
The header-aware chunking strategy was critical for their structured documentation:
def chunk_markdown_document(
    text: str,
    source: str,
    max_chunk_size: int = 512,
    overlap: int = 50,
) -> list[DocumentChunk]:
    """Split a Markdown document into chunks respecting header structure.

    Args:
        text: The full document text in Markdown format.
        source: The source identifier for the document.
        max_chunk_size: Maximum number of tokens per chunk.
        overlap: Number of overlapping tokens between chunks.

    Returns:
        A list of DocumentChunk objects with section metadata.
    """
    sections = split_by_headers(text)
    chunks = []
    for section_title, section_text in sections:
        if count_tokens(section_text) <= max_chunk_size:
            chunks.append(DocumentChunk(
                text=section_text,
                source=source,
                section=section_title,
                metadata={
                    "type": "documentation",
                    "section": section_title,
                    "token_count": count_tokens(section_text),
                },
            ))
        else:
            # Recursive split within the section
            sub_chunks = recursive_split(
                section_text, max_chunk_size, overlap
            )
            for i, sub_chunk in enumerate(sub_chunks):
                chunks.append(DocumentChunk(
                    text=sub_chunk,
                    source=source,
                    section=f"{section_title} (part {i + 1})",
                    metadata={
                        "type": "documentation",
                        "section": section_title,
                        "part": i + 1,
                        "token_count": count_tokens(sub_chunk),
                    },
                ))
    return chunks
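The split_by_headers helper is not shown in the case study; a minimal sketch of what it might look like, assuming the documentation uses Markdown ATX headings (the regex and the fallback section titles are illustrative):

import re

def split_by_headers(text: str) -> list[tuple[str, str]]:
    """Split Markdown text into (section_title, section_text) pairs.

    Treats every ATX heading (#, ##, ...) as a section boundary and lumps
    any preamble before the first heading into an "Introduction" section.
    """
    pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    if not matches:
        return [("Document", text)]
    sections: list[tuple[str, str]] = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        sections.append(("Introduction", preamble))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        sections.append((match.group(2).strip(), body))
    return sections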
Phase 2: Hybrid Search
The team implemented hybrid search combining dense retrieval with BM25:
"""Hybrid search combining dense and sparse retrieval with RRF."""
from dataclasses import dataclass

import chromadb
import torch
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

torch.manual_seed(42)


@dataclass
class SearchResult:
    """A single search result with score and metadata.

    Attributes:
        chunk_id: Unique identifier for the chunk.
        text: The chunk text.
        score: Combined relevance score.
        source: Origin of the document.
        section: Section within the document.
    """

    chunk_id: str
    text: str
    score: float
    source: str
    section: str


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> dict[str, float]:
    """Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: List of ranked document ID lists.
        k: RRF constant (default 60).

    Returns:
        Dictionary mapping document IDs to fused scores.
    """
    fused_scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1.0 / (k + rank)
    return fused_scores
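Only the fusion step appears above; a minimal sketch of how the hybrid retriever might wire the dense and sparse sides together, assuming a ChromaDB collection and a BM25Okapi index built over the same chunks at ingestion time, plus a chunk_lookup mapping chunk ids back to DocumentChunk objects (these names are illustrative, not CloudScale's actual interfaces):

def hybrid_search(
    query: str,
    embedder: SentenceTransformer,
    collection,  # ChromaDB collection created by the indexer
    bm25: BM25Okapi,
    bm25_chunk_ids: list[str],
    chunk_lookup: dict,  # chunk_id -> DocumentChunk from the ingestion module
    top_k: int = 50,
) -> list[SearchResult]:
    """Run dense and BM25 retrieval, then fuse the rankings with RRF."""
    # Dense side: embed the query and ask ChromaDB for the nearest chunks.
    query_embedding = embedder.encode(query).tolist()
    dense = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    dense_ids = dense["ids"][0]

    # Sparse side: score every chunk with BM25 and keep the top_k ids.
    bm25_scores = bm25.get_scores(query.lower().split())
    sparse_order = sorted(
        range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True
    )
    sparse_ids = [bm25_chunk_ids[i] for i in sparse_order[:top_k]]

    # Fuse the two rankings and materialize SearchResult objects.
    fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    results = []
    for chunk_id, score in ranked:
        chunk = chunk_lookup[chunk_id]
        results.append(SearchResult(
            chunk_id=chunk_id,
            text=chunk.text,
            score=score,
            source=chunk.source,
            section=chunk.section,
        ))
    return results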
Phase 3: Reranking and Generation
After hybrid retrieval, a cross-encoder reranker refined the results:
"""Reranking and response generation for the RAG pipeline."""
import torch
from sentence_transformers import CrossEncoder

torch.manual_seed(42)
SYSTEM_PROMPT = """You are CloudScale's documentation assistant. Answer questions
based ONLY on the provided documentation excerpts.
Rules:
1. Only use information from the provided context.
2. Cite sources using [Source: document_name, Section: section_name].
3. If the context doesn't contain the answer, say "I couldn't find this
information in our documentation. You might want to check with the
relevant team or create a documentation request."
4. If documentation appears outdated, flag it with a warning.
5. Be concise but complete.
"""
def rerank_results(
    query: str,
    results: list[SearchResult],
    model_name: str = "BAAI/bge-reranker-base",
    top_k: int = 5,
) -> list[SearchResult]:
    """Rerank search results using a cross-encoder model.

    Args:
        query: The user's search query.
        results: Initial retrieval results to rerank.
        model_name: Name of the cross-encoder model.
        top_k: Number of top results to return.

    Returns:
        Reranked list of top-k SearchResult objects.
    """
    reranker = CrossEncoder(model_name)
    pairs = [(query, result.text) for result in results]
    scores = reranker.predict(pairs)
    for result, score in zip(results, scores):
        result.score = float(score)
    reranked = sorted(results, key=lambda x: x.score, reverse=True)
    return reranked[:top_k]
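The "Prompt Builder (with citations)" and "Generator" stages from the pipeline diagram are not shown; a minimal sketch of how the reranked chunks might be formatted and passed to the model, assuming the OpenAI chat completions client and the citation format requested in the system prompt (the exact template and wiring are assumptions):

from openai import OpenAI

def build_prompt(query: str, results: list[SearchResult]) -> str:
    """Format reranked chunks into a context block the LLM can cite from."""
    context_blocks = []
    for result in results:
        context_blocks.append(
            f"[Source: {result.source}, Section: {result.section}]\n{result.text}"
        )
    context = "\n\n---\n\n".join(context_blocks)
    return (
        f"Documentation excerpts:\n\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer the question using only the excerpts above, citing sources."
    )

def generate_answer(query: str, results: list[SearchResult]) -> str:
    """Call the generator with the system prompt and the built user prompt."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.1,
        max_tokens=1024,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(query, results)},
        ],
    )
    return response.choices[0].message.content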
Evaluation Results
The team conducted rigorous evaluation using 200 curated Q&A pairs:
Retrieval Quality
| Configuration | Recall@5 | Precision@5 | MRR |
|---|---|---|---|
| Dense only | 0.72 | 0.58 | 0.65 |
| BM25 only | 0.68 | 0.52 | 0.59 |
| Hybrid (RRF) | 0.81 | 0.64 | 0.74 |
| Hybrid + Reranking | 0.81 | 0.78 | 0.82 |
Hybrid search with reranking improved MRR by 26% over dense-only retrieval.
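For reference, Recall@k and MRR can be computed directly from the 200-question set; a minimal sketch, assuming each query is labeled with the set of chunk ids that answer it (the data layout is an assumption):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant chunk across all queries."""
    reciprocal_ranks = []
    for retrieved, relevant in runs:
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0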
Generation Quality
Using the RAGAS framework on the 200 test queries:
| Metric | Score |
|---|---|
| Faithfulness | 0.91 |
| Answer Relevance | 0.87 |
| Context Precision | 0.83 |
| Context Recall | 0.79 |
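A sketch of how these scores could be produced with RAGAS, assuming a ragas release that exposes evaluate() and the four metric objects and accepts a Hugging Face Dataset with question, answer, contexts, and ground_truth columns (the API differs slightly across ragas versions, and the single example row is made up):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_dataset = Dataset.from_dict({
    "question": ["How do we rotate API keys?"],  # 200 entries in practice
    "answer": ["Keys are rotated from the service settings page..."],
    "contexts": [["[Source: confluence/platform-docs] To rotate an API key..."]],
    "ground_truth": ["API keys are rotated via the service settings page."],
})

scores = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)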
Business Impact (After 3-Month Pilot)
| Metric | Before | After | Change |
|---|---|---|---|
| Avg. search time | 45 min | 12 min | -73% |
| Incorrect guidance incidents | 8/month | 2/month | -75% |
| Documentation requests filed | 15/month | 35/month | +133% |
| Engineer satisfaction (1-5) | 2.3 | 4.1 | +78% |
The increase in documentation requests was an unexpected positive outcome: the system surfaced gaps in documentation that engineers previously worked around.
Challenges and Solutions
Challenge 1: Code Blocks in Documentation
Technical documentation contained extensive code blocks that were poorly handled by standard text chunking.
Solution: Custom chunking that kept code blocks intact and added context headers:
def chunk_with_code_awareness(
    text: str,
    max_chunk_size: int = 512,
) -> list[str]:
    """Chunk text while keeping code blocks intact.

    Args:
        text: Document text potentially containing code blocks.
        max_chunk_size: Maximum tokens per chunk.

    Returns:
        List of text chunks with code blocks preserved.
    """
    segments = split_preserving_code_blocks(text)
    chunks = []
    current_chunk = ""
    for segment in segments:
        if is_code_block(segment):
            if count_tokens(segment) > max_chunk_size:
                # Oversized code block: flush the current chunk, then store the
                # block on its own, truncated using a rough 4-chars-per-token
                # approximation.
                if current_chunk:
                    chunks.append(current_chunk)
                    current_chunk = ""
                chunks.append(segment[:max_chunk_size * 4])
            elif count_tokens(current_chunk + segment) > max_chunk_size:
                chunks.append(current_chunk)
                current_chunk = segment
            else:
                current_chunk += "\n" + segment
        else:
            if count_tokens(current_chunk + segment) > max_chunk_size:
                chunks.append(current_chunk)
                current_chunk = segment
            else:
                current_chunk += "\n" + segment
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
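The split_preserving_code_blocks and is_code_block helpers are not shown; a minimal sketch of how they could work for fenced Markdown code blocks (the fence handling is an assumption about CloudScale's documentation format):

import re

CODE_BLOCK_PATTERN = re.compile(r"```.*?```", re.DOTALL)

def split_preserving_code_blocks(text: str) -> list[str]:
    """Split text into alternating prose and fenced-code segments."""
    segments: list[str] = []
    last_end = 0
    for match in CODE_BLOCK_PATTERN.finditer(text):
        prose = text[last_end:match.start()].strip()
        if prose:
            segments.append(prose)
        segments.append(match.group(0))
        last_end = match.end()
    trailing = text[last_end:].strip()
    if trailing:
        segments.append(trailing)
    return segments

def is_code_block(segment: str) -> bool:
    """A segment counts as a code block if it is wrapped in triple backticks."""
    return segment.startswith("```") and segment.endswith("```")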
Challenge 2: Outdated Documentation
30% of documents were outdated, leading to incorrect answers.
Solution:
- Added last_updated metadata and a staleness score (sketched after this list).
- The prompt instructed the LLM to warn about documents older than 6 months.
- Implemented a "documentation health" dashboard showing staleness metrics.
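A minimal sketch of the staleness scoring, assuming last_updated is stored as an ISO-8601 date string and reusing the 180-day threshold from the production config (the linear scoring curve is an assumption):

from datetime import datetime, timezone

STALENESS_WARNING_DAYS = 180

def staleness_score(last_updated: str) -> float:
    """Map a last_updated ISO date to a score from 0.0 (fresh) to 1.0 (stale).

    Documents at or beyond the warning threshold score 1.0 and trigger the
    outdated-documentation warning in the prompt.
    """
    updated = datetime.fromisoformat(last_updated).replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - updated).days
    return min(max(age_days / STALENESS_WARNING_DAYS, 0.0), 1.0)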
Challenge 3: Cross-Platform Duplication
The same information existed in slightly different forms across platforms.
Solution:
- Content deduplication using MinHash/LSH for near-duplicate detection (see the sketch after this list).
- When duplicates were found, the most recent version was prioritized.
- A report was generated for teams to consolidate duplicates.
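A minimal sketch of the near-duplicate detection, assuming the datasketch library's MinHash and MinHashLSH, word-trigram shingles, and a Jaccard threshold of 0.9 (the shingling and threshold are assumptions), with DocumentChunk as defined in the ingestion pipeline:

from datasketch import MinHash, MinHashLSH

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word-level trigram shingles of a chunk."""
    minhash = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):
        shingle = " ".join(tokens[i:i + 3])
        minhash.update(shingle.encode("utf-8"))
    return minhash

def find_near_duplicates(chunks: list[DocumentChunk]) -> dict[str, list[str]]:
    """Return, for each chunk id, the ids of its near-duplicate chunks."""
    lsh = MinHashLSH(threshold=0.9, num_perm=128)
    signatures = {chunk.chunk_id: build_minhash(chunk.text) for chunk in chunks}
    for chunk_id, signature in signatures.items():
        lsh.insert(chunk_id, signature)
    duplicates = {}
    for chunk_id, signature in signatures.items():
        matches = [other for other in lsh.query(signature) if other != chunk_id]
        if matches:
            duplicates[chunk_id] = matches
    return duplicates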
Lessons Learned
- Chunking quality matters more than embedding model quality. The team spent 2 weeks optimizing chunks and saw a larger improvement than switching from MiniLM to BGE.
- Hybrid search is almost always better than dense-only. Technical documentation contains specific terms (API names, error codes) that dense retrieval misses.
- Reranking is the highest-ROI improvement. Adding a cross-encoder reranker was a one-day implementation that improved precision by 22%.
- Evaluation drives improvement. Without the 200-question evaluation set, the team was unable to objectively compare configurations.
- User feedback is essential. The Slack bot's thumbs-up/thumbs-down feature identified failure modes that automated evaluation missed.
- Start with the simplest approach. The team initially tried to build agentic RAG with multi-step retrieval. The simpler pipeline with reranking performed better and was much easier to debug.
Production Configuration
# rag_config.yaml
embedding:
  model: "BAAI/bge-base-en-v1.5"
  batch_size: 64
  max_length: 512

chunking:
  strategy: "recursive_header_aware"
  max_chunk_size: 512
  overlap: 50
  preserve_code_blocks: true

retrieval:
  dense_top_k: 50
  bm25_top_k: 50
  rrf_k: 60
  hybrid_weight_dense: 0.6

reranking:
  model: "BAAI/bge-reranker-base"
  top_k: 5

generation:
  model: "gpt-4-turbo"
  temperature: 0.1
  max_tokens: 1024

ingestion:
  schedule: "every 6 hours"
  staleness_warning_days: 180
  deduplication: true
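For completeness, a sketch of how a service might load this configuration at startup, assuming PyYAML (the accessors are illustrative):

import yaml

def load_rag_config(path: str = "rag_config.yaml") -> dict:
    """Load the RAG pipeline configuration from YAML."""
    with open(path, encoding="utf-8") as config_file:
        return yaml.safe_load(config_file)

config = load_rag_config()
print(config["retrieval"]["dense_top_k"])  # 50
print(config["generation"]["model"])       # gpt-4-turbo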
Cost Analysis (Monthly)
| Component | Cost |
|---|---|
| Embedding computation (re-indexing) | $45 |
| Vector storage (ChromaDB on EC2) | $120 |
| LLM API calls (~10K queries/month) | $380 |
| Reranker inference (GPU) | $90 |
| Infrastructure (API server, workers) | $200 |
| Total | $835/month |
At 500 engineers each saving 33 minutes per day, the system saves roughly 275 engineer-hours per working day (well over 5,000 hours per month), making it highly cost-effective against the $835 monthly cost.