Case Study 1: Building a RAG System for Technical Documentation


Overview

Company: CloudScale Solutions, a mid-sized SaaS company with 500 engineers.

Challenge: Engineers spend an average of 45 minutes per day searching internal documentation across Confluence, GitHub wikis, Notion pages, and README files. The information is scattered across 15,000+ documents totaling approximately 50 million tokens.

Goal: Build a RAG-powered documentation assistant that reduces search time by 70% and provides accurate, cited answers.


Problem Analysis

CloudScale's documentation challenges were typical of growing engineering organizations:

  1. Fragmentation: Documentation lived in 4 different platforms with no unified search.
  2. Staleness: 30% of documents were outdated, leading to incorrect guidance.
  3. Discoverability: Engineers often did not know documentation existed for their question.
  4. Context loss: Search results returned entire pages; engineers had to read lengthy documents to find the relevant paragraph.

An initial survey revealed:
  - 78% of engineers reported difficulty finding relevant documentation.
  - 45% had followed outdated documentation that led to production issues.
  - Average time from question to answer: 45 minutes (including asking colleagues).


System Architecture

Design Decisions

The team evaluated several approaches before settling on their architecture:

Decision          Options Considered           Choice                   Rationale
Embedding model   MiniLM, BGE, E5              BAAI/bge-base-en-v1.5    Best MTEB score in size class
Vector database   FAISS, ChromaDB, Qdrant      ChromaDB                 Simpler ops, metadata filtering
Chunking          Fixed, recursive, semantic   Recursive + headers      Documents have clear structure
Search            Dense only, hybrid           Hybrid (dense + BM25)    Technical terms need keyword match
Reranking         None, cross-encoder          bge-reranker-base        Significant precision improvement
LLM               GPT-4, Claude, Llama         GPT-4 Turbo              Best instruction following

Pipeline Architecture

Document Sources (Confluence, GitHub, Notion, READMEs)
    │
    ▼
Ingestion Service (Python + Celery)
    │
    ├── Document Fetcher (platform-specific adapters)
    ├── Parser (Markdown, HTML, PDF)
    ├── Chunker (recursive + header-aware)
    ├── Embedder (bge-base-en-v1.5)
    └── Indexer (ChromaDB + BM25)
    │
    ▼
Query Service (FastAPI)
    │
    ├── Query Processor (expansion, embedding)
    ├── Hybrid Retriever (dense + sparse + RRF)
    ├── Reranker (bge-reranker-base)
    ├── Prompt Builder (with citations)
    └── Generator (GPT-4 Turbo)
    │
    ▼
User Interface (Slack bot + Web UI)

Implementation Details

Phase 1: Document Ingestion

The team built platform-specific adapters for each documentation source:

"""Document ingestion pipeline for CloudScale RAG system."""

import hashlib
from dataclasses import dataclass, field
from typing import Optional

import torch
from sentence_transformers import SentenceTransformer

torch.manual_seed(42)


@dataclass
class DocumentChunk:
    """Represents a chunk of a document with metadata.

    Attributes:
        text: The chunk text content.
        source: Origin platform and document path.
        section: Section header for the chunk.
        chunk_id: Unique identifier for deduplication.
        metadata: Additional metadata for filtering.
    """

    text: str
    source: str
    section: str
    chunk_id: str = ""
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        """Generate chunk_id from content hash if not provided."""
        if not self.chunk_id:
            self.chunk_id = hashlib.md5(
                f"{self.source}:{self.text[:100]}".encode()
            ).hexdigest()
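
The content-hash chunk_id makes re-ingestion idempotent: the same text from the same source always hashes to the same ID, so the indexer can upsert rather than duplicate. A quick usage sketch (the source path below is hypothetical, not from the case study):

chunk_a = DocumentChunk(
    text="Run `make deploy` to push the service to staging.",
    source="github:platform/deploy/README.md",
    section="Deploying to staging",
)
chunk_b = DocumentChunk(
    text="Run `make deploy` to push the service to staging.",
    source="github:platform/deploy/README.md",
    section="Deploying to staging",
)
# Identical source and text produce identical IDs across ingestion runs.
assert chunk_a.chunk_id == chunk_b.chunk_id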

The header-aware chunking strategy was critical for their structured documentation:

def chunk_markdown_document(
    text: str,
    source: str,
    max_chunk_size: int = 512,
    overlap: int = 50,
) -> list[DocumentChunk]:
    """Split a Markdown document into chunks respecting header structure.

    Args:
        text: The full document text in Markdown format.
        source: The source identifier for the document.
        max_chunk_size: Maximum number of tokens per chunk.
        overlap: Number of overlapping tokens between chunks.

    Returns:
        A list of DocumentChunk objects with section metadata.
    """
    sections = split_by_headers(text)
    chunks = []

    for section_title, section_text in sections:
        if count_tokens(section_text) <= max_chunk_size:
            chunks.append(DocumentChunk(
                text=section_text,
                source=source,
                section=section_title,
                metadata={
                    "type": "documentation",
                    "section": section_title,
                    "token_count": count_tokens(section_text),
                },
            ))
        else:
            # Recursive split within section
            sub_chunks = recursive_split(
                section_text, max_chunk_size, overlap
            )
            for i, sub_chunk in enumerate(sub_chunks):
                chunks.append(DocumentChunk(
                    text=sub_chunk,
                    source=source,
                    section=f"{section_title} (part {i + 1})",
                    metadata={
                        "type": "documentation",
                        "section": section_title,
                        "part": i + 1,
                        "token_count": count_tokens(sub_chunk),
                    },
                ))

    return chunks
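
The helpers referenced above (split_by_headers, count_tokens, recursive_split) are not shown in the case study. A minimal sketch, assuming Markdown ATX headers and a whitespace-based token approximation rather than the real tokenizer, might look like this:

import re


def split_by_headers(text: str) -> list[tuple[str, str]]:
    """Split Markdown into (header, body) pairs on ATX headers (#, ##, ...)."""
    parts = re.split(r"^(#{1,6}\s+.*)$", text, flags=re.MULTILINE)
    sections: list[tuple[str, str]] = []
    if parts[0].strip():
        sections.append(("(no header)", parts[0].strip()))
    # re.split with one capture group alternates body, header, body, header, ...
    for header, body in zip(parts[1::2], parts[2::2]):
        sections.append((header.lstrip("# ").strip(), body.strip()))
    return sections


def count_tokens(text: str) -> int:
    """Approximate token count; production code would use the model tokenizer."""
    return len(text.split())


def recursive_split(text: str, max_chunk_size: int, overlap: int) -> list[str]:
    """Fallback splitter: sliding window of whitespace tokens with overlap."""
    words = text.split()
    step = max(max_chunk_size - overlap, 1)
    return [
        " ".join(words[start:start + max_chunk_size])
        for start in range(0, len(words), step)
    ]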

Phase 2: Hybrid Retrieval

The team implemented hybrid search combining dense retrieval with BM25:

"""Hybrid search combining dense and sparse retrieval with RRF."""

from dataclasses import dataclass

import chromadb
import torch
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

torch.manual_seed(42)


@dataclass
class SearchResult:
    """A single search result with score and metadata.

    Attributes:
        chunk_id: Unique identifier for the chunk.
        text: The chunk text.
        score: Combined relevance score.
        source: Origin of the document.
        section: Section within the document.
    """

    chunk_id: str
    text: str
    score: float
    source: str
    section: str


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> dict[str, float]:
    """Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: List of ranked document ID lists.
        k: RRF constant (default 60).

    Returns:
        Dictionary mapping document IDs to fused scores.
    """
    fused_scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1.0 / (k + rank)
    return fused_scores
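
The retriever that feeds reciprocal_rank_fusion is not shown in the case study. A sketch of one plausible wiring, assuming the chunk texts and metadata are kept in memory alongside the ChromaDB collection and that collection IDs match chunk_ids (the chunks_by_id lookup is a hypothetical helper structure):

import numpy as np


def hybrid_search(
    query: str,
    collection,                     # ChromaDB collection holding dense embeddings
    embedder: SentenceTransformer,  # same bge-base-en-v1.5 model used at index time
    bm25: BM25Okapi,                # BM25 index over the tokenized chunk corpus
    chunk_ids: list[str],           # chunk_id for each BM25 corpus position
    chunks_by_id: dict[str, SearchResult],
    top_k: int = 50,
) -> list[SearchResult]:
    """Run dense and sparse retrieval, then fuse the rankings with RRF."""
    # Dense retrieval via ChromaDB.
    query_embedding = embedder.encode(query).tolist()
    dense = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    dense_ids = dense["ids"][0]

    # Sparse retrieval via BM25.
    scores = bm25.get_scores(query.lower().split())
    sparse_ids = [chunk_ids[i] for i in np.argsort(scores)[::-1][:top_k]]

    # Fuse the two rankings and return results ordered by fused score.
    fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
    results = []
    for doc_id, score in sorted(fused.items(), key=lambda kv: kv[1], reverse=True):
        result = chunks_by_id[doc_id]
        result.score = score
        results.append(result)
    return results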

Phase 3: Reranking and Generation

After hybrid retrieval, a cross-encoder reranker refined the results:

"""Reranking and response generation for the RAG pipeline."""

import torch
from sentence_transformers import CrossEncoder

# SearchResult is the dataclass defined in the hybrid retrieval module above.
torch.manual_seed(42)

SYSTEM_PROMPT = """You are CloudScale's documentation assistant. Answer questions
based ONLY on the provided documentation excerpts.

Rules:
1. Only use information from the provided context.
2. Cite sources using [Source: document_name, Section: section_name].
3. If the context doesn't contain the answer, say "I couldn't find this
   information in our documentation. You might want to check with the
   relevant team or create a documentation request."
4. If documentation appears outdated, flag it with a warning.
5. Be concise but complete.
"""


def rerank_results(
    query: str,
    results: list[SearchResult],
    model_name: str = "BAAI/bge-reranker-base",
    top_k: int = 5,
) -> list[SearchResult]:
    """Rerank search results using a cross-encoder model.

    Args:
        query: The user's search query.
        results: Initial retrieval results to rerank.
        model_name: Name of the cross-encoder model.
        top_k: Number of top results to return.

    Returns:
        Reranked list of top-k SearchResult objects.
    """
    reranker = CrossEncoder(model_name)
    pairs = [(query, result.text) for result in results]
    scores = reranker.predict(pairs)

    for result, score in zip(results, scores):
        result.score = float(score)

    reranked = sorted(results, key=lambda x: x.score, reverse=True)
    return reranked[:top_k]
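
The prompt builder and generator are not shown in the case study. A minimal sketch of how the reranked chunks might be formatted for citation and sent to GPT-4 Turbo, assuming the OpenAI chat completions client (the exact call and prompt layout are assumptions):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(query: str, results: list[SearchResult]) -> str:
    """Format reranked chunks as citable excerpts followed by the question."""
    excerpts = [
        f"[Source: {result.source}, Section: {result.section}]\n{result.text}"
        for result in results
    ]
    context = "\n\n---\n\n".join(excerpts)
    return f"Documentation excerpts:\n\n{context}\n\nQuestion: {query}"


def answer_question(query: str, results: list[SearchResult]) -> str:
    """Generate a cited answer grounded in the reranked context."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.1,
        max_tokens=1024,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(query, results)},
        ],
    )
    return response.choices[0].message.content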

Evaluation Results

The team conducted rigorous evaluation using 200 curated Q&A pairs:

Retrieval Quality

Configuration        Recall@5   Precision@5   MRR
Dense only           0.72       0.58          0.65
BM25 only            0.68       0.52          0.59
Hybrid (RRF)         0.81       0.64          0.74
Hybrid + Reranking   0.81       0.78          0.82

Hybrid search with reranking improved MRR by 26% over dense-only retrieval.
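
The evaluation harness is not included in the case study; Recall@5 and MRR over the 200 Q&A pairs only require the retrieved chunk IDs per query and the labeled relevant chunk. A sketch, assuming one labeled relevant chunk per question:

def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if rel in ids[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant chunk (0 when it is not retrieved)."""
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        if rel in ids:
            total += 1.0 / (ids.index(rel) + 1)
    return total / len(relevant)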

Generation Quality

Using RAGAS framework on 200 test queries:

Metric              Score
Faithfulness        0.91
Answer Relevance    0.87
Context Precision   0.83
Context Recall      0.79
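
The RAGAS run itself is not shown. With a RAGAS 0.1-style API, evaluation over question/answer/contexts/ground-truth records looks roughly like the following; column names and metric imports vary between RAGAS versions, so treat the details as assumptions:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One record per test query: the generated answer, the retrieved chunks shown
# to the model, and a human-written reference answer.
records = {
    "question": ["How do I rotate the staging database credentials?"],
    "answer": ["Trigger the rotate-creds job; it updates the secrets store ..."],
    "contexts": [["[Source: runbooks/db.md] To rotate credentials, run ..."]],
    "ground_truth": ["Run the rotate-creds CI job, which rotates and stores ..."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)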

Business Impact (After 3-Month Pilot)

Metric                          Before     After      Change
Avg. search time                45 min     12 min     -73%
Incorrect guidance incidents    8/month    2/month    -75%
Documentation requests filed    15/month   35/month   +133%
Engineer satisfaction (1-5)     2.3        4.1        +78%

The increase in documentation requests was an unexpected positive outcome: the system surfaced gaps in documentation that engineers previously worked around.


Challenges and Solutions

Challenge 1: Code Blocks in Documentation

Technical documentation contained extensive code blocks that were poorly handled by standard text chunking.

Solution: Custom chunking that kept code blocks intact and added context headers:

def chunk_with_code_awareness(
    text: str,
    max_chunk_size: int = 512,
) -> list[str]:
    """Chunk text while keeping code blocks intact.

    Args:
        text: Document text potentially containing code blocks.
        max_chunk_size: Maximum tokens per chunk.

    Returns:
        List of text chunks with code blocks preserved.
    """
    segments = split_preserving_code_blocks(text)
    chunks = []
    current_chunk = ""

    for segment in segments:
        if is_code_block(segment):
            if count_tokens(segment) > max_chunk_size:
                # Oversized code block: flush the current chunk, then emit the
                # block on its own, truncated to an approximate character limit.
                if current_chunk:
                    chunks.append(current_chunk)
                    current_chunk = ""
                chunks.append(segment[:max_chunk_size * 4])  # ~4 chars per token
            elif count_tokens(current_chunk + segment) > max_chunk_size:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = segment
            else:
                current_chunk = f"{current_chunk}\n{segment}" if current_chunk else segment
        else:
            if count_tokens(current_chunk + segment) > max_chunk_size:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = segment
            else:
                current_chunk = f"{current_chunk}\n{segment}" if current_chunk else segment

    if current_chunk:
        chunks.append(current_chunk)

    return chunks
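
split_preserving_code_blocks and is_code_block are not defined in the case study. One plausible implementation, assuming fenced Markdown code blocks:

import re

_CODE_FENCE = re.compile(r"(```.*?```)", re.DOTALL)


def split_preserving_code_blocks(text: str) -> list[str]:
    """Split text into alternating prose and fenced-code segments."""
    return [segment for segment in _CODE_FENCE.split(text) if segment.strip()]


def is_code_block(segment: str) -> bool:
    """Return True if the segment is a fenced Markdown code block."""
    return segment.lstrip().startswith("```")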

Challenge 2: Outdated Documentation

30% of documents were outdated, leading to incorrect answers.

Solution:
  - Added last_updated metadata and a staleness score (a sketch follows below).
  - The prompt instructed the LLM to warn about documents older than 6 months.
  - Implemented a "documentation health" dashboard showing staleness metrics.
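
The exact staleness formula is not given in the case study; a sketch assuming a simple linear score that saturates at one year, with the 180-day warning threshold from the production config:

from datetime import datetime, timezone


def staleness_score(last_updated: datetime, max_age_days: int = 365) -> float:
    """Return 0.0 for a freshly updated document, rising to 1.0 at max_age_days."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    return min(max(age_days, 0) / max_age_days, 1.0)


def staleness_warning(last_updated: datetime, warning_days: int = 180) -> str | None:
    """Warning text injected into the prompt for documents older than ~6 months."""
    age_days = (datetime.now(timezone.utc) - last_updated).days
    if age_days > warning_days:
        return f"Warning: this document was last updated {age_days} days ago."
    return None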

Challenge 3: Cross-Platform Duplication

The same information existed in slightly different forms across platforms.

Solution:
  - Content deduplication using MinHash/LSH for near-duplicate detection (a sketch follows below).
  - When duplicates were found, the most recent version was prioritized.
  - A report was generated for teams to consolidate duplicates.
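
The case study names MinHash/LSH but not a library; a sketch using datasketch (the library choice and shingle size are assumptions):

from datasketch import MinHash, MinHashLSH


def minhash_for(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over lowercase 3-word shingles."""
    words = text.lower().split()
    signature = MinHash(num_perm=num_perm)
    for shingle in zip(words, words[1:], words[2:]):
        signature.update(" ".join(shingle).encode("utf-8"))
    return signature


def find_near_duplicates(
    chunks: list[DocumentChunk],
    threshold: float = 0.9,
) -> list[tuple[str, str]]:
    """Return chunk_id pairs whose estimated Jaccard similarity exceeds threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    duplicates = []
    for chunk in chunks:
        signature = minhash_for(chunk.text)
        for existing_id in lsh.query(signature):
            duplicates.append((existing_id, chunk.chunk_id))
        lsh.insert(chunk.chunk_id, signature)
    return duplicates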


Lessons Learned

  1. Chunking quality matters more than embedding model quality. The team spent 2 weeks optimizing chunks and saw a larger improvement than switching from MiniLM to BGE.

  2. Hybrid search is almost always better than dense-only. Technical documentation contains specific terms (API names, error codes) that dense retrieval misses.

  3. Reranking is the highest-ROI improvement. Adding a cross-encoder reranker was a one-day implementation that improved precision by 22%.

  4. Evaluation drives improvement. Without the 200-question evaluation set, the team was unable to objectively compare configurations.

  5. User feedback is essential. The Slack bot's thumbs-up/thumbs-down feature identified failure modes that automated evaluation missed.

  6. Start with the simplest approach. The team initially tried to build agentic RAG with multi-step retrieval. The simpler pipeline with reranking performed better and was much easier to debug.


Production Configuration

# rag_config.yaml
embedding:
  model: "BAAI/bge-base-en-v1.5"
  batch_size: 64
  max_length: 512

chunking:
  strategy: "recursive_header_aware"
  max_chunk_size: 512
  overlap: 50
  preserve_code_blocks: true

retrieval:
  dense_top_k: 50
  bm25_top_k: 50
  rrf_k: 60
  hybrid_weight_dense: 0.6

reranking:
  model: "BAAI/bge-reranker-base"
  top_k: 5

generation:
  model: "gpt-4-turbo"
  temperature: 0.1
  max_tokens: 1024

ingestion:
  schedule: "every 6 hours"
  staleness_warning_days: 180
  deduplication: true
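
The hybrid_weight_dense setting implies a weighted variant of the fusion step, which the earlier (unweighted) reciprocal_rank_fusion does not use. One way such a weight might be applied, shown here as an assumption rather than the team's actual code, is to scale each list's contribution:

def weighted_rrf(
    dense_ids: list[str],
    sparse_ids: list[str],
    dense_weight: float = 0.6,  # hybrid_weight_dense from rag_config.yaml
    k: int = 60,
) -> dict[str, float]:
    """RRF where the dense ranking contributes dense_weight and BM25 the rest."""
    fused: dict[str, float] = {}
    for weight, ranked in ((dense_weight, dense_ids), (1.0 - dense_weight, sparse_ids)):
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return fused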

Cost Analysis (Monthly)

Component                               Cost
Embedding computation (re-indexing)     $45
Vector storage (ChromaDB on EC2)        $120
LLM API calls (~10K queries/month)      $380
Reranker inference (GPU)                $90
Infrastructure (API server, workers)    $200
Total                                   $835/month

At 500 engineers saving 33 minutes per day, the estimated savings were roughly 275 engineer-hours per working day, or about 5,800 engineer-hours per month, making the system highly cost-effective relative to its $835 monthly cost.