"The best way to give a language model accurate, up-to-date knowledge is not to memorize the entire internet—it's to teach it how to look things up."
In This Chapter
- 31.1 Introduction: The Limits of Parametric Knowledge
- 31.2 The RAG Architecture
- 31.3 Embedding Models
- 31.4 Vector Databases
- 31.5 Document Processing and Chunking
- 31.6 Retrieval Strategies
- 31.7 Query Transformation
- 31.8 Advanced RAG Patterns
- 31.9 RAG Evaluation
- 31.10 Production RAG Considerations
- 31.11 Handling Edge Cases and Failure Modes
- 31.12 Chunking Deep Dive: Practical Considerations
- 31.13 End-to-End RAG Pipeline Example
- 31.14 Comparison with Alternatives
- 31.15 Common Pitfalls and Debugging RAG Systems
- 31.16 The Future of RAG
- Summary
- References
Chapter 31: Retrieval-Augmented Generation (RAG)
Part VI: AI Systems Engineering
"The best way to give a language model accurate, up-to-date knowledge is not to memorize the entire internet—it's to teach it how to look things up."
31.1 Introduction: The Limits of Parametric Knowledge
Large language models (LLMs) are remarkable feats of engineering. They compress trillions of tokens of text into billions of parameters, creating systems that can reason, converse, and generate across an astonishing range of topics. Yet this very compression introduces fundamental limitations that no amount of scaling can fully resolve.
The parametric knowledge problem manifests in several critical ways:
- Knowledge cutoff: Models are frozen at training time. A model trained in January 2024 knows nothing about events in February 2024.
- Hallucination: When models lack knowledge, they often fabricate plausible-sounding but incorrect information rather than admitting ignorance.
- Source attribution: Parametric knowledge is distributed across billions of weights—there is no mechanism to cite where a specific fact came from.
- Domain specificity: Enterprise knowledge, proprietary documents, and niche domains are typically absent from training data.
- Update cost: Retraining or fine-tuning a model to incorporate new information is expensive and slow.
Consider a concrete example. If you ask a general-purpose LLM about your company's internal HR policies, it will either refuse to answer or, worse, generate a confident but entirely fabricated response. The information simply does not exist in its parameters.
Retrieval-Augmented Generation (RAG) addresses these limitations by augmenting the LLM's parametric knowledge with non-parametric knowledge retrieved from external sources at inference time. Rather than relying solely on what the model "memorized" during training, RAG systems retrieve relevant documents and present them to the model as context, enabling it to generate grounded, verifiable, and up-to-date responses.
The concept was formalized by Lewis et al. (2020) in their paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," but the underlying idea—combining retrieval with generation—has roots going back to information retrieval research from decades earlier. The idea of augmenting language generation with retrieved passages traces a lineage through open-domain question answering systems like DrQA (Chen et al., 2017), which paired a TF-IDF retriever with a neural reader, and through the REALM framework (Guu et al., 2020), which jointly pre-trained a retriever and a masked language model. What makes modern RAG systems so powerful is the convergence of three technologies: high-quality embedding models, efficient vector databases, and capable instruction-following LLMs.
Historically, information retrieval and natural language generation were treated as separate disciplines. Search engines returned ranked lists of documents; language models generated text from scratch. RAG unifies these two traditions by treating retrieval as a differentiable (or at least pipelined) component within a generation system. This unification is not merely a convenience—it is architecturally significant. By decoupling what the model knows (stored in retrieved documents) from how the model reasons (encoded in its parameters), RAG systems achieve a separation of concerns that makes each component independently improvable and auditable.
Why RAG Matters in Practice
RAG has become the dominant paradigm for building knowledge-grounded AI applications for several compelling reasons:
$$\text{RAG Advantage} = \underbrace{\text{LLM Reasoning}}_{\text{parametric}} + \underbrace{\text{Retrieved Context}}_{\text{non-parametric}}$$
- No retraining required: New documents can be indexed in minutes, not weeks.
- Verifiable outputs: Every generated answer can be traced back to source documents.
- Cost-effective: Retrieval is orders of magnitude cheaper than retraining.
- Access control: Document-level permissions can control what information is available.
- Freshness: The knowledge base can be updated continuously.
In this chapter, we will build RAG systems from the ground up, starting with embedding models and vector databases, progressing through document processing and retrieval strategies, and culminating in production-grade systems with advanced techniques like query transformation, hybrid search, and reranking.
31.2 The RAG Architecture
A RAG system consists of two fundamental phases: indexing (offline) and retrieval + generation (online). Understanding this architecture is essential before diving into individual components.
31.2.1 The Indexing Pipeline
The indexing pipeline processes raw documents into a searchable format:
Raw Documents → Parsing → Chunking → Embedding → Vector Database
- Document parsing: Extract text from PDFs, HTML, Word documents, Markdown files, etc.
- Chunking: Split documents into semantically meaningful segments of appropriate size.
- Embedding: Convert each chunk into a dense vector representation using an embedding model.
- Storage: Store embeddings and associated metadata in a vector database.
31.2.2 The Retrieval and Generation Pipeline
At query time, the system follows this flow:
User Query → Query Embedding → Vector Search → Retrieved Chunks → LLM Generation → Response
- Query processing: Optionally transform or expand the user's query.
- Embedding: Convert the query into the same vector space as the documents.
- Retrieval: Find the most similar document chunks using vector similarity search.
- Reranking (optional): Reorder retrieved chunks using a cross-encoder for better precision.
- Prompt construction: Assemble retrieved context and the query into a prompt.
- Generation: The LLM generates a response grounded in the retrieved context.
- Post-processing: Optionally verify, format, or cite the response.
31.2.3 Mathematical Foundation
The core of RAG relies on dense retrieval, where both queries and documents are mapped to a shared vector space. Given a query $q$ and a corpus of documents $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$, we compute:
$$\text{score}(q, d_i) = \text{sim}(E_q(q), E_d(d_i))$$
where $E_q$ and $E_d$ are encoder functions (often the same model), and $\text{sim}$ is typically cosine similarity:
$$\cos(q, d) = \frac{q \cdot d}{\|q\| \|d\|}$$
The top-$k$ documents are then retrieved:
$$\mathcal{R}_k(q) = \text{top-}k_{d \in \mathcal{D}} \; \text{score}(q, d)$$
These retrieved documents are concatenated with the query to form the input to the generator:
$$\text{response} = \text{LLM}(\text{prompt}(q, \mathcal{R}_k(q)))$$
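To make this concrete, here is a minimal sketch of dense top-$k$ retrieval with sentence-transformers and NumPy. The corpus, query, and model name are illustrative placeholders; embeddings are L2-normalized so the dot product equals cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "RAG combines retrieval with generation to ground LLM outputs.",
    "Vector databases store high-dimensional embeddings for similarity search.",
    "BM25 is a sparse retrieval algorithm based on term statistics.",
]
query = "How does retrieval-augmented generation work?"
# E_d(d_i): embed and normalize the corpus so dot product == cosine similarity
doc_embeddings = model.encode(corpus, normalize_embeddings=True)
# E_q(q): embed the query in the same vector space
query_embedding = model.encode(query, normalize_embeddings=True)
# score(q, d_i) = cos(q, d_i) for every document
scores = doc_embeddings @ query_embedding
# R_k(q): indices of the top-k documents by score
k = 2
top_k = np.argsort(-scores)[:k]
for idx in top_k:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")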
31.3 Embedding Models
Embedding models are the backbone of dense retrieval. They transform text into fixed-dimensional vectors such that semantically similar texts have similar vector representations.
31.3.1 How Text Embeddings Work
Modern text embedding models are typically based on transformer architectures. Given an input sequence of tokens $x_1, x_2, \ldots, x_n$, the model produces contextualized representations $h_1, h_2, \ldots, h_n$. These are then pooled into a single vector:
Mean pooling (most common): $$\mathbf{e} = \frac{1}{n} \sum_{i=1}^{n} h_i$$
CLS token pooling: $$\mathbf{e} = h_{\text{[CLS]}}$$
The resulting vector $\mathbf{e} \in \mathbb{R}^d$ (where $d$ is typically 384, 768, or 1024) represents the semantic content of the entire input.
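The two pooling strategies are easy to see in code. The sketch below runs a small transformer from the table in Section 31.3.3 and applies both mean pooling (masking out padding tokens) and CLS pooling; the model name is just an example.
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
texts = ["Dense retrieval maps text to vectors.", "Chunking splits documents into segments."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, d): h_1 ... h_n
# Mean pooling: average the token states, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, d)
# CLS pooling: take the first token's state
cls_pooled = hidden[:, 0]
print(mean_pooled.shape, cls_pooled.shape)  # torch.Size([2, 384]) for both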
31.3.2 Sentence Transformers
The sentence-transformers library provides the most accessible interface for working with embedding models. These models are trained using contrastive learning objectives that push similar texts closer together and dissimilar texts apart in the embedding space.
The intuition behind contrastive learning is geometric: we want the embedding space to be organized so that questions land near the passages that answer them, paraphrases land near each other, and unrelated texts are pushed far apart. The training process sculpts this geometry by repeatedly comparing pairs and triplets of texts.
The training objective for a pair $(q, d^+)$ of query and positive document, with in-batch negatives $\{d^-_j\}$, is the InfoNCE loss:
$$\mathcal{L} = -\log \frac{e^{\text{sim}(q, d^+) / \tau}}{\sum_{j} e^{\text{sim}(q, d_j) / \tau}}$$
where:
- $q$ is the query embedding
- $d^+$ is the positive (relevant) document embedding
- $d_j$ ranges over the positive and all in-batch negatives
- $\tau$ is a temperature parameter (typically 0.05-0.1) that controls how sharply the loss distinguishes between similar and dissimilar pairs
Worked example. Suppose we have a query $q$ = "How to train a neural network" with a positive document $d^+$ = "Neural network training involves forward and backward passes" and two negatives $d^-_1$ = "The weather forecast for Tuesday" and $d^-_2$ = "Cooking pasta recipes." If cosine similarities are $\text{sim}(q, d^+) = 0.85$, $\text{sim}(q, d^-_1) = 0.12$, $\text{sim}(q, d^-_2) = 0.08$, with $\tau = 0.07$, then the loss encourages the model to increase the similarity with the positive document relative to the negatives. The smaller $\tau$ is, the more aggressively the model focuses on hard negatives—pairs that are close in embedding space but should be distinguished.
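Plugging the worked example's similarities into the InfoNCE formula takes only a few lines; the numbers below are the illustrative ones above, not trained values.
import numpy as np
tau = 0.07
sims = np.array([0.85, 0.12, 0.08])  # sim(q, d+), sim(q, d1-), sim(q, d2-)
logits = sims / tau
# Softmax probability assigned to the positive document
p_pos = np.exp(logits[0]) / np.exp(logits).sum()
loss = -np.log(p_pos)
print(f"p(d+ | q) = {p_pos:.4f}, loss = {loss:.5f}")
# The positive already dominates (p close to 1), so the loss is near zero;
# harder negatives with similarities close to 0.85 would raise it sharply.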
Hard negative mining is critical for high-quality embeddings. Easy negatives (completely unrelated documents) provide little training signal because the model already distinguishes them. Hard negatives—documents that are superficially similar to the query but do not answer it—force the model to learn fine-grained distinctions. For example, for the query "What causes diabetes?", a hard negative might be "Diabetes treatment options," which shares many keywords but answers a different question.
31.3.3 Choosing an Embedding Model
The choice of embedding model significantly impacts RAG performance. Key considerations include:
| Model | Dimensions | Max Tokens | MTEB Score | Speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | 56.3 | Very Fast |
| all-mpnet-base-v2 | 768 | 384 | 57.8 | Fast |
| e5-large-v2 | 1024 | 512 | 62.2 | Medium |
| bge-large-en-v1.5 | 1024 | 512 | 64.2 | Medium |
| gte-large | 1024 | 512 | 63.1 | Medium |
| text-embedding-3-large | 3072 | 8191 | 64.6 | API |
Practical guidelines:
- For prototyping, all-MiniLM-L6-v2 offers the best speed-quality tradeoff.
- For production, bge-large-en-v1.5 or similar MTEB top-performers are recommended.
- For multilingual applications, use models like multilingual-e5-large.
- Always evaluate on your specific domain—MTEB benchmarks may not reflect your use case.
31.3.4 Embedding Model Selection: A Deeper Comparison
Choosing the right embedding model requires understanding the tradeoffs along several axes. Let us examine the most important considerations in more detail.
Dimensionality. Higher-dimensional embeddings can represent more nuanced semantic distinctions but require proportionally more storage and computation. A 384-dimensional embedding uses half the storage of a 768-dimensional one, and vector search is correspondingly faster. For many RAG applications, 384-768 dimensions provide sufficient expressiveness. The diminishing returns of higher dimensions become apparent when we consider that doubling dimensionality from 768 to 1536 typically improves MTEB scores by only 1-2 points while doubling storage costs.
Context window. The maximum number of tokens an embedding model can process determines the maximum chunk size. Models like all-MiniLM-L6-v2 with a 256-token limit force smaller chunks, while text-embedding-3-large with its 8191-token window allows embedding entire pages. However, longer context windows do not always mean better embeddings—the model must be trained on appropriately long texts to use the window effectively.
Instruction-tuned vs. vanilla. Some embedding models (like the E5 and BGE families) support instruction prefixes that tell the model how to embed the text. For example, BGE models use "Represent this sentence for searching relevant passages:" as a query prefix and "Represent this document for retrieval:" as a document prefix. These instructions help the model produce embeddings optimized for the asymmetric retrieval task (short query vs. long document). Always check whether your chosen model requires specific prefixes.
Matryoshka embeddings. A recent innovation, Matryoshka Representation Learning (MRL), trains embedding models such that the first $k$ dimensions of the embedding are themselves a valid (though lower-quality) embedding. This means you can truncate a 1024-dimensional embedding to 256 dimensions for fast initial filtering, then use the full 1024 dimensions for reranking. Models like nomic-embed-text-v1.5 and OpenAI's text-embedding-3 family support this capability.
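A short sketch of how Matryoshka-style truncation is typically used: keep the first $k$ dimensions for a cheap first pass, re-normalize, then rescore a shortlist with the full vectors. The random arrays stand in for embeddings from an MRL-trained model such as those named above; truncation only behaves well for models trained this way.
import numpy as np
def truncate_and_normalize(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10_000, 1024)).astype(np.float32)   # would come from an MRL model
query_vec = rng.normal(size=(1, 1024)).astype(np.float32)
# Stage 1: coarse filtering in 256 dimensions (4x less compute and memory)
coarse = truncate_and_normalize(doc_vecs, 256) @ truncate_and_normalize(query_vec, 256).T
candidates = np.argsort(-coarse[:, 0])[:100]
# Stage 2: rescore the shortlist with the full 1024 dimensions
full = truncate_and_normalize(doc_vecs[candidates], 1024) @ truncate_and_normalize(query_vec, 1024).T
final = candidates[np.argsort(-full[:, 0])[:10]]
print(final)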
31.3.5 Fine-Tuning Embedding Models
When off-the-shelf embedding models underperform on your domain, fine-tuning can dramatically improve retrieval quality. The key challenge is obtaining training data: pairs (or triplets) of queries and relevant/irrelevant documents.
Strategies for generating training data:
1. Mining from logs: Use user queries and clicked documents.
2. LLM-generated queries: Given a document, ask an LLM to generate questions it answers. This is remarkably effective—prompt a model with "Generate 5 questions that this passage answers:" and you obtain a large, diverse training set at low cost.
3. Hard negative mining: Use the current retriever to find near-misses as negatives.
4. Cross-encoder distillation: Use a cross-encoder to score query-document pairs and distill this signal into the bi-encoder during fine-tuning.
Fine-tuning typically requires as few as 1,000-10,000 training pairs to produce meaningful improvements in domain-specific retrieval. The sentence-transformers library makes this straightforward:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [
InputExample(texts=["What is our refund policy?",
"Customers may request a full refund within 30 days of purchase."]),
InputExample(texts=["How do I reset my password?",
"Navigate to Settings > Account > Reset Password."]),
# ... more domain-specific pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
)
A practical tip: always hold out a test set of query-document pairs and measure Recall@5 and MRR before and after fine-tuning. Improvements of 10-30% in recall are common for domain-specific corpora.
31.4 Vector Databases
Vector databases are specialized storage systems optimized for storing, indexing, and querying high-dimensional vectors. They are the infrastructure that makes RAG systems scalable.
31.4.1 Approximate Nearest Neighbor (ANN) Search
Exact nearest neighbor search in high dimensions is computationally prohibitive—it requires comparing the query vector against every stored vector. For a database of $N$ vectors in $d$ dimensions, this takes $O(Nd)$ time. ANN algorithms trade a small amount of accuracy for dramatically faster search.
Hierarchical Navigable Small World (HNSW) graphs are the most popular ANN algorithm. They construct a multi-layer graph where:
- The bottom layer contains all vectors.
- Each higher layer contains a subset of vectors from the layer below.
- Search starts at the top layer and greedily navigates to the nearest neighbor, then descends.
The search complexity is approximately $O(\log N)$, making it feasible to search billions of vectors in milliseconds.
Inverted File Index (IVF) partitions the vector space into Voronoi cells using k-means clustering. At query time, only vectors in the nearest cells are compared:
- Training: Cluster $N$ vectors into $k$ centroids.
- Query: Find the nearest $n_{\text{probe}}$ centroids, then search only those partitions.
Product Quantization (PQ) compresses vectors by splitting them into subvectors and quantizing each independently, dramatically reducing memory requirements.
31.4.2 FAISS
Facebook AI Similarity Search (FAISS) is a highly optimized library for dense vector similarity search. It supports CPU and GPU execution and offers a rich set of index types.
Key FAISS index types:
- IndexFlatL2 / IndexFlatIP: Exact search (brute force). Best for small datasets.
- IndexIVFFlat: IVF with flat storage. Good balance of speed and accuracy.
- IndexIVFPQ: IVF with product quantization. Best for large datasets with memory constraints.
- IndexHNSWFlat: HNSW graph. Excellent recall with fast search.
Choosing an index depends on your requirements:
$$\text{Index Choice} = f(\text{dataset size}, \text{memory budget}, \text{latency requirement}, \text{recall target})$$
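The sketch below builds two of these index types on synthetic data: exact inner-product search for small corpora, and an IVF index with a trained coarse quantizer for larger ones. Vectors are normalized so inner product equals cosine similarity; values like nlist=1024 and nprobe=16 are illustrative starting points, not tuned settings.
import faiss
import numpy as np
d = 768                                                 # embedding dimensionality
rng = np.random.default_rng(0)
xb = rng.normal(size=(100_000, d)).astype(np.float32)   # document embeddings
xq = rng.normal(size=(5, d)).astype(np.float32)         # query embeddings
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)
# Exact search: IndexFlatIP (inner product == cosine on normalized vectors)
flat = faiss.IndexFlatIP(d)
flat.add(xb)
scores, ids = flat.search(xq, 5)
# Approximate search: IVF with 1024 Voronoi cells
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
ivf.train(xb)        # learn the k-means centroids
ivf.add(xb)
ivf.nprobe = 16      # number of cells to visit per query (recall/speed knob)
scores_ivf, ids_ivf = ivf.search(xq, 5)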
31.4.3 ChromaDB
ChromaDB is an open-source embedding database designed specifically for AI applications. It provides a higher-level abstraction than FAISS, with built-in support for:
- Persistent and in-memory storage
- Metadata filtering
- Document storage alongside embeddings
- Automatic embedding generation
- Simple Python and JavaScript APIs
ChromaDB is particularly well-suited for RAG applications because it stores documents, embeddings, and metadata together, eliminating the need to manage these separately.
import chromadb
from chromadb.utils import embedding_functions
# Initialize ChromaDB with persistence
client = chromadb.PersistentClient(path="./chroma_db")
# Use a sentence-transformers embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="BAAI/bge-base-en-v1.5"
)
# Create a collection
collection = client.get_or_create_collection(
name="technical_docs",
embedding_function=ef,
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
# Add documents with metadata
collection.add(
documents=["RAG combines retrieval with generation...",
"Vector databases store high-dimensional embeddings..."],
metadatas=[{"source": "ch31", "topic": "rag"},
{"source": "ch31", "topic": "vector_db"}],
ids=["doc1", "doc2"]
)
# Query with metadata filtering
results = collection.query(
query_texts=["How does retrieval-augmented generation work?"],
n_results=5,
where={"topic": "rag"} # Filter by metadata
)
31.4.4 Vector Database Comparison
The vector database landscape is rapidly evolving. Let us compare the major options across dimensions that matter for production RAG systems:
| Database | Type | ANN Algorithm | Filtering | Hybrid Search | Scalability | Best For |
|---|---|---|---|---|---|---|
| FAISS | Library | IVF, HNSW, PQ | Limited | No | Billion-scale | Research, embedded use |
| ChromaDB | Embedded DB | HNSW | Metadata | No | Million-scale | Prototyping, small-medium apps |
| Pinecone | Managed SaaS | Proprietary | Rich metadata | Yes | Billion-scale | Production, zero-ops |
| Weaviate | Self-hosted/Cloud | HNSW | GraphQL | Yes (BM25 + dense) | Billion-scale | Hybrid search applications |
| Qdrant | Self-hosted/Cloud | HNSW | Rich payload | Sparse vectors | Billion-scale | Complex filtering needs |
| Milvus | Self-hosted | IVF, HNSW, DiskANN | Attribute | Yes | Multi-billion | Large-scale enterprise |
| pgvector | PostgreSQL ext. | IVFFlat, HNSW | Full SQL | Via tsvector | Million-scale | Existing PostgreSQL infra |
Pinecone offers a fully managed experience: you never manage infrastructure, indexes automatically scale, and the API is simple. The tradeoff is vendor lock-in and cost. Pinecone is an excellent choice when you need to move fast and your organization prefers managed services.
Weaviate stands out for its native hybrid search capability, combining BM25 and dense retrieval in a single query. It also supports multi-tenancy, making it suitable for SaaS applications where each customer's data must be isolated. The GraphQL API is powerful but has a learning curve.
Qdrant excels at complex filtering scenarios. Its payload indexing system allows you to build secondary indexes on metadata fields, enabling fast filtered queries like "find the most similar documents written after 2023 by author X in category Y." This makes it particularly useful for enterprise applications with complex access control requirements.
pgvector deserves special mention for teams already using PostgreSQL. Rather than introducing a new database, pgvector adds vector similarity search to your existing database. You get ACID transactions, familiar SQL queries, and the ability to join vector search results with relational data. The tradeoff is performance—pgvector is significantly slower than dedicated vector databases at scale, though the HNSW index (added in pgvector 0.5.0) has closed much of the gap.
For most projects, the choice between these is less important than proper chunking, embedding selection, and retrieval strategy. Start with ChromaDB for prototyping, then migrate to a production database when scaling requirements demand it.
31.5 Document Processing and Chunking
The quality of a RAG system is only as good as its document processing pipeline. Poor chunking can fragment important context, while overly large chunks dilute relevance and waste context window tokens.
31.5.1 Document Parsing
Before chunking, documents must be parsed from their native formats into clean text:
- PDF: Use pymupdf, pdfplumber, or unstructured for complex layouts.
- HTML: Use beautifulsoup4 with careful handling of navigation, ads, and boilerplate.
- Word/PowerPoint: Use python-docx and python-pptx.
- Markdown: Parse with structure-awareness to preserve headers and sections.
- Code: Use language-specific parsers (e.g., tree-sitter) to respect code structure.
31.5.2 Chunking Strategies
Chunking is the process of dividing documents into segments suitable for embedding and retrieval. The optimal strategy depends on the document type and use case.
Fixed-size chunking splits text into chunks of a fixed number of characters or tokens, with optional overlap:
Chunk 1: tokens[0:512]
Chunk 2: tokens[448:960] (with 64-token overlap)
Chunk 3: tokens[896:1408]
...
Overlap helps ensure that information at chunk boundaries is not lost. A common configuration is 512 tokens with 50-100 token overlap.
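A minimal, tokenizer-agnostic sketch of fixed-size chunking with overlap, matching the stride pattern shown above. Here "tokens" are just whitespace-split words for illustration; a production pipeline would use the embedding model's own tokenizer.
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into overlapping windows of `chunk_size` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the end of the document
    return chunks
text = "..."            # document text goes here
tokens = text.split()   # crude whitespace "tokenization" for illustration
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(f"{len(tokens)} tokens -> {len(chunks)} chunks")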
Recursive character splitting attempts to split text along natural boundaries (paragraphs, sentences, words) while respecting a maximum chunk size. The langchain library's RecursiveCharacterTextSplitter implements this with a hierarchy of separators: ["\n\n", "\n", ". ", " ", ""].
Semantic chunking uses embedding similarity to identify natural breakpoints. The intuition is that when a document shifts topics, consecutive sentences will have dissimilar embeddings. We can detect these topic shifts automatically:
$$\text{split at position } i \text{ if } \cos(\mathbf{e}_{s_i}, \mathbf{e}_{s_{i+1}}) < \theta$$
where:
- $\mathbf{e}_{s_i}$ is the embedding of sentence $i$
- $\mathbf{e}_{s_{i+1}}$ is the embedding of sentence $i+1$
- $\theta$ is a similarity threshold (typically 0.5-0.75, tuned per domain)
Worked example. Consider a document with five sentences. We compute pairwise cosine similarities between consecutive sentences: $[0.89, 0.91, 0.42, 0.87]$. With $\theta = 0.6$, we split between sentences 3 and 4 (similarity 0.42), producing two chunks. The sharp drop in similarity signals a topic change—perhaps the document moved from describing a problem to discussing a solution.
A more robust variant uses a sliding window of $w$ sentences and computes the average similarity within the window, splitting where the similarity drops below a percentile threshold rather than an absolute value. This adapts to documents with varying levels of inter-sentence similarity.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
semantic_splitter = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75, # Split at 75th percentile drops
)
chunks = semantic_splitter.split_text(document_text)
Document-structure-aware chunking respects the inherent structure of documents:
- Split by headers/sections in Markdown or HTML.
- Split by paragraphs in prose.
- Split by function/class boundaries in code.
- Split by slide boundaries in presentations.
31.5.3 Chunk Size Optimization
The optimal chunk size involves a fundamental tradeoff:
| Smaller Chunks | Larger Chunks |
|---|---|
| More precise retrieval | More context per chunk |
| Higher recall (find exact passages) | Lower noise from irrelevant sentences |
| More chunks to process | Fewer chunks to process |
| Risk losing context | Risk diluting relevance |
Empirically, chunk sizes between 256 and 1024 tokens work well for most applications, with 512 tokens being a popular default. However, this should be tuned for your specific use case.
31.5.4 Metadata Enrichment
Each chunk should be stored with metadata that enables filtering and improves retrieval:
metadata = {
"source": "technical_manual_v3.pdf",
"page": 42,
"section": "Troubleshooting",
"subsection": "Network Connectivity",
"chunk_index": 7,
"total_chunks": 23,
"date_ingested": "2024-03-15",
"document_type": "manual",
"version": "3.0"
}
Metadata enables powerful filtering at query time—for example, restricting search to a specific document version or section.
31.6 Retrieval Strategies
Retrieval is the critical bridge between the user's query and the knowledge base. The quality of retrieved documents directly determines the quality of the generated response.
31.6.1 Dense Retrieval
Dense retrieval uses learned embeddings to represent both queries and documents as dense vectors. Similarity is computed in the embedding space:
$$\text{score}_{\text{dense}}(q, d) = \cos(E(q), E(d))$$
Strengths: Captures semantic similarity, handles paraphrases and synonyms well. Weaknesses: Can miss exact keyword matches, requires quality embedding models, computationally expensive to index.
31.6.2 Sparse Retrieval (BM25)
BM25 is a classical information retrieval algorithm based on term frequency and inverse document frequency:
$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$$
where:
- $f(t, d)$ is the term frequency of term $t$ in document $d$
- $\text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}$, with $N$ the number of documents and $n(t)$ the number of documents containing $t$
- $k_1$ (typically 1.2-2.0) controls term frequency saturation
- $b$ (typically 0.75) controls document length normalization
- $\text{avgdl}$ is the average document length
Strengths: Excellent for exact keyword matching, no training required, interpretable. Weaknesses: Misses semantic similarity, no understanding of synonyms or paraphrases.
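A short sketch of BM25 scoring with the rank_bm25 package, using whitespace tokenization for brevity; a real system would also lowercase consistently, strip punctuation, and possibly stem.
from rank_bm25 import BM25Okapi
corpus = [
    "Aspirin side effects include GI bleeding, nausea, and tinnitus.",
    "Vector databases store high-dimensional embeddings.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)  # library defaults: k1=1.5, b=0.75
query = "what are the side effects of aspirin"
scores = bm25.get_scores(query.lower().split())
best = scores.argmax()
print(scores, "->", corpus[best])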
31.6.3 Hybrid Search
Hybrid search combines dense and sparse retrieval to capture both semantic similarity and keyword matching. This is often the best approach in practice because dense and sparse retrieval have complementary failure modes: dense retrieval misses exact keyword matches (e.g., searching for a specific error code "ERR_0x4F2A"), while sparse retrieval misses semantic paraphrases (e.g., "how to fix memory issues" should match a document about "addressing RAM overflow problems").
The combination can be done through several strategies:
Reciprocal Rank Fusion (RRF) is the most popular method because it requires no score normalization. The intuition is simple: if a document ranks highly in multiple retrieval systems, it is probably relevant. We sum the reciprocal ranks across all rankers:
$$\text{RRF}(d) = \sum_{r \in \mathcal{R}} \frac{1}{k + \text{rank}_r(d)}$$
where:
- $\mathcal{R}$ is the set of rankers (e.g., dense retriever and BM25)
- $\text{rank}_r(d)$ is the rank of document $d$ in ranker $r$ (starting from 1)
- $k$ is a smoothing constant (typically 60) that dampens the influence of the very top ranks, so no single ranker dominates the fused score
Worked example. Suppose document $d_A$ ranks #2 in dense retrieval and #5 in BM25, while document $d_B$ ranks #1 in dense and #50 in BM25. With $k = 60$: $\text{RRF}(d_A) = \frac{1}{62} + \frac{1}{65} = 0.0315$, and $\text{RRF}(d_B) = \frac{1}{61} + \frac{1}{110} = 0.0255$. Despite ranking #1 in dense search, $d_B$ scores lower overall because BM25 ranked it poorly—it may lack the specific keywords the user queried. Document $d_A$ wins because it is strong in both systems.
Linear combination normalizes scores and computes a weighted sum: $$\text{score}_{\text{hybrid}}(q, d) = \alpha \cdot \text{score}_{\text{dense}}(q, d) + (1 - \alpha) \cdot \text{score}_{\text{sparse}}(q, d)$$
where $\alpha$ is a tunable weight (typically 0.5-0.7 favoring dense retrieval). This requires normalizing scores to a common range (e.g., [0, 1]) because dense cosine similarities and BM25 scores are on different scales.
Practical tip: Start with RRF ($k = 60$) as your default hybrid strategy. It is parameter-free (no need to tune $\alpha$), robust to score distribution differences, and works well across a wide range of domains. Only switch to linear combination if you have a validation set to tune $\alpha$ on.
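Reciprocal Rank Fusion is simple enough to implement directly. The sketch below fuses two ranked lists of document IDs exactly as in the worked example above; the IDs and rankings are illustrative.
from collections import defaultdict
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse multiple ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
dense_ranking = ["d_B", "d_A", "d_C"]                  # d_A is #2, d_B is #1 in dense search
sparse_ranking = ["d_C", "d_D", "d_E", "d_F", "d_A"]   # d_A is #5 in BM25; d_B far down the list
fused = rrf_fuse([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# d_A gets 1/62 + 1/65 (about 0.0315), beating documents that are strong in only one list.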
31.6.4 Reranking
Reranking uses a more powerful model to reorder the initial retrieval results. While retrievers use bi-encoders (independent encoding of query and document), rerankers use cross-encoders that jointly process the query-document pair:
$$\text{score}_{\text{rerank}}(q, d) = \text{CrossEncoder}([q; \text{SEP}; d])$$
where:
- $q$ is the query text
- $d$ is the document text
- $[q; \text{SEP}; d]$ is the concatenation of query and document with a separator token
- The output is a single relevance score (typically a logit or probability)
The intuition behind why cross-encoders outperform bi-encoders is simple: when encoding query and document independently, the bi-encoder cannot model interactions between them. A cross-encoder sees both simultaneously and can attend from query tokens to document tokens and vice versa, enabling fine-grained matching. For example, a bi-encoder might match "jaguar speed" to both a document about the animal and one about the car, while a cross-encoder can use contextual clues in the document to determine which meaning is relevant.
The typical pipeline is:
1. Retrieve top-100 documents using dense/sparse/hybrid search (fast, approximate).
2. Rerank the top-100 using a cross-encoder to get the top-5 (slow, precise).
This two-stage design is necessary because cross-encoders must process each query-document pair individually. For 100 candidates, this means 100 forward passes. For a million documents, it would mean a million forward passes—completely impractical for first-stage retrieval.
Popular reranking models include:
- cross-encoder/ms-marco-MiniLM-L-12-v2 — Fast, good for prototyping.
- BAAI/bge-reranker-v2-m3 — Strong multilingual reranker.
- BAAI/bge-reranker-large — High accuracy, moderate speed.
- Cohere Rerank API — Managed service, excellent quality, no infrastructure.
- Jina Reranker — Open-weight reranker with strong performance.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)
# Rerank retrieved documents
query = "How does photosynthesis work?"
documents = [doc.page_content for doc in retrieved_docs]
# Score each query-document pair
pairs = [[query, doc] for doc in documents]
scores = reranker.predict(pairs)
# Sort by score (descending)
reranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
top_5 = [doc for _, doc in reranked[:5]]
The reranking payoff. In practice, adding a reranker to a RAG pipeline typically improves answer quality by 5-15% as measured by faithfulness and answer relevance scores (as we will discuss in Section 31.9). The latency cost is 50-200ms for reranking 100 documents, which is acceptable for most applications.
31.6.5 Multi-Vector Retrieval (ColBERT)
ColBERT (Contextualized Late Interaction over BERT) represents documents as sets of token-level embeddings rather than a single vector. Similarity is computed via late interaction:
$$\text{score}(q, d) = \sum_{i \in q} \max_{j \in d} \; q_i^\top d_j$$
This "MaxSim" operation allows fine-grained matching while maintaining the efficiency of pre-computing document embeddings. ColBERT achieves near-cross-encoder accuracy with bi-encoder-like efficiency.
31.7 Query Transformation
The user's raw query is often suboptimal for retrieval. Query transformation techniques improve retrieval quality by reformulating queries before they reach the retriever.
31.7.1 Query Expansion
Query expansion enriches the original query with additional terms or alternative phrasings to improve recall:
Original: "How do I fix the memory leak?"
Expanded: "How do I fix the memory leak? memory management garbage collection
heap overflow resource disposal"
An LLM can generate expansions by being prompted: "Generate search queries that would help find information to answer this question."
31.7.2 Hypothetical Document Embedding (HyDE)
HyDE (Gao et al., 2022) is an elegant technique where the LLM generates a hypothetical answer to the query, and this hypothetical answer is embedded and used for retrieval instead of the original query.
The intuition is that a hypothetical answer, even if inaccurate, is more likely to be in the same embedding space neighborhood as actual relevant documents than the short query alone.
$$q_{\text{HyDE}} = E(\text{LLM}(\text{"Answer this question: "} + q))$$
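A sketch of HyDE layered on a retrieval pipeline: draft a hypothetical answer with any chat LLM, then search with the embedding of that draft instead of the raw query. The llm and vectorstore arguments are assumed to be a LangChain chat model and vector store (like those built in Section 31.13), and the prompt wording is illustrative.
from langchain_core.prompts import ChatPromptTemplate
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that answers the question.\n\nQuestion: {question}\n\nPassage:"
)
def hyde_retrieve(question: str, llm, vectorstore, k: int = 5):
    """Retrieve using the embedding of a hypothetical answer instead of the raw query."""
    # 1. The LLM drafts a hypothetical answer (imperfect is fine; it just needs to
    #    land near relevant documents in embedding space)
    hypothetical = (hyde_prompt | llm).invoke({"question": question}).content
    # 2. Embed the hypothetical answer and search with it
    return vectorstore.similarity_search(hypothetical, k=k)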
31.7.3 Step-Back Prompting
For specific questions, a more general query may retrieve more relevant context:
Original: "What is the thermal conductivity of copper at 200K?"
Step-back: "What are the thermal properties of copper?"
The LLM generates a more abstract version of the query, retrieves broader context, and then uses it to answer the specific question.
31.7.4 Multi-Query Retrieval
Generate multiple perspectives on the same question, retrieve documents for each, and combine the results:
Original: "What are the benefits of microservices?"
Query 1: "advantages of microservice architecture"
Query 2: "why use microservices instead of monolith"
Query 3: "microservices scalability and maintainability"
This increases recall by covering different aspects of the information need.
31.8 Advanced RAG Patterns
Beyond basic retrieve-and-generate, several advanced patterns significantly improve RAG quality.
31.8.1 Multi-Step Retrieval (Iterative RAG)
For complex questions, a single retrieval step may not suffice. Multi-step RAG decomposes the question and retrieves information iteratively:
- Decompose the complex question into sub-questions.
- Retrieve and answer each sub-question.
- Use intermediate answers to inform subsequent retrieval.
- Synthesize a final answer from all gathered information.
This is particularly valuable for questions requiring information synthesis across multiple documents or topics.
31.8.2 Self-RAG (Self-Reflective RAG)
Self-RAG (Asai et al., 2023) trains the LLM to decide when to retrieve, evaluate retrieved passages, and critique its own generations:
- Retrieve on demand: The model decides whether retrieval is needed for each segment.
- Relevance evaluation: The model assesses whether retrieved passages are relevant.
- Support evaluation: The model checks if the generation is supported by the passages.
- Utility evaluation: The model rates the overall usefulness of the response.
31.8.3 Corrective RAG (CRAG)
CRAG adds a self-correction mechanism to the retrieval process:
- Retrieve documents for the query.
- Evaluate the quality of retrieved documents.
- If documents are correct: proceed with generation.
- If documents are ambiguous: refine the query and re-retrieve.
- If documents are incorrect: fall back to web search or other sources.
31.8.4 Graph RAG
For knowledge that has complex relational structure, Graph RAG combines traditional RAG with knowledge graphs:
- Build a knowledge graph from the document corpus using entity extraction and relationship identification.
- Cluster entities into communities using graph algorithms (e.g., Leiden community detection).
- Generate community summaries that capture the high-level themes of each cluster.
- At query time, retrieve relevant graph substructures and community summaries.
- Use graph context alongside text chunks for generation.
Microsoft's GraphRAG (Edge et al., 2024) builds community summaries of entity clusters, enabling the system to answer global questions that span the entire corpus—something traditional RAG struggles with. Consider the question "What are the major themes in this collection of research papers?" A traditional RAG system, which retrieves individual chunks, cannot synthesize a corpus-wide answer because no single chunk contains this information. Graph RAG can answer it because the community summaries capture cross-document themes.
The Graph RAG indexing pipeline operates in several stages:
Documents → Entity Extraction (LLM) → Relationship Extraction (LLM)
→ Knowledge Graph Construction → Community Detection (Leiden)
→ Community Summarization (LLM) → Index
When to use Graph RAG. Graph RAG is most valuable when your corpus has rich relational structure (e.g., characters in novels, entities in business documents, concepts in scientific papers) and when users ask global or comparative questions. For simple factual QA, traditional RAG is faster, cheaper, and usually sufficient. Graph RAG's main drawback is its indexing cost—it requires multiple LLM calls per document for entity and relationship extraction, making it significantly more expensive to build than a standard vector index.
31.8.5 Multi-Hop RAG
Some questions require synthesizing information from multiple documents in a chain of reasoning. For example: "What university did the inventor of the transformer architecture attend?" requires first finding that Ashish Vaswani is one of the key inventors, then finding which university he attended.
Multi-hop RAG addresses this through iterative retrieval:
- Parse the question to identify the first information need ("Who invented the transformer?").
- Retrieve documents for the first sub-question.
- Extract the intermediate answer ("Ashish Vaswani et al.").
- Reformulate the query with the new information ("What university did Ashish Vaswani attend?").
- Retrieve documents for the refined query.
- Synthesize a final answer from all retrieved evidence.
The key challenge is knowing when to stop. A practical approach is to set a maximum number of hops (typically 2-3) and use the LLM to determine whether sufficient information has been gathered at each step. As we will explore in Chapter 32, this iterative retrieval pattern closely resembles the agent reasoning loop.
31.8.6 Agentic RAG
Agentic RAG uses an LLM agent to orchestrate the retrieval process dynamically:
- The agent decides which retrieval strategy to use.
- It can query multiple knowledge bases.
- It can refine queries based on initial results.
- It determines when sufficient information has been gathered.
This provides maximum flexibility but introduces complexity and latency.
31.9 RAG Evaluation
Evaluating RAG systems is challenging because quality depends on multiple interacting components. A comprehensive evaluation framework must assess each component independently and the system as a whole.
31.9.1 Retrieval Evaluation Metrics
Recall@k: The fraction of relevant documents that appear in the top-$k$ results: $$\text{Recall@}k = \frac{|\text{Relevant} \cap \text{Retrieved@}k|}{|\text{Relevant}|}$$
Precision@k: The fraction of top-$k$ results that are relevant: $$\text{Precision@}k = \frac{|\text{Relevant} \cap \text{Retrieved@}k|}{k}$$
Mean Reciprocal Rank (MRR): The average reciprocal of the rank of the first relevant document: $$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$
Normalized Discounted Cumulative Gain (nDCG): Measures ranking quality with graded relevance: $$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \quad \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$
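Recall@k and MRR are easy to compute directly from ranked document IDs. The sketch below assumes each query comes with a set of known relevant IDs; the example data is illustrative.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0
def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts toward MRR
    return total / len(all_retrieved)
# Two evaluation queries with their ranked results and gold relevance labels
retrieved = [["d3", "d7", "d1"], ["d2", "d9", "d4"]]
relevant = [{"d1", "d5"}, {"d9"}]
print(recall_at_k(retrieved[0], relevant[0], k=3))   # 0.5 (found d1, missed d5)
print(mrr(retrieved, relevant))                      # (1/3 + 1/2) / 2, about 0.417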
31.9.2 Generation Evaluation Metrics
Faithfulness (groundedness): Does the response only contain information supported by the retrieved context?
$$\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total number of claims in response}}$$
Answer relevance: Does the response actually address the user's question?
Context relevance: Are the retrieved documents relevant to the query?
Answer correctness: Is the response factually correct (compared to a ground truth)?
31.9.3 Automated Evaluation with LLMs
LLM-as-judge evaluation has become the standard approach for RAG assessment. Frameworks like RAGAS (Retrieval-Augmented Generation Assessment) use LLMs to compute:
- Faithfulness score: Extract claims from the answer, verify each against the context. The LLM breaks the response into individual factual statements, then checks whether each is supported by the retrieved documents. The score is the fraction of supported claims.
- Answer relevance score: Generate questions from the answer, compare similarity to the original query. If the generated questions are semantically similar to the original, the answer is relevant. This cleverly reverses the generation process to measure relevance without needing ground-truth answers.
- Context precision: Evaluate whether relevant items are ranked higher in the retrieval results. This measures the quality of retrieval ranking—not just whether the right documents were retrieved, but whether they were retrieved first.
- Context recall: Check if the ground truth can be attributed to the retrieved context. This measures whether the retrieval step found all necessary information.
Worked example with RAGAS. Suppose a user asks "What are the side effects of aspirin?" The RAG system retrieves three documents and generates: "Aspirin may cause stomach bleeding, nausea, and ringing in the ears." RAGAS evaluation proceeds as follows:
- Faithfulness: Extract claims: (1) "Aspirin may cause stomach bleeding," (2) "Aspirin may cause nausea," (3) "Aspirin may cause ringing in the ears." Check each against retrieved context. If all three are supported: faithfulness = 3/3 = 1.0.
- Answer relevance: Generate questions from the answer: "What are the side effects of aspirin?" Compare to original query—high similarity = high relevance.
- Context precision: If document 1 (most relevant) was ranked first, precision is high.
- Context recall: If the ground truth lists 4 side effects and the context covers 3, recall = 0.75.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
eval_data = Dataset.from_dict({
"question": ["What are the side effects of aspirin?"],
"answer": ["Aspirin may cause stomach bleeding, nausea, and ringing in the ears."],
"contexts": [["Aspirin side effects include GI bleeding, nausea, and tinnitus..."]],
"ground_truth": ["Common side effects include stomach bleeding, nausea, tinnitus, and allergic reactions."]
})
results = evaluate(
eval_data,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
31.9.4 Building an Evaluation Dataset
A robust evaluation dataset should include:
- Diverse queries: Different types (factual, analytical, comparative, multi-hop).
- Ground truth answers: Human-written reference answers.
- Relevant documents: Annotated relevant passages for each query.
- Edge cases: Queries that should be refused, out-of-scope queries, ambiguous queries.
A minimum of 50-100 evaluation examples is recommended, with 200+ for production systems.
31.10 Production RAG Considerations
Moving from a prototype to a production RAG system introduces numerous engineering challenges.
31.10.1 Scalability
Indexing throughput: Processing millions of documents requires parallel embedding computation, batch processing, and incremental updates.
Query latency: End-to-end latency should be under 2-3 seconds for a good user experience. This budget must be split across:
- Query embedding: 10-50ms
- Vector search: 5-50ms
- Reranking (optional): 50-200ms
- LLM generation: 500-2000ms
Storage: A collection of 10 million 768-dimensional float32 vectors requires approximately 30 GB of storage. Quantization can reduce this by 4-8x.
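The storage figure follows directly from vector count × dimensionality × bytes per value; a quick back-of-the-envelope sketch, also showing the effect of int8 scalar quantization.
n_vectors = 10_000_000
dims = 768
float32_gb = n_vectors * dims * 4 / 1e9   # 4 bytes per float32 value
int8_gb = n_vectors * dims * 1 / 1e9      # 1 byte per value after scalar quantization
print(f"float32: {float32_gb:.1f} GB, int8: {int8_gb:.1f} GB")  # about 30.7 GB vs 7.7 GB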
31.10.2 Document Lifecycle Management
Production systems must handle:
- Incremental updates: Adding new documents without re-indexing everything.
- Document deletion: Removing documents and all their associated chunks.
- Version control: Managing multiple versions of the same document.
- Staleness detection: Identifying and refreshing outdated content.
31.10.3 Access Control
In enterprise settings, not all users should see all documents. Implementing access control in RAG requires:
- Storing permission metadata with each chunk.
- Filtering search results based on the requesting user's permissions.
- Ensuring no information leakage through chunk overlap or metadata.
31.10.4 Monitoring and Observability
Key metrics to monitor in production:
- Retrieval hit rate: Are queries finding relevant documents?
- Answer quality: User feedback, thumbs up/down.
- Latency percentiles: p50, p95, p99 latency for each component.
- Token usage: Cost of embedding and generation API calls.
- Failure rate: Queries that result in errors or empty retrievals.
31.10.5 Cost Optimization
RAG costs are dominated by:
1. Embedding computation: One-time cost during indexing, per-query cost at inference.
2. Vector storage: Ongoing cost proportional to corpus size.
3. LLM generation: Per-query cost proportional to context size.
Strategies to reduce cost:
- Cache frequent queries and their results.
- Use smaller embedding models where quality is sufficient.
- Reduce the number of retrieved chunks (5 instead of 10).
- Use shorter, more focused prompts.
- Implement tiered retrieval (cheap first-pass, expensive reranking only when needed).
31.10.6 Production RAG Pipeline Architecture
A production RAG system goes beyond the simple retrieve-and-generate loop. Here is a reference architecture that incorporates the best practices we have discussed:
┌──────────────────┐
│ User Request │
└────────┬─────────┘
▼
┌──────────────────┐
│ Query Router │ ── Classify intent, detect out-of-scope
└────────┬─────────┘
▼
┌─────────────┴──────────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Query │ │ Cache │
│ Transformer │ │ Lookup │
│ (HyDE, │ │ │
│ multi-query)│ └──────┬───────┘
└──────┬───────┘ Hit? │
▼ ▼ ▼
┌──────────────┐ Yes: Return cached
│ Hybrid │ No: Continue ──────┐
│ Retriever │ │
│ (Dense+BM25) │◄────────────────────────────┘
└──────┬───────┘
▼
┌──────────────┐
│ Cross-Encoder│
│ Reranker │
└──────┬───────┘
▼
┌──────────────┐
│ Context │ ── Trim to budget, add metadata
│ Assembly │
└──────┬───────┘
▼
┌──────────────┐
│ LLM │ ── Generate with grounding instructions
│ Generation │
└──────┬───────┘
▼
┌──────────────┐
│ Post-Process │ ── Citations, confidence, guardrails
└──────┬───────┘
▼
┌──────────────┐
│ Response + │
│ Logging │ ── Log for evaluation and monitoring
└──────────────┘
Each component in this architecture is independently configurable and replaceable. The query router can be a simple classifier or an LLM call. The cache can be a semantic cache (using embedding similarity to match similar queries) or an exact-match cache. The retriever can be swapped between dense-only, sparse-only, or hybrid. This modularity is one of RAG's greatest engineering strengths.
31.10.7 Prompt Engineering for RAG
The prompt template significantly impacts RAG quality. Key principles:
- Explicit grounding instruction: Tell the model to answer based only on the provided context.
- Citation instruction: Ask the model to cite which documents support each claim.
- Refusal instruction: Tell the model to say "I don't know" when the context doesn't contain the answer.
- Format instruction: Specify the desired output format.
A production-quality RAG prompt:
You are a helpful assistant that answers questions based on the provided context.
Context:
{retrieved_documents}
Instructions:
- Answer the question based ONLY on the information in the context above.
- If the context does not contain enough information to answer, say "I don't have
enough information to answer this question."
- Cite your sources using [Source: document_name] format.
- Be concise and specific.
Question: {user_question}
Answer:
31.11 Handling Edge Cases and Failure Modes
RAG systems must gracefully handle numerous failure modes:
31.11.1 No Relevant Documents Found
When retrieval returns no sufficiently relevant documents (all similarity scores below a threshold), the system should:
- Acknowledge the limitation rather than hallucinate.
- Suggest reformulating the query.
- Optionally fall back to the LLM's parametric knowledge with a disclaimer.
31.11.2 Contradictory Information
When retrieved documents contain conflicting information:
- Present both perspectives with their sources.
- Prioritize more recent or authoritative sources.
- Let the user decide which information to trust.
31.11.3 Multi-Hop Questions
Questions requiring information from multiple documents:
- Use multi-step retrieval to gather all necessary information.
- Implement chain-of-thought prompting to reason over multiple sources.
31.11.4 Out-of-Scope Questions
Questions that fall outside the knowledge base:
- Implement intent classification to detect out-of-scope queries.
- Provide a clear response indicating the question is outside the system's scope.
- Optionally suggest alternative resources.
31.11.5 Long Documents
When a single relevant document exceeds the context window:
- Use map-reduce: summarize each chunk, then summarize summaries.
- Use refine: iteratively refine the answer as new chunks are processed.
- Use the most relevant chunks based on sub-question decomposition.
31.12 Chunking Deep Dive: Practical Considerations
31.12.1 The Parent-Child Chunking Pattern
A powerful pattern is to index small chunks for precise retrieval but return larger parent chunks for context:
- Split documents into large "parent" chunks (e.g., 2000 tokens).
- Split each parent into small "child" chunks (e.g., 200 tokens).
- Index the child chunks for retrieval.
- When a child chunk is retrieved, return its parent chunk to the LLM.
This gives the best of both worlds: precise retrieval and rich context.
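A minimal sketch of the pattern using plain dictionaries: child chunks are embedded for retrieval, and a lookup table maps each child back to its parent, whose text is what the LLM finally sees. The documents list and embedding_model are assumed to be the loaded documents and embedding model from the end-to-end example in Section 31.13; LangChain's ParentDocumentRetriever packages a similar workflow if you prefer a built-in abstraction.
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
parent_store: dict[str, Document] = {}   # parent_id -> full parent chunk
child_docs: list[Document] = []          # small chunks that get embedded
for parent in parent_splitter.split_documents(documents):
    parent_id = str(uuid.uuid4())
    parent_store[parent_id] = parent
    for child in child_splitter.split_documents([parent]):
        child.metadata["parent_id"] = parent_id
        child_docs.append(child)
# Index only the children (precise retrieval)
child_vectorstore = Chroma.from_documents(child_docs, embedding_model)
def retrieve_parents(query: str, k: int = 4) -> list[Document]:
    """Search over small child chunks, but hand the larger parents to the LLM."""
    children = child_vectorstore.similarity_search(query, k=k)
    parent_ids = dict.fromkeys(c.metadata["parent_id"] for c in children)  # dedupe, keep order
    return [parent_store[pid] for pid in parent_ids]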
31.12.2 Overlapping Windows
Chunk overlap prevents information loss at boundaries:
$$\text{chunks}[i] = \text{tokens}[i \cdot (s - o) : i \cdot (s - o) + s]$$
where $s$ is the chunk size and $o$ is the overlap size. Typical overlap is 10-20% of chunk size.
31.12.3 Table and Image Handling
Tables and images require special treatment:
- Tables: Extract as structured data (CSV/JSON), embed the textual representation, and store the original format for display.
- Images: Use multimodal models to generate captions, embed the captions, and store image references.
- Charts/Diagrams: Use vision models to extract data and descriptions.
31.13 End-to-End RAG Pipeline Example
Let us walk through building a complete RAG system, tying together all the concepts we have discussed.
Step 1: Document Ingestion and Chunking
We begin by loading documents, splitting them into chunks, and enriching each chunk with metadata. The recursive character text splitter respects natural boundaries such as paragraphs and sentences while maintaining a target chunk size.
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load documents from a directory
loader = DirectoryLoader(
"./documents/",
glob="**/*.pdf",
loader_cls=PyMuPDFLoader,
show_progress=True,
)
documents = loader.load()
# Split into chunks with overlap
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} documents, split into {len(chunks)} chunks")
Step 2: Embedding and Indexing
Each chunk is embedded using a sentence transformer model and stored in ChromaDB along with its text and metadata. The embedding model transforms each chunk into a dense vector that captures its semantic meaning.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(
model_name="BAAI/bge-base-en-v1.5",
model_kwargs={"device": "cuda"},
encode_kwargs={"normalize_embeddings": True},
)
# Create and persist the vector store
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory="./chroma_db",
collection_name="my_rag_collection",
)
Step 3: Query Processing and Retrieval with Hybrid Search and Reranking
When a user submits a query, we embed the query using the same model, search the vector database for similar chunks, and optionally rerank the results using a cross-encoder for higher precision. Here we show a complete retrieval pipeline with BM25 hybrid search and cross-encoder reranking:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from sentence_transformers import CrossEncoder
# Dense retriever (vector search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks, k=50)
# Hybrid retriever (RRF fusion)
hybrid_retriever = EnsembleRetriever(
retrievers=[dense_retriever, bm25_retriever],
weights=[0.6, 0.4], # Favor dense retrieval
)
# Cross-encoder reranker
reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)
def retrieve_and_rerank(query: str, top_k: int = 5) -> list:
# Stage 1: Hybrid retrieval (fast)
candidates = hybrid_retriever.invoke(query)
# Stage 2: Cross-encoder reranking (precise)
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)
# Sort by reranker score
scored_docs = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
return [doc for _, doc in scored_docs[:top_k]]
Step 4: Prompt Construction and Generation
The retrieved chunks are assembled into a prompt with clear grounding instructions, and the LLM generates a response based on the provided context.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant. Answer the question based ONLY on the context below.
If the context does not contain enough information, say "I don't have enough
information to answer this question." Cite your sources using [Source: filename].
Context:
{context}
Question: {question}
Answer:""")
def generate_answer(query: str) -> str:
# Retrieve and rerank
top_docs = retrieve_and_rerank(query, top_k=5)
# Build context string with source attribution
context = "\n\n---\n\n".join(
f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
for doc in top_docs
)
# Generate response
chain = rag_prompt | llm
response = chain.invoke({"context": context, "question": query})
return response.content
Step 5: Post-Processing and Evaluation
The response is formatted with source citations, confidence scores, and any necessary disclaimers about information completeness. In a production system, you would also log the query, retrieved documents, and response for monitoring and evaluation.
def rag_pipeline(query: str) -> dict:
top_docs = retrieve_and_rerank(query, top_k=5)
answer = generate_answer(query)
return {
"answer": answer,
"sources": [doc.metadata.get("source", "unknown") for doc in top_docs],
"num_chunks_retrieved": len(top_docs),
"top_similarity_score": float(
reranker.predict([[query, top_docs[0].page_content]])[0]
) if top_docs else 0.0,
}
This pipeline can be extended with query transformation (HyDE, multi-query), multi-step retrieval, caching, and other advanced techniques as we discussed in previous sections. The key architectural insight is that each component—embedding, retrieval, reranking, generation—can be independently improved and swapped without changing the overall pipeline structure.
31.14 Comparison with Alternatives
RAG is not the only approach to giving LLMs domain-specific knowledge. Understanding the alternatives helps you choose the right tool:
31.14.1 RAG vs. Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Minutes (re-index) | Hours-days (retrain) |
| Source attribution | Yes (by design) | No |
| Hallucination control | Better (grounded) | Worse (baked in) |
| Cost to update | Low | High |
| Behavior modification | Limited | Strong |
| Domain adaptation | Good for facts | Good for style/behavior |
- When to use RAG: Factual QA, documentation search, knowledge-intensive tasks.
- When to use fine-tuning: Changing model behavior, learning domain-specific language, task-specific formatting.
- When to use both: Fine-tune for behavior, RAG for knowledge.
31.14.2 RAG vs. Long Context Windows
Modern LLMs support context windows of 100K+ tokens. Why not just dump all documents into the context?
- Cost: Processing 100K tokens per query is expensive ($0.10-1.00+ per query).
- Accuracy: LLMs struggle to find specific information in very long contexts ("lost in the middle" problem).
- Scalability: 100K tokens is roughly 75K words—many knowledge bases are far larger.
- Latency: Processing long contexts is slow.
RAG remains more practical for most applications, though long context windows are useful as a complement (e.g., stuffing more retrieved chunks).
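A quick back-of-the-envelope calculation makes the cost argument concrete. The per-token price below is an assumption for illustration, not a current list price:
PRICE_PER_INPUT_TOKEN = 5.00 / 1_000_000  # assumed $5 per million input tokens
long_context_tokens = 100_000  # dump a large document set into the context
rag_context_tokens = 4_000     # ~5 retrieved chunks of ~800 tokens each
print(f"Long-context cost per query: ${long_context_tokens * PRICE_PER_INPUT_TOKEN:.2f}")  # $0.50
print(f"RAG cost per query:          ${rag_context_tokens * PRICE_PER_INPUT_TOKEN:.2f}")   # $0.02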
31.15 Common Pitfalls and Debugging RAG Systems
Building a working RAG prototype is straightforward. Making it reliable in production is where the real engineering challenge lies. Here are the most common failure modes and how to diagnose them.
31.15.1 Poor Retrieval Quality
Symptom: The LLM generates incorrect or incomplete answers despite the information existing in the knowledge base.
Diagnosis: Inspect the retrieved chunks for each failing query (a small inspection helper is sketched after the list below). If the relevant chunk is not in the top-k, the problem is retrieval. Common causes:
- Embedding model mismatch: The embedding model does not understand domain-specific terminology. Solution: fine-tune the embedding model (Section 31.3.5) or add domain terms to the query via expansion.
- Chunk size too large: Important details are diluted by surrounding irrelevant text. Solution: reduce chunk size or use the parent-child pattern (Section 31.12.1).
- Missing metadata filters: The correct document exists but is buried under results from other documents. Solution: use metadata filtering to narrow the search space.
- Query-document asymmetry: The user's query uses different terminology than the documents. Solution: use hybrid search (Section 31.6.3) to capture both semantic and keyword matches.
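The inspection helper below is a minimal sketch built on the retrieve_and_rerank function from Section 31.13; the expected_source argument is a hypothetical convention for marking which file the answer should come from.
def inspect_retrieval(query: str, expected_source: str, k: int = 10) -> None:
    # Print the top-k reranked chunks and flag whether the expected source appears
    docs = retrieve_and_rerank(query, top_k=k)
    found = False
    for rank, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        print(f"{rank:2d}. {source}: {doc.page_content[:80]!r}")
        if expected_source in source:
            found = True
    print(f"Expected source found in top-{k}: {found}")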
31.15.2 Hallucination Despite Retrieved Context
Symptom: The LLM ignores the retrieved context and generates information from its parametric knowledge.
Diagnosis: Compare the generated response with the retrieved chunks. If the response contains claims not present in any chunk, the model is hallucinating. Common causes:
- Weak grounding instructions: The prompt does not instruct the model strongly enough to use only the provided context. Solution: strengthen the grounding prompt (Section 31.10.7).
- Irrelevant context: All retrieved chunks are irrelevant, so the model falls back to parametric knowledge. Solution: improve retrieval or implement a relevance threshold (a threshold sketch follows this list).
- Context too long: The relevant information is buried in the middle of a long context (the "lost in the middle" phenomenon). Solution: place the most relevant chunks first, or reduce the number of chunks.
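One way to implement that relevance threshold is to gate generation on the best reranker score. The sketch below reuses the reranker and generate_answer from Section 31.13; the cutoff value is an assumption, since raw cross-encoder scores are not calibrated, and should be tuned on labeled data.
RELEVANCE_THRESHOLD = 0.0  # assumed cutoff for raw reranker scores; tune on a validation set
def answer_with_threshold(query: str) -> str:
    top_docs = retrieve_and_rerank(query, top_k=5)
    if not top_docs:
        return "I don't have enough information to answer this question."
    best_score = float(reranker.predict([[query, top_docs[0].page_content]])[0])
    if best_score < RELEVANCE_THRESHOLD:
        # Nothing sufficiently relevant was retrieved; refuse rather than risk hallucination
        return "I don't have enough information to answer this question."
    return generate_answer(query, docs=top_docs)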
31.15.3 Systematic Debugging Workflow
For any RAG quality issue, follow this systematic debugging workflow:
- Isolate the component: Is the problem retrieval (wrong documents) or generation (wrong answer from right documents)?
- Inspect retrieval: For failing queries, manually examine the top-10 retrieved chunks and their scores. Is the relevant chunk present? What rank?
- Test with oracle retrieval: Manually insert the correct chunks into the prompt (a sketch follows this list). Does the LLM generate the right answer? If yes, the problem is retrieval. If no, the problem is generation.
- Ablate components: Try removing reranking, changing chunk size, switching embedding models, adjusting the prompt. Measure the impact of each change.
- Build regression tests: For every bug you fix, add the failing query to your evaluation set to prevent regressions.
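A minimal version of the oracle-retrieval test described in the workflow above might look like the sketch below, assuming you can hand-pick the ground-truth passages for a failing query; it reuses generate_answer from Section 31.13.
from langchain_core.documents import Document
def oracle_test(query: str, gold_texts: list[str]) -> str:
    # Bypass retrieval entirely: feed hand-picked ground-truth passages to the generator.
    # A correct answer here points to a retrieval bug; a wrong answer points to generation.
    gold_docs = [Document(page_content=text, metadata={"source": "oracle"}) for text in gold_texts]
    return generate_answer(query, docs=gold_docs)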
31.16 The Future of RAG
RAG is evolving rapidly. Key trends include:
- Multimodal RAG: Retrieving and reasoning over images, tables, and other modalities alongside text. Vision-language models can now generate embeddings for images, enabling retrieval of diagrams, charts, and photographs alongside text documents.
- Learned retrieval: End-to-end training of the retriever and generator together, so that the retriever learns to fetch exactly what the generator needs rather than what seems generically relevant.
- Structured RAG: Retrieving from structured data (SQL databases, knowledge graphs) alongside unstructured text. Text-to-SQL systems combined with RAG enable answering questions that require both factual knowledge and quantitative data.
- Personalized RAG: Adapting retrieval and generation based on user preferences and history. As we will explore in Chapter 32, agent memory systems can track user preferences to improve RAG relevance over time.
- Efficient indexing: Better compression, quantization, and streaming approaches for embedding storage. As discussed in Chapter 33, quantization techniques apply not only to model weights but also to embedding vectors.
- Evaluation maturity: More standardized benchmarks and evaluation frameworks, moving beyond RAGAS to comprehensive evaluation suites that test robustness, fairness, and cross-lingual performance.
Summary
Retrieval-Augmented Generation bridges the gap between the reasoning capabilities of large language models and the vast, ever-changing landscape of domain-specific knowledge. By combining dense retrieval with generative AI, RAG systems can deliver accurate, verifiable, and up-to-date responses without the cost and complexity of model retraining.
The key components of a production RAG system are:
- Embedding models that capture semantic meaning in dense vector representations.
- Vector databases that enable fast similarity search over millions of documents.
- Document processing pipelines that chunk and enrich documents for optimal retrieval.
- Retrieval strategies that combine dense, sparse, and hybrid approaches with reranking.
- Query transformation techniques that improve retrieval quality.
- Evaluation frameworks that measure faithfulness, relevance, and answer quality.
- Production engineering practices for scalability, monitoring, and cost optimization.
The field is moving fast, but the fundamental principles—semantic search, grounded generation, and systematic evaluation—will remain relevant regardless of which specific models and tools emerge next.
References
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS.
- Karpukhin, V., et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP.
- Gao, L., et al. (2022). "Precise Zero-Shot Dense Retrieval without Relevance Labels." ACL.
- Asai, A., et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR.
- Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR.
- Es, S., et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv.
- Edge, D., et al. (2024). "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv.
- Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data.
- Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP.
- Robertson, S., & Zaragoza, H. (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR.