Chapter 31 Exercises: Retrieval-Augmented Generation (RAG)


Section 1: Embedding Models and Vector Representations

Exercise 31.1: Exploring Embedding Similarity

Write a Python script that uses sentence-transformers to embed the following five sentences and compute a 5x5 cosine similarity matrix. Visualize the matrix as a heatmap using matplotlib. Discuss which pairs have the highest and lowest similarity and why. A starting sketch follows the list of sentences.

  1. "The cat sat on the mat."
  2. "A feline rested on the rug."
  3. "Quantum computing uses qubits for parallel computation."
  4. "The dog chased the ball in the park."
  5. "A kitten was lying on the carpet."

Exercise 31.2: Embedding Model Comparison

Compare three embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, and BAAI/bge-small-en-v1.5) on a custom set of 20 query-document pairs. Measure retrieval accuracy (Recall@1, Recall@5) and embedding latency. Which model offers the best tradeoff for your use case?

Exercise 31.3: Dimensionality Reduction Analysis

Embed 100 diverse sentences using all-MiniLM-L6-v2 (384 dimensions). Apply PCA and t-SNE to reduce to 2D. Plot the results and color-code by topic. Does the embedding space cluster semantically similar sentences together? How does PCA compare to t-SNE for visualization?

Exercise 31.4: Embedding Normalization

Demonstrate the difference between cosine similarity and dot product similarity for normalized vs. unnormalized embeddings. Show that for L2-normalized embeddings, cosine similarity equals dot product. Why does this matter for FAISS index selection?
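
The key identity is that for unit-length vectors, cosine similarity reduces to the dot product. A short sketch of the check with random stand-in vectors (swap in real embeddings for the exercise):

```python
# Compare dot product and cosine similarity before and after L2 normalization.
# Random vectors stand in for embeddings; the identity holds for any vectors.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)   # e.g. 384-dim like MiniLM

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print("raw dot product:       ", float(np.dot(a, b)))       # differs from cosine
print("cosine similarity:     ", cosine(a, b))

a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
# For L2-normalized vectors the dot product equals the cosine similarity,
# which is why an inner-product index (e.g. faiss.IndexFlatIP) over normalized
# embeddings behaves like a cosine-similarity index.
print("normalized dot product:", float(np.dot(a_n, b_n)))   # equals cosine(a, b)
```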

Exercise 31.5: Custom Embedding Fine-Tuning

Using the sentence-transformers training API, fine-tune all-MiniLM-L6-v2 on a small domain-specific dataset of 500 query-passage pairs. Compare retrieval performance before and after fine-tuning on a held-out test set. What is the minimum amount of training data needed to see improvement?


Section 2: Vector Databases and Indexing

Exercise 31.6: FAISS Index Comparison

Create a dataset of 100,000 random 768-dimensional vectors. Build three FAISS indexes: IndexFlatL2, IndexIVFFlat (with 100 centroids), and IndexIVFPQ (with 100 centroids, 8 sub-quantizers). Compare query time, memory usage, and Recall@10 for 100 random queries, using the flat index as ground truth.
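
A sketch of the index construction only, with random placeholder data; timing, memory measurement, and the Recall@10 computation are the substance of the exercise, and the nprobe/nbits values shown are assumptions to vary:

```python
# Build the three FAISS indexes over 100,000 random 768-dimensional vectors.
import numpy as np
import faiss

d, n = 768, 100_000
xb = np.random.rand(n, d).astype("float32")    # database vectors
xq = np.random.rand(100, d).astype("float32")  # query vectors

flat = faiss.IndexFlatL2(d)                    # exact search, the ground truth
flat.add(xb)

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)    # 100 centroids
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 10                                # inverted lists probed per query

pq_quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(pq_quantizer, d, 100, 8, 8)  # 8 sub-quantizers, 8 bits each
ivfpq.train(xb)
ivfpq.add(xb)

_, ground_truth = flat.search(xq, 10)          # baseline neighbors for Recall@10
```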

Exercise 31.7: ChromaDB Collection Management

Build a ChromaDB collection with 1,000 text chunks from a public dataset (e.g., Wikipedia articles). Implement CRUD operations: add documents, query by similarity, update documents, and delete documents. Demonstrate metadata filtering (e.g., retrieve only documents from a specific category).
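
A compact sketch of the CRUD surface, with placeholder documents, metadata, and collection name; swap in your Wikipedia chunks and categories:

```python
# Basic ChromaDB CRUD with a metadata filter on the query.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
col = client.create_collection("wiki_chunks")

# Create
col.add(
    ids=["doc1", "doc2"],
    documents=["Paris is the capital of France.", "FAISS is a vector search library."],
    metadatas=[{"category": "geography"}, {"category": "software"}],
)

# Read: similarity query restricted to one category via a metadata filter
res = col.query(query_texts=["capital city of France"], n_results=1,
                where={"category": "geography"})

# Update and delete
col.update(ids=["doc2"], documents=["FAISS is a library for similarity search."])
col.delete(ids=["doc1"])
```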

Exercise 31.8: HNSW Parameter Tuning

Using FAISS's IndexHNSWFlat, experiment with different values of M (number of connections per layer: 8, 16, 32, 64) and efConstruction (construction-time exploration factor: 40, 100, 200). Measure index build time, query time, and Recall@10. Plot the recall-vs-latency tradeoff curve.

Exercise 31.9: Hybrid Index Design

Implement a hybrid storage system that uses FAISS for vector search and SQLite for metadata storage. The system should support: (a) adding documents with embeddings and metadata, (b) vector similarity search, (c) metadata-filtered search (e.g., "find similar documents from category X created after date Y").

Exercise 31.10: Scalability Analysis

Measure how query latency scales with dataset size for different FAISS index types. Start with 10,000 vectors and double repeatedly to 1,280,000. Plot query latency vs. dataset size for IndexFlatL2, IndexIVFFlat, and IndexHNSWFlat. At what scale does approximate search become necessary?


Section 3: Document Processing and Chunking

Exercise 31.11: Chunking Strategy Comparison

Take a 10,000-word document and chunk it using four strategies: (a) fixed-size (512 tokens, 50 overlap), (b) sentence-based, (c) paragraph-based, and (d) recursive character splitting. For each strategy, report the number of chunks, average chunk size, and evaluate retrieval quality on 10 test queries.
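
A sketch of strategy (a) only, using tiktoken for token counting (an assumption; any tokenizer with encode/decode works); the other strategies can feed the same evaluation harness:

```python
# Fixed-size chunking: a sliding window of chunk_tokens, reusing overlap tokens.
import tiktoken

def fixed_size_chunks(text: str, chunk_tokens: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start, step = [], 0, chunk_tokens - overlap
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        start += step
    return chunks
```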

Exercise 31.12: Optimal Chunk Size Experiment

Using a Q&A dataset with known ground truth passages, experiment with chunk sizes of 128, 256, 512, 1024, and 2048 tokens. Measure Recall@5 for each configuration. Plot the results and identify the optimal chunk size. How does chunk size interact with the embedding model's maximum token length?

Exercise 31.13: Semantic Chunking Implementation

Implement a semantic chunking algorithm that: (a) splits text into sentences, (b) embeds each sentence, (c) computes cosine similarity between consecutive sentences, (d) identifies split points where similarity drops below a threshold. Compare the resulting chunks against fixed-size chunking on a retrieval task.
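
A minimal sketch of steps (a)-(d), assuming a regex sentence splitter and a similarity threshold of 0.6; both are starting points to tune:

```python
# Semantic chunking: start a new chunk wherever consecutive-sentence similarity drops.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine, since rows are unit norm
        if sim < threshold:                      # similarity drop -> split point
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```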

Exercise 31.14: Parent-Child Chunking

Implement the parent-child chunking pattern where small child chunks (200 tokens) are indexed for retrieval but parent chunks (1000 tokens) are returned to the LLM. Compare retrieval precision and generation quality against standard single-level chunking.

Exercise 31.15: Document Format Handling

Build a document processing pipeline that handles PDF, Markdown, and HTML formats. For each format, extract clean text while preserving structure (headers, lists, tables). Test on at least 3 documents per format and evaluate extraction quality.


Section 4: Retrieval Strategies

Exercise 31.16: Dense vs. Sparse Retrieval

Implement both dense retrieval (using sentence-transformers + FAISS) and sparse retrieval (using BM25 via rank_bm25). Compare performance on 50 queries, measuring Recall@5 and MRR. Identify query types where each approach excels.
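
A sketch of the sparse side with rank_bm25 on placeholder documents; the dense side can reuse the sentence-transformers + FAISS setup from the earlier exercises:

```python
# BM25 retrieval with rank_bm25; tokenization here is deliberately simple.
from rank_bm25 import BM25Okapi

corpus = ["FAISS builds approximate nearest neighbor indexes.",
          "BM25 scores documents by term frequency and document length.",
          "ChromaDB stores embeddings with metadata."]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "how does bm25 score documents".lower().split()
scores = bm25.get_scores(query)                   # one score per corpus document
top5 = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:5]
```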

Exercise 31.17: Hybrid Search Implementation

Implement Reciprocal Rank Fusion (RRF) to combine dense and sparse retrieval results. Use $k=60$ as the RRF constant. Evaluate whether hybrid search outperforms either individual approach on your test set. Experiment with different values of $k$ (20, 40, 60, 80, 100).
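
RRF gives each document a fused score of $\sum_r \frac{1}{k + \text{rank}_r(d)}$ over the ranked lists $r$ in which it appears. A minimal sketch with placeholder document IDs:

```python
# Reciprocal Rank Fusion over any number of ranked lists of document IDs.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ids = ["d3", "d1", "d7", "d2"]   # placeholder dense results
sparse_ids = ["d1", "d9", "d3", "d4"]  # placeholder BM25 results
fused = rrf_fuse([dense_ids, sparse_ids], k=60)
```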

Exercise 31.18: Reranking Pipeline

Build a two-stage retrieval pipeline: (a) retrieve the top-50 documents using dense search, (b) rerank using cross-encoder/ms-marco-MiniLM-L-12-v2. Compare the top-5 results before and after reranking. Measure the improvement in Precision@5 and MRR.
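
A sketch of the second stage, using sentence-transformers' CrossEncoder wrapper; the candidate list here is a placeholder standing in for the top-50 dense results:

```python
# Rerank first-stage candidates with a cross-encoder and keep the top 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
query = "how do I tune FAISS HNSW parameters"
candidates = ["Set M and efConstruction when building the index...",
              "ChromaDB supports metadata filters...",
              "Higher efSearch improves recall at the cost of latency..."]

scores = reranker.predict([(query, doc) for doc in candidates])  # one score per pair
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
top5 = reranked[:5]
```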

Exercise 31.19: Maximum Marginal Relevance

Implement Maximum Marginal Relevance (MMR) for diverse retrieval: $$\text{MMR} = \arg\max_{d \in R \setminus S} \left[ \lambda \cdot \text{sim}(q, d) - (1 - \lambda) \cdot \max_{d' \in S} \text{sim}(d, d') \right]$$ Test with $\lambda$ values of 0.5, 0.7, and 0.9. Show how lower $\lambda$ values increase diversity at the cost of relevance.
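
A greedy selection sketch that follows the formula directly, assuming the query embedding and document embeddings are already L2-normalized numpy arrays:

```python
# Greedy MMR: repeatedly pick the candidate maximizing relevance minus redundancy.
import numpy as np

def mmr_select(query_emb: np.ndarray, doc_embs: np.ndarray,
               k: int = 5, lam: float = 0.7) -> list[int]:
    selected: list[int] = []
    remaining = list(range(len(doc_embs)))
    relevance = doc_embs @ query_emb                  # sim(q, d) for every candidate
    while remaining and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((float(doc_embs[i] @ doc_embs[j]) for j in selected),
                             default=0.0)             # max sim(d, d') over selected
            return lam * float(relevance[i]) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```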

Exercise 31.20: Multi-Index Retrieval

Build a system that retrieves from multiple ChromaDB collections (e.g., one for documentation, one for FAQ, one for code examples) and merges results using RRF. Demonstrate that multi-index retrieval provides better answers for queries that span multiple knowledge sources.


Section 5: Query Transformation and Advanced RAG

Exercise 31.21: HyDE Implementation

Implement the Hypothetical Document Embedding (HyDE) technique. For 20 test queries: (a) embed the raw query and retrieve top-5, (b) generate a hypothetical answer, embed it, and retrieve top-5. Compare Recall@5 for both approaches. When does HyDE help and when does it hurt?
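
A sketch of the two retrieval paths; `generate_hypothetical_answer` is a placeholder for whatever LLM call you use, and the FAISS index is assumed to be an inner-product index over normalized embeddings:

```python
# Baseline retrieval embeds the raw query; HyDE embeds a generated pseudo-answer.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_hypothetical_answer(query: str) -> str:
    # Placeholder: prompt your LLM with something like
    # "Write a short passage that answers the question: {query}"
    raise NotImplementedError

def retrieve(index, text: str, k: int = 5):
    emb = model.encode([text], normalize_embeddings=True)
    _, ids = index.search(emb, k)          # FAISS inner-product search assumed
    return ids[0]

def hyde_retrieve(index, query: str, k: int = 5):
    return retrieve(index, generate_hypothetical_answer(query), k)

# Compare Recall@5 of retrieve(index, query) vs. hyde_retrieve(index, query).
```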

Exercise 31.22: Multi-Query Generation

Use an LLM to generate 3 alternative queries for each of 15 original queries. Retrieve documents for all queries and combine the results by taking the union and deduplicating. Measure the increase in recall compared to single-query retrieval.

Exercise 31.23: Step-Back Prompting

Implement step-back prompting for 10 specific technical questions. Generate a more general version of each query, retrieve documents for both the original and step-back queries, and compare the quality of generated answers.

Exercise 31.24: Self-RAG Implementation

Build a simplified Self-RAG system that: (a) decides whether retrieval is needed, (b) retrieves documents if needed, (c) evaluates the relevance of retrieved documents, (d) generates a response, (e) evaluates whether the response is supported by the documents. Test on 10 queries, including 3 that don't require retrieval.

Exercise 31.25: Corrective RAG Pipeline

Implement a CRAG-inspired pipeline that: (a) retrieves documents, (b) uses an LLM to classify each document as "Correct," "Ambiguous," or "Incorrect," (c) if mostly incorrect, refines the query and re-retrieves. Test on queries where initial retrieval fails.


Section 6: RAG Evaluation

Exercise 31.26: Evaluation Dataset Construction

Create a RAG evaluation dataset with 30 entries, each containing: a question, a ground truth answer, and 3-5 relevant document passages. Include factual questions, analytical questions, comparison questions, and questions that should be refused (no relevant documents exist).

Exercise 31.27: Faithfulness Evaluation

Implement a faithfulness evaluator that: (a) extracts claims from a generated answer, (b) checks each claim against the retrieved context, (c) computes a faithfulness score. Test on 15 generated responses, including 5 that contain hallucinated information.
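
A skeleton of the evaluator, where `extract_claims` and `claim_is_supported` are placeholders for the LLM (or NLI model) calls you supply; the score is simply the fraction of claims supported by the context:

```python
# Faithfulness = supported claims / total claims extracted from the answer.
def extract_claims(answer: str) -> list[str]:
    # Placeholder: prompt an LLM to list the atomic factual claims in `answer`.
    raise NotImplementedError

def claim_is_supported(claim: str, context: str) -> bool:
    # Placeholder: ask an LLM or NLI model whether `context` entails `claim`.
    raise NotImplementedError

def faithfulness_score(answer: str, context: str) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 1.0                      # nothing asserted, nothing to contradict
    supported = sum(claim_is_supported(c, context) for c in claims)
    return supported / len(claims)
```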

Exercise 31.28: RAGAS Implementation

Use the RAGAS framework to evaluate a RAG system on your evaluation dataset. Compute faithfulness, answer relevancy, context precision, and context recall. Identify the weakest metric and propose improvements.

Exercise 31.29: A/B Testing Framework

Design and implement an A/B testing framework for RAG configurations. Compare two configurations (e.g., different chunk sizes, with/without reranking) on the same set of queries. Use statistical significance testing (paired t-test or bootstrap) to determine which is better.

Exercise 31.30: Human Evaluation Protocol

Design a human evaluation protocol for RAG responses. Define a rating rubric (1-5 scale) for relevance, accuracy, completeness, and clarity. Have 3 evaluators rate 20 responses. Compute inter-annotator agreement (e.g., Fleiss' kappa or average pairwise Cohen's kappa, since Cohen's kappa is defined for only two raters) and report the results.


Section 7: Production RAG Systems

Exercise 31.31: FastAPI RAG Service

Build a complete RAG service using FastAPI with endpoints for: (a) document ingestion (POST /ingest), (b) querying (POST /query), (c) collection management (GET /collections, DELETE /collections/{id}). Include proper error handling, request validation, and response schemas.
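
A skeleton of the endpoint surface with pydantic request models; the bodies are placeholders to wire up to your ingestion and retrieval pipeline:

```python
# FastAPI skeleton for the RAG service; RAG internals are left as placeholders.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG service")

class IngestRequest(BaseModel):
    collection: str
    documents: list[str]

class QueryRequest(BaseModel):
    collection: str
    question: str
    top_k: int = 5

@app.post("/ingest")
def ingest(req: IngestRequest):
    if not req.documents:
        raise HTTPException(status_code=422, detail="No documents provided")
    # chunk, embed, and store req.documents here
    return {"collection": req.collection, "ingested": len(req.documents)}

@app.post("/query")
def query(req: QueryRequest):
    # retrieve req.top_k chunks and generate an answer here
    return {"answer": "...", "sources": []}

@app.get("/collections")
def list_collections():
    return {"collections": []}

@app.delete("/collections/{collection_id}")
def delete_collection(collection_id: str):
    return {"deleted": collection_id}
```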

Exercise 31.32: Caching Layer

Implement a caching layer for your RAG system that: (a) caches query embeddings, (b) caches retrieval results for identical queries, (c) caches LLM responses for identical query-context pairs. Measure the cache hit rate and latency improvement over 100 repeated and 100 unique queries.
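
A sketch of the idea with in-process dictionaries keyed by content hashes; a production deployment might back these with Redis, but the interface stays the same:

```python
# Three caches: query embeddings, retrieval results, and generated responses.
import hashlib

embedding_cache: dict[str, object] = {}
retrieval_cache: dict[str, list[str]] = {}
response_cache: dict[str, str] = {}

def cache_key(*parts: str) -> str:
    return hashlib.sha256("||".join(parts).encode()).hexdigest()

def cached_embed(query: str, embed_fn):
    key = cache_key(query)
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(query)        # miss: compute the embedding
    return embedding_cache[key]

def cached_retrieve(query: str, retrieve_fn):
    key = cache_key(query)
    if key not in retrieval_cache:
        retrieval_cache[key] = retrieve_fn(query)     # miss: run retrieval
    return retrieval_cache[key]

def cached_generate(query: str, context: str, generate_fn):
    key = cache_key(query, context)                   # identical query + context hits
    if key not in response_cache:
        response_cache[key] = generate_fn(query, context)
    return response_cache[key]
```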

Exercise 31.33: Monitoring Dashboard

Build a simple monitoring system that tracks: (a) query latency breakdown (embedding, retrieval, reranking, generation), (b) retrieval quality metrics (average similarity score, number of results above threshold), (c) usage statistics (queries per minute, token usage). Log to a JSON file and create summary statistics.

Exercise 31.34: Document Update Pipeline

Implement an incremental document update pipeline that: (a) detects when documents have changed, (b) re-chunks and re-embeds only changed documents, (c) updates the vector database without downtime, (d) maintains document version history. Test with a corpus where 10% of documents change.

Exercise 31.35: Error Handling and Fallbacks

Implement robust error handling for a RAG pipeline that gracefully handles: (a) vector database connection failures (retry with backoff), (b) empty retrieval results (fallback to broader search), (c) LLM API timeouts (cached or default responses), (d) malformed user input (input validation and sanitization).


Section 8: Integration and Synthesis

Exercise 31.36: End-to-End RAG Benchmark

Build a complete RAG system and benchmark it on a standardized dataset (e.g., Natural Questions or SQuAD adapted for RAG). Report end-to-end accuracy, latency, and cost per query. Compare at least two configurations (e.g., different embedding models or chunk sizes).

Exercise 31.37: RAG vs. Long Context

For a corpus of 50 documents (approximately 100K tokens total): (a) implement a RAG system that retrieves the top-5 chunks, (b) implement a long-context approach that puts all documents in the prompt. Compare answer quality, latency, and cost. When is each approach preferable?

Exercise 31.38: Multi-Modal RAG

Extend a text RAG system to handle images: (a) use a vision model to generate captions for images, (b) embed and index the captions alongside text chunks, (c) retrieve and display relevant images in responses. Test with a small corpus containing both text and images.

Exercise 31.39: RAG System Design Document

Write a technical design document for a RAG system serving 1,000 queries per minute over a corpus of 10 million documents. Address: embedding model selection, vector database choice, indexing strategy, caching strategy, scaling approach, estimated costs, and SLA targets.

Exercise 31.40: Comparative RAG Analysis

Implement three RAG variants: (a) naive RAG (basic retrieve-and-generate), (b) advanced RAG (with HyDE + reranking), (c) agentic RAG (with query decomposition and iterative retrieval). Compare all three on 20 queries of varying complexity. Present results in a table with metrics for retrieval quality, answer quality, latency, and cost.