Chapter 31 Quiz: Retrieval-Augmented Generation (RAG)
Instructions
Choose the best answer for each question. Each question has exactly one correct answer unless otherwise specified.
Question 1
What is the primary motivation for using RAG instead of relying solely on an LLM's parametric knowledge?
A) RAG systems are always faster than standalone LLMs
B) RAG provides access to up-to-date, verifiable, and domain-specific knowledge without retraining
C) RAG eliminates the need for large language models entirely
D) RAG reduces the size of the language model required
Answer: B
Explanation: RAG's primary advantage is augmenting the LLM with non-parametric knowledge: it can be updated without retraining, supports source attribution, and includes domain-specific information absent from the training data. RAG systems are not necessarily faster (they add retrieval latency) and still require an LLM for generation.
Question 2
In the RAG architecture, which of the following correctly describes the indexing pipeline order?
A) Embedding → Chunking → Parsing → Storage
B) Parsing → Embedding → Chunking → Storage
C) Parsing → Chunking → Embedding → Storage
D) Chunking → Parsing → Storage → Embedding
Answer: C
Explanation: Documents must first be parsed from their native formats, then chunked into appropriate segments, then embedded into vector representations, and finally stored in a vector database.
Question 3
What is the typical output of mean pooling in a sentence transformer?
A) A sequence of token-level embeddings
B) A single vector representing the entire input's semantic meaning
C) A probability distribution over the vocabulary
D) A binary classification label
Answer: B
Explanation: Mean pooling averages all token-level representations to produce a single fixed-dimensional vector that represents the semantic content of the entire input sequence.
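For concreteness, here is a minimal NumPy sketch of mean pooling; the function name and shapes are illustrative assumptions, not code from the chapter:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token-level vectors into a single sentence vector, ignoring padding.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # (dim,)
    count = max(float(mask.sum()), 1e-9)                           # avoid divide-by-zero
    return summed / count                                          # one fixed-size vector
```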
Question 4
Which FAISS index type provides exact nearest neighbor search?
A) IndexIVFFlat
B) IndexHNSWFlat
C) IndexFlatL2
D) IndexIVFPQ
Answer: C
Explanation: IndexFlatL2 performs brute-force search comparing the query against every stored vector, providing exact nearest neighbors. All other listed index types are approximate nearest neighbor algorithms that trade some accuracy for speed.
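A short usage sketch with the FAISS Python API (random vectors stand in for real embeddings):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                                               # embedding dimensionality
corpus = np.random.rand(10_000, d).astype("float32")  # toy corpus embeddings
queries = np.random.rand(5, d).astype("float32")      # toy query embeddings

index = faiss.IndexFlatL2(d)       # brute-force index: exact nearest neighbors
index.add(corpus)                  # flat indexes need no training step
distances, ids = index.search(queries, 4)   # exact top-4 per query
```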
Question 5
What is the primary advantage of HNSW (Hierarchical Navigable Small World) graphs for vector search?
A) They require the least memory of any ANN algorithm
B) They provide exact results with no approximation
C) They achieve approximately O(log N) search complexity with high recall
D) They work only with low-dimensional vectors
Answer: C
Explanation: HNSW constructs a multi-layer graph that enables greedy navigation from coarse to fine granularity, achieving O(log N) search complexity while maintaining high recall rates.
Question 6
In the BM25 scoring formula, what does the parameter b control?
A) Term frequency saturation
B) Document length normalization
C) Inverse document frequency weighting
D) Query term boosting
Answer: B
Explanation: The parameter b (typically 0.75) controls how much document length affects the score. When b=1, full length normalization is applied; when b=0, no length normalization occurs.
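For reference, the commonly stated BM25 scoring formula (notation may differ slightly from the chapter's) shows where b enters the length-normalization term:

$$
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\cdot
\frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\cdot\frac{|d|}{\mathrm{avgdl}}\right)}
$$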
Question 7
What is Reciprocal Rank Fusion (RRF) used for in hybrid search?
A) Training embedding models on both dense and sparse features
B) Combining ranked results from multiple retrieval methods into a single ranking
C) Converting sparse vectors into dense vectors
D) Computing BM25 scores using neural networks
Answer: B
Explanation: RRF is a score fusion technique that combines results from multiple rankers (e.g., dense and sparse retrieval) into a single unified ranking using reciprocal ranks, with the formula $\mathrm{RRF}(d) = \sum_{r} \frac{1}{k + \mathrm{rank}_r(d)}$.
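A minimal sketch of RRF over ranked lists of document IDs (the function name is illustrative; k = 60 is the constant used in the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (best first): each ranker adds 1 / (k + rank) to a doc's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense ranking and a BM25 ranking of the same corpus.
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
```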
Question 8
Why are cross-encoders more accurate than bi-encoders for ranking, yet not used for first-stage retrieval?
A) Cross-encoders produce lower-quality embeddings
B) Cross-encoders jointly process query-document pairs, making them too slow for comparing against millions of documents
C) Cross-encoders can only handle short documents
D) Cross-encoders require supervised training data that is rarely available
Answer: B
Explanation: Cross-encoders process the query and document together as a single input, enabling fine-grained token-level interactions. However, this means each query-document pair must be processed independently, making it computationally infeasible for large-scale first-stage retrieval.
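A reranking sketch, assuming the sentence-transformers library and a public MS MARCO cross-encoder checkpoint (the model name is an assumption; swap in whichever reranker you use):

```python
from sentence_transformers import CrossEncoder

# Rerank a small candidate set returned by a fast first-stage retriever.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What does the b parameter control in BM25?"
candidates = [
    "BM25's b parameter controls document length normalization.",
    "HNSW builds a multi-layer proximity graph for fast search.",
]

scores = reranker.predict([(query, doc) for doc in candidates])  # one joint pass per pair
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```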
Question 9
What is Hypothetical Document Embedding (HyDE)?
A) Embedding documents that do not yet exist in the corpus
B) Using an LLM to generate a hypothetical answer, then using its embedding for retrieval instead of the query embedding
C) Creating synthetic training data for embedding model fine-tuning
D) Reducing embedding dimensions through hypothesis testing
Answer: B
Explanation: HyDE generates a hypothetical answer to the query using an LLM, embeds this answer, and uses it for retrieval. The intuition is that the hypothetical answer is closer in embedding space to relevant documents than the short query alone.
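A HyDE sketch with placeholder helpers (llm_generate, embed, and index are hypothetical stand-ins for your own LLM call, embedding model, and vector index):

```python
import numpy as np

def hyde_search(query: str, llm_generate, embed, index, k: int = 5):
    """Retrieve with the embedding of a hypothetical answer instead of the raw query."""
    hypothetical = llm_generate(f"Write a short passage that answers: {query}")
    vec = np.asarray(embed(hypothetical), dtype="float32").reshape(1, -1)
    return index.search(vec, k)   # the answer embedding sits closer to relevant passages
```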
Question 10
In the parent-child chunking pattern, what is indexed for retrieval versus what is sent to the LLM?
A) Parent chunks are indexed; child chunks are sent to the LLM
B) Child chunks are indexed; parent chunks are sent to the LLM
C) Both parent and child chunks are indexed; only parents go to the LLM
D) Only parent chunks are used for both indexing and generation
Answer: B
Explanation: Small child chunks are indexed for precise retrieval (they more closely match specific queries), but when a child is retrieved, the larger parent chunk is sent to the LLM to provide richer context for generation.
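An illustrative mapping from retrieved child chunks back to their parents (toy data; real systems typically keep this mapping in the vector store's metadata or a document store):

```python
# Small child chunks are what gets embedded and retrieved;
# their larger parents are what gets sent to the LLM.
child_to_parent = {"child-1": "parent-A", "child-2": "parent-A", "child-3": "parent-B"}
parent_text = {"parent-A": "...full section A text...", "parent-B": "...full section B text..."}

def context_for(retrieved_child_ids: list[str]) -> list[str]:
    """Deduplicate parents so the same section is not sent to the LLM twice."""
    seen: set[str] = set()
    contexts: list[str] = []
    for child_id in retrieved_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            contexts.append(parent_text[parent_id])
    return contexts
```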
Question 11
What chunk size range is generally recommended as a starting point for most RAG applications?
A) 32-64 tokens
B) 256-1024 tokens
C) 2048-4096 tokens
D) 8192-16384 tokens
Answer: B
Explanation: Chunk sizes between 256 and 1024 tokens work well for most applications, with 512 tokens being a popular default. Very small chunks lose context, while very large chunks dilute relevance.
Question 12
What does "faithfulness" measure in RAG evaluation?
A) Whether the embedding model accurately represents document semantics
B) Whether the generated response only contains information supported by the retrieved context
C) Whether the user is satisfied with the response
D) Whether the retrieval system returns relevant documents
Answer: B
Explanation: Faithfulness (also called groundedness) measures whether every claim in the generated response is supported by the retrieved context. High faithfulness indicates low hallucination in the generation step.
Question 13
Which evaluation metric measures the ranking position of the first relevant document?
A) Recall@k
B) Precision@k
C) Mean Reciprocal Rank (MRR)
D) Normalized Discounted Cumulative Gain (nDCG)
Answer: C
Explanation: MRR computes the average of the reciprocal ranks of the first relevant document across all queries. If the first relevant document is at position 3, its reciprocal rank is 1/3.
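A small sketch of the MRR computation (function name and signature are illustrative):

```python
def mean_reciprocal_rank(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    """Average 1/rank of the first relevant document per query (0 when none is retrieved)."""
    total = 0.0
    for ranking, relevant_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)

# A query whose first relevant hit sits at position 3 contributes 1/3 to the average.
print(mean_reciprocal_rank([["d5", "d9", "d2"]], [{"d2"}]))  # 0.333...
```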
Question 14
In the context of RAG, what is the "lost in the middle" problem?
A) Documents indexed in the middle of the database are harder to retrieve
B) LLMs tend to focus on information at the beginning and end of long contexts, missing information in the middle
C) Embedding models poorly represent the middle tokens of long documents
D) Users tend to ask questions about topics in the middle of documents
Answer: B
Explanation: Research has shown that LLMs have a bias toward attending to information at the beginning and end of the context window, often missing or underweighting information presented in the middle, which is known as the "lost in the middle" phenomenon.
Question 15
What is the key advantage of semantic chunking over fixed-size chunking?
A) Semantic chunking is always faster
B) Semantic chunking produces chunks of exactly equal size
C) Semantic chunking respects topic boundaries by splitting where content shifts
D) Semantic chunking requires no embedding model
Answer: C
Explanation: Semantic chunking uses embedding similarity between consecutive sentences to identify natural topic boundaries, ensuring that chunks contain coherent, topically consistent content rather than arbitrary segments that might split mid-topic.
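A simplified semantic-chunking sketch; embed is a placeholder for your sentence-embedding function, and the 0.75 threshold is an arbitrary illustrative value that would normally be tuned per corpus:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[list[str]]:
    """Start a new chunk wherever cosine similarity between consecutive sentences drops."""
    vectors = np.asarray([embed(s) for s in sentences], dtype="float32")
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize for cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vectors[i - 1] @ vectors[i]) < threshold:      # likely topic boundary
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```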
Question 16
Which component of a RAG system is most likely to be the latency bottleneck?
A) Query embedding (10-50ms)
B) Vector search (5-50ms)
C) Document parsing at query time
D) LLM generation (500-2000ms)
Answer: D
Explanation: LLM generation typically takes 500-2000ms or more, making it the dominant component of end-to-end RAG latency. Query embedding and vector search together usually take less than 100ms.
Question 17
What is ColBERT's "MaxSim" operation?
A) Taking the maximum similarity between the query vector and all document vectors
B) For each query token embedding, finding the maximum similarity with any document token embedding, then summing across query tokens
C) Selecting the document with the maximum average similarity across all tokens
D) Computing the maximum of cosine and dot product similarity
Answer: B
Explanation: ColBERT's late interaction computes $\mathrm{score}(q, d) = \sum_{i} \max_{j} \, q_i \cdot d_j$, where for each query token embedding $q_i$, the maximum similarity with any document token embedding $d_j$ is found, and these maxima are summed to produce the final relevance score.
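In NumPy terms, MaxSim over token-embedding matrices looks roughly like this (shapes and names are illustrative):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction.

    query_tokens: (n_query_tokens, dim), doc_tokens: (n_doc_tokens, dim),
    both assumed L2-normalized so the dot product equals cosine similarity.
    """
    sim = query_tokens @ doc_tokens.T        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())      # max over doc tokens, summed over query tokens
```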
Question 18
In Graph RAG, what is the primary advantage over traditional text-based RAG?
A) It requires no vector database
B) It can answer global questions that require synthesizing information across the entire corpus
C) It is always faster than traditional RAG
D) It eliminates the need for an LLM
Answer: B
Explanation: Graph RAG builds community summaries of entity clusters in a knowledge graph, enabling answers to global/synthesis questions (e.g., "What are the main themes across all documents?") that traditional RAG, which retrieves individual chunks, struggles with.
Question 19
When should you prefer fine-tuning over RAG?
A) When you need to provide access to proprietary documents
B) When you need to change the model's behavior, style, or task-specific formatting
C) When you need to update knowledge frequently
D) When you need source attribution for answers
Answer: B
Explanation: Fine-tuning is better for changing model behavior, learning domain-specific language patterns, or enforcing specific output formats. RAG is preferred for factual knowledge, frequent updates, and source attribution.
Question 20
What does the temperature parameter $\tau$ control in the contrastive learning loss used for training embedding models?
A) The number of negative samples
B) The learning rate decay schedule
C) The sharpness of the similarity distribution — lower values make the distribution more peaked
D) The maximum sequence length
Answer: C
Explanation: The temperature parameter scales the similarity scores before softmax normalization. Lower temperatures make the distribution more peaked (sharper), forcing the model to make stronger distinctions between similar and dissimilar pairs.
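A common form of this contrastive (InfoNCE-style) loss, for a query $q$ with positive document $d^{+}$ and in-batch negatives $d_i$; the chapter's exact formulation may differ:

$$
\mathcal{L} = -\log\frac{\exp\!\big(\mathrm{sim}(q, d^{+}) / \tau\big)}{\sum_{i}\exp\!\big(\mathrm{sim}(q, d_i) / \tau\big)}
$$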
Question 21
In a production RAG system, which caching strategy provides the greatest latency improvement?
A) Caching embedding model weights in GPU memory
B) Caching LLM responses for identical query-context pairs
C) Caching vector database connection pools
D) Caching document parsing results
Answer: B
Explanation: Since LLM generation is the largest latency component (500-2000ms), caching complete responses for identical query-context pairs eliminates the most expensive operation. This is especially effective when the same questions are asked frequently.
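A minimal in-memory sketch of response caching keyed on the exact query-context pair (llm_generate is a placeholder for the real generation call; a production system would more likely use a shared cache such as Redis with a TTL):

```python
import hashlib

_response_cache: dict[str, str] = {}   # in-memory stand-in for a shared cache

def cached_generate(query: str, context: str, llm_generate) -> str:
    """Skip the expensive LLM call when this exact query-context pair was seen before."""
    key = hashlib.sha256(f"{query}\x00{context}".encode("utf-8")).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = llm_generate(query, context)
    return _response_cache[key]
```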
Question 22
Which of the following is NOT a valid approach for handling contradictory information in retrieved documents?
A) Present both perspectives with their sources
B) Prioritize more recent or authoritative sources
C) Silently choose one version and present it as definitive
D) Let the user decide which information to trust
Answer: C
Explanation: Silently choosing one version without transparency would be misleading. Proper handling involves acknowledging the contradiction, presenting multiple perspectives, citing sources, and potentially prioritizing based on recency or authority while being transparent about the conflict.
Question 23
What is the approximate storage requirement for 10 million 768-dimensional float32 vectors?
A) 3 GB
B) 30 GB
C) 300 GB
D) 3 TB
Answer: B
Explanation: Storage = 10,000,000 vectors × 768 dimensions × 4 bytes (float32) = 30,720,000,000 bytes, or approximately 30 GB.
Question 24
In multi-step retrieval (iterative RAG), what is the primary benefit?
A) It is faster than single-step retrieval
B) It can answer complex questions requiring information from multiple sources by decomposing and iteratively retrieving
C) It requires fewer retrieved documents overall
D) It eliminates the need for an embedding model
Answer: B
Explanation: Multi-step retrieval decomposes complex questions into sub-questions, retrieves information for each, and uses intermediate answers to inform subsequent retrieval. This enables answering questions that require synthesizing information across multiple documents or following chains of reasoning.
Question 25
What is the recommended minimum number of evaluation examples for a production RAG system?
A) 10-20
B) 50-100
C) 200+
D) 1000+
Answer: C
Explanation: While 50-100 examples can serve as a starting point, 200+ evaluation examples are recommended for production systems to ensure statistical reliability across different query types, difficulty levels, and edge cases.
Scoring Guide
| Score | Level | Recommendation |
|---|---|---|
| 23-25 | Expert | Ready for production RAG system design |
| 19-22 | Advanced | Strong understanding, review advanced patterns |
| 15-18 | Intermediate | Good foundation, practice implementation |
| 11-14 | Developing | Review core concepts, especially retrieval strategies |
| 0-10 | Beginning | Re-read the chapter focusing on architecture and components |