Chapter 21: AI-Powered Workflows

"A language model that answers from memory is a confident guesser. A language model that answers from your documents is a research assistant. The difference is architecture, not intelligence."

— Professor Diane Okonkwo, MBA 7620


The Sixty-Day Return Policy That Never Existed

NK pulls up a customer service transcript on the classroom screen. The interaction is from Athena Retail Group's pilot chatbot — the one Ravi's team launched three weeks ago to handle routine customer inquiries.

Customer: Hi, I bought a coat during the holiday sale and I'd like to return it. What's the return window?

Athena Chatbot: Thank you for your question! For items purchased during our holiday promotion, Athena offers a generous 60-day return window. Simply bring the item to any Athena store with your receipt, and we'll process your return promptly.

NK pauses. "Sounds helpful, right? Confident, polite, specific. There's just one problem." She switches to a second slide — a screenshot from Athena's actual Return Policy Manual, version 4.2, dated November 2025. The relevant section reads: Holiday promotional purchases are eligible for return within 45 days of purchase date. Standard return policies (30 days) apply to all non-promotional items.

"Forty-five days," NK says. "Not sixty. The chatbot hallucinated a plausible but incorrect answer. It sounded authoritative. It cited no sources. And if a customer acted on that information and showed up on day 55, Athena would face a choice between honoring a policy that doesn't exist or telling a customer that the company's own chatbot lied to them."

Tom, sitting in the front row, leans forward. "How many queries is this thing handling?"

NK glances at Ravi, who is attending class as a guest today. Ravi clears his throat. "About 1,200 per day. We audited a random sample of 200 responses last week. Twenty-eight percent contained at least one factual error about Athena policies. Some were subtle — wrong deadline here, wrong eligibility condition there. Some were completely fabricated."

The room goes quiet.

Professor Okonkwo steps forward. "This is not a failure of the language model. GPT-4, Claude, Gemini — these models were not trained on Athena's internal policy documents. They have no knowledge of your 45-day holiday return window. When asked a question they cannot answer from training data, they do what language models do: they generate the most plausible-sounding response. And plausible is not the same as accurate."

She writes two words on the whiteboard: RETRIEVAL and GENERATION.

"This is why RAG exists. Retrieval-Augmented Generation. Instead of asking the model to remember your policies, you retrieve the relevant documents first and then ask the model to generate an answer based on what it just read. The model stops being a rememberer and starts being a reader. And readers can cite their sources."

NK opens her notebook and writes: RAG = make the model read the actual document before answering. Why wasn't this the default from day one?

Tom writes in his notebook: Chunking strategy, embedding model, vector DB, retrieval quality — this is a systems engineering problem, not a prompt engineering problem.

They are both asking the right questions.


The RAG Paradigm: Why Grounding Matters

The Hallucination Problem

In Chapter 17, we introduced large language models — neural networks trained on vast corpora of text that can generate remarkably fluent, contextually appropriate language. In Chapters 19 and 20, we explored how prompt engineering techniques can steer these models toward more useful, structured, and reliable outputs.

But prompt engineering, no matter how sophisticated, cannot solve a fundamental limitation: language models generate responses based on statistical patterns learned during training, not from verified facts. When a model encounters a question about information it was never trained on — or information that changed after its training cutoff — it does not say "I don't know." It generates the most statistically probable continuation of the text. This is hallucination.

Definition: A hallucination in the context of large language models is a response that is fluent and plausible-sounding but factually incorrect, unsupported, or entirely fabricated. Hallucinations are not bugs — they are a natural consequence of how language models generate text. The model is optimizing for linguistic coherence, not factual accuracy.

Hallucination rates vary by model, domain, and question type, but studies consistently find that even the most capable models hallucinate on 5-20 percent of factual queries. For general knowledge questions where training data is abundant, rates are lower. For domain-specific, proprietary, or time-sensitive information — exactly the kind of information businesses care about most — rates are significantly higher.

For an enterprise deploying LLMs against internal knowledge bases, this is not an acceptable error rate. A financial advisor chatbot that fabricates compliance rules. A healthcare system that invents drug interaction warnings. A customer service bot that promises return policies that do not exist. These are not minor inconveniences — they are legal liabilities, reputational risks, and operational hazards.

The RAG Solution

Retrieval-Augmented Generation, introduced by Lewis et al. in a 2020 paper from Facebook AI Research (now Meta AI), offers an elegant architectural solution: instead of relying solely on the model's parametric knowledge (what it learned during training), supplement it with non-parametric knowledge (documents retrieved at query time).

The core insight is simple and powerful:

  1. When a user asks a question, retrieve the most relevant documents from a knowledge base.
  2. Insert those documents into the prompt as context.
  3. Ask the model to generate an answer based on the retrieved context.

The model no longer needs to "know" the answer. It needs to read the answer from the documents you provide and synthesize a response. This is the difference between asking someone to answer a question from memory versus handing them a reference manual and asking them to find and explain the relevant section.
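As a concrete (toy) sketch, the augmentation step is little more than string assembly. The function name and instruction wording below are illustrative, not any particular framework's API:

```python
# Minimal retrieve-then-generate sketch. In a real system the chunks
# would come from a vector database and the prompt would go to an LLM;
# here we only show how retrieved context is inserted into the prompt.

def build_grounded_prompt(query: str, chunks: list[str]) -> str:
    """Insert retrieved chunks into the prompt and instruct the model
    to answer ONLY from that context, citing sources by number."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "Holiday promotional purchases are eligible for return within 45 days.",
    "Standard return policies (30 days) apply to non-promotional items.",
]
prompt = build_grounded_prompt("What is the holiday return window?", chunks)
print(prompt)
```

The explicit "say you don't know" instruction is what keeps the model a reader rather than a guesser: it gives the LLM a sanctioned alternative to fabricating an answer.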

Business Insight: RAG does not eliminate hallucination entirely — the model can still misinterpret retrieved documents or generate answers that go beyond the provided context. But it dramatically reduces hallucination rates for domain-specific queries. More importantly, RAG enables citation: the system can point to exactly which documents informed its answer, allowing humans to verify responses. For regulated industries and enterprise applications, this traceability is often a hard requirement.

Why Not Just Fine-Tune?

A natural question: why not fine-tune the language model on your proprietary data instead of retrieving documents at query time? Fine-tuning — updating the model's weights using your own data — is a legitimate approach, and we discussed it in Chapter 17. But RAG and fine-tuning serve different purposes and have different trade-offs:

| Dimension | Fine-Tuning | RAG |
| --- | --- | --- |
| Knowledge update speed | Slow — requires retraining | Fast — update the document store |
| Data freshness | Stale until retrained | Current as of last document update |
| Cost | High (GPU compute for training) | Lower (embedding + storage + retrieval) |
| Traceability | Low — knowledge baked into weights | High — can cite source documents |
| Hallucination control | Moderate — model may still confabulate | Strong — answers grounded in retrieved text |
| Best for | Teaching the model a new style or format | Grounding the model in specific facts and documents |

In practice, many production systems combine both approaches: fine-tuning to teach the model the desired tone, format, and domain vocabulary, and RAG to ground its answers in current, authoritative documents. But if you must choose one, RAG is almost always the better starting point for enterprise knowledge applications.

Caution

Fine-tuning on proprietary data introduces data security considerations that RAG avoids. When you fine-tune a model, your data becomes embedded in the model's weights — potentially extractable through adversarial prompting. With RAG, your documents remain in a separate, access-controlled store. We will explore this distinction further in Chapter 29 (Privacy, Security, and AI).


RAG Architecture: The Complete Pipeline

A RAG system is not a single component — it is a pipeline of interconnected stages. Understanding the full pipeline is essential for building, evaluating, and improving RAG systems. Here is the end-to-end architecture:

The Two Phases of RAG

RAG operates in two distinct phases:

Phase 1: Indexing (offline, done once or periodically)

  1. Load documents from their source (PDFs, databases, wikis, APIs)
  2. Chunk documents into smaller, semantically meaningful pieces
  3. Embed each chunk into a vector representation
  4. Store embeddings in a vector database with metadata

Phase 2: Querying (online, done per user request)

  1. Receive a user query
  2. Embed the query using the same embedding model
  3. Retrieve the most similar document chunks from the vector database
  4. Augment the prompt with the retrieved chunks as context
  5. Generate a response using an LLM, instructed to answer from the provided context
  6. Return the response with source citations

┌─────────────────────────────────────────────────────────────────┐
│                     INDEXING PHASE (Offline)                     │
│                                                                 │
│  Documents → Chunking → Embedding → Vector Database (Store)    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    QUERYING PHASE (Online)                       │
│                                                                 │
│  User Query → Embed Query → Retrieve Similar Chunks →           │
│  Augment Prompt → LLM Generation → Response + Citations         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Each stage involves design decisions that significantly affect system quality. A poorly chunked document will produce poor retrievals regardless of how good the embedding model or LLM is. A brilliant retrieval strategy is wasted if the LLM prompt does not instruct the model to stay grounded in the retrieved context. Let us examine each stage in depth.


Embeddings Deep Dive: Text as Vectors

What Is an Embedding?

At the heart of RAG — and indeed at the heart of modern NLP — is the concept of embeddings. An embedding is a numerical representation of text as a vector (a list of numbers) in a high-dimensional space, where semantically similar texts are mapped to nearby points.

Definition: An embedding is a dense vector representation of text (or other data) in a continuous vector space. Embedding models are trained so that texts with similar meanings produce vectors that are close together, as measured by cosine similarity or other distance metrics. A sentence like "What is the return policy?" and "How do I return an item?" would have embeddings that are very close to each other, even though they share few words.

Consider these three sentences:

  1. "What is Athena's return policy for holiday purchases?"
  2. "How do I return items bought during the holiday sale?"
  3. "What are Athena's quarterly revenue figures?"

An embedding model would produce vectors where sentences 1 and 2 are very close together (they are semantically similar — both ask about holiday returns) and sentence 3 is far from both (it is about revenue, not returns). This is true even though sentence 1 and sentence 3 share the word "Athena" and sentences 1 and 2 share relatively few exact words.

This is the magic of semantic search over keyword search. Traditional keyword matching would rank sentence 3 higher than sentence 2 for a query containing "Athena" and "policy." Embedding-based search understands meaning, not just word overlap.

How Embeddings Work

Embedding models are neural networks — typically transformer architectures (Chapter 14) — trained on massive text corpora using objectives that encourage the model to produce similar representations for semantically similar texts. The most common training approaches include:

  • Contrastive learning: The model learns to produce similar embeddings for text pairs known to be related (e.g., a question and its answer, a sentence and its paraphrase) and dissimilar embeddings for unrelated pairs.
  • Masked language modeling: The model predicts missing words in context, learning rich representations of meaning as a byproduct.

The output is a vector — typically 384 to 3,072 dimensions depending on the model — that captures the semantic content of the input text.

Embedding Models for Business Applications

Several embedding models are widely used in production RAG systems:

| Model | Dimensions | Provider | Notes |
| --- | --- | --- | --- |
| text-embedding-3-small | 1,536 | OpenAI | Good balance of quality and cost |
| text-embedding-3-large | 3,072 | OpenAI | Highest quality, higher cost |
| all-MiniLM-L6-v2 | 384 | Sentence Transformers (open source) | Fast, free, good for prototyping |
| all-mpnet-base-v2 | 768 | Sentence Transformers (open source) | Better quality, still free |
| voyage-3 | 1,024 | Voyage AI | Strong retrieval performance |
| embed-v3 | 1,024 | Cohere | Enterprise-focused |

Business Insight: The choice of embedding model matters, but it matters less than most people think. In benchmarks, the difference between a good embedding model and the best embedding model is often 2-5 percent in retrieval quality. The difference between good chunking and bad chunking can be 20-40 percent. Spend your optimization budget on chunking and retrieval strategy before obsessing over embedding models.

Measuring Similarity

Once texts are embedded as vectors, we need a way to measure how "close" two vectors are. The most common metric is cosine similarity:

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}$$

Cosine similarity ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), with 0 indicating orthogonal, unrelated vectors. In practice, most embedding models produce similarity values between 0 and 1 for text.

Other distance metrics include Euclidean distance and dot product, but cosine similarity is the default for most RAG applications because it normalizes for vector magnitude — a longer document does not automatically appear "more similar" simply because its vector has larger values.
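The formula translates directly into code. Here is a minimal pure-Python implementation, with toy 3-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(A, B) = (A . B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the magnitude
v3 = [3.0, -1.0, 0.0]  # a different direction

print(cosine_similarity(v1, v2))  # ~1.0: magnitude does not affect the score
print(cosine_similarity(v1, v3))  # close to 0: nearly unrelated
```

Note that v2 scores a perfect match with v1 despite being twice as long. This is exactly the normalization property described above: a longer document does not look "more similar" just because its vector is larger.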


Vector Databases: Storing and Searching Embeddings

Why Not Just Use a Regular Database?

If embeddings are just lists of numbers, why can't you store them in a PostgreSQL table and query them with SQL? Technically, you can. But a traditional database query looks like this: SELECT * FROM products WHERE category = 'shoes' AND price < 100. This is exact matching — the database checks each record against precise conditions.

Vector search requires something fundamentally different: approximate nearest neighbor (ANN) search. Given a query vector, find the K vectors in the database that are most similar to it. In a database with millions of vectors, each with 1,536 dimensions, brute-force comparison is computationally infeasible. Vector databases solve this with specialized indexing algorithms that trade a small amount of accuracy for dramatic improvements in speed.

Definition: A vector database is a database purpose-built for storing, indexing, and querying high-dimensional vectors. Vector databases use approximate nearest neighbor (ANN) algorithms to find similar vectors in milliseconds, even across millions or billions of records.

Leading Vector Databases

The vector database landscape has exploded since 2023. Here are the key players:

ChromaDB is an open-source, lightweight vector database designed for rapid prototyping and small-to-medium deployments. It runs in-memory, embeds natively with Python, and requires zero infrastructure setup. For learning and building MVPs, ChromaDB is the fastest path to a working RAG system. We will use ChromaDB in our code examples.

Pinecone is a fully managed, cloud-native vector database designed for production workloads. It handles scaling, replication, and infrastructure management, allowing teams to focus on application logic rather than database operations. Pinecone's managed nature makes it popular with enterprise teams that want to deploy RAG without managing infrastructure.

Weaviate is an open-source vector database with built-in ML model integration. It can vectorize data at import time, supports hybrid (keyword + vector) search out of the box, and offers both self-hosted and managed cloud options. Its GraphQL API and multi-modal capabilities (text, images, video) make it flexible for complex applications.

Milvus / Zilliz is an open-source vector database built for billion-scale vector search. It offers the most advanced indexing options (IVF, HNSW, DiskANN) and is designed for teams that need to handle extremely large datasets with strict latency requirements.

pgvector is a PostgreSQL extension that adds vector similarity search to an existing PostgreSQL database. For teams already running PostgreSQL, pgvector avoids introducing a new database into the stack. It is less performant than purpose-built vector databases at scale but often "good enough" for applications with fewer than 10 million vectors.

Business Insight: The build-vs-buy decision applies to vector databases just as it applies to any infrastructure. For prototyping and internal tools, ChromaDB or pgvector may be sufficient. For production customer-facing applications at scale, a managed service like Pinecone reduces operational burden. Evaluate based on your expected data volume, latency requirements, team capabilities, and budget. We will revisit infrastructure decisions in Chapter 23 (Cloud AI Services and APIs).

Indexing Strategies

Vector databases use specialized data structures to enable fast approximate nearest neighbor search. The most common approaches:

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node is connected to its nearest neighbors. Search starts at the top layer (coarse) and drills down to lower layers (fine), efficiently navigating toward the nearest neighbors. HNSW offers excellent recall (accuracy) and query speed, making it the default choice for most applications.

IVF (Inverted File Index) partitions the vector space into clusters using k-means clustering. At query time, only the nearest clusters are searched, reducing the search space dramatically. IVF is faster to build than HNSW but typically offers slightly lower recall.

Flat Index (brute force) compares the query vector against every vector in the database. This guarantees perfect recall but scales poorly — acceptable for small datasets (under 100,000 vectors), impractical for larger ones.

For most business applications, HNSW is the recommended default. It provides the best balance of recall, speed, and memory usage.
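To make the baseline concrete, here is a brute-force flat index in pure Python. The two-dimensional "embeddings" and document ids are invented for illustration; ANN indexes like HNSW approximate exactly this computation at a fraction of the cost for large collections:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def flat_search(query_vec, index, k=2):
    """Flat (brute-force) index: score every stored vector against the
    query and return the top-k doc ids. Exact, but O(N) per query."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-d "embeddings" (hypothetical data, not a real embedding model)
index = {
    "returns_policy":  [0.9, 0.1],
    "shipping_faq":    [0.7, 0.6],
    "revenue_report":  [0.0, 1.0],
}
print(flat_search([1.0, 0.0], index, k=2))  # ['returns_policy', 'shipping_faq']
```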


Chunking Strategies: How You Split Documents Matters

Chunking — the process of splitting documents into smaller pieces before embedding — is arguably the most important and most underappreciated stage of the RAG pipeline. Tom discovered this firsthand when building Athena's knowledge base.

"I spent two days trying different embedding models," Tom told the class. "Switched from MiniLM to OpenAI's text-embedding-3-large. Retrieval quality improved maybe 3 percent. Then I spent an afternoon redesigning my chunking strategy — from fixed-size 500-character chunks to recursive splitting with semantic boundaries — and retrieval quality improved 25 percent. Chunking matters more than model choice."

Why Chunking Matters

Language models have context windows — maximum input lengths measured in tokens. Even with models supporting 128K or 200K token windows, you cannot dump an entire 5,000-document knowledge base into a single prompt. You must retrieve only the most relevant pieces. And the quality of those pieces depends entirely on how you chunked the original documents.

The chunking dilemma is this:

  • Too large: A 2,000-word chunk about returns, shipping, and warranty policies will be retrieved for any query about any of those topics, diluting relevance with irrelevant content. The LLM receives a lot of text but must work harder to find the specific answer.
  • Too small: A 50-word chunk might contain a sentence fragment that is impossible to understand without its surrounding context. "45 days from the date of purchase" means nothing without knowing what has a 45-day window.
  • Just right: A chunk that is small enough to be topically focused but large enough to be self-contained — typically 200-500 words for most enterprise documents.

Common Chunking Strategies

Fixed-Size Chunking splits documents into chunks of a predetermined character or token count (e.g., every 500 characters). This is the simplest approach and works reasonably well for uniformly structured documents. Its weakness is that it splits without regard for semantic boundaries — a paragraph about return policies might be cut in half, with the return window in one chunk and the eligibility conditions in another.

Recursive Character Splitting is the most popular strategy for general-purpose RAG. It attempts to split on natural boundaries in a priority order: first on double newlines (paragraph breaks), then single newlines, then sentences, then words. This preserves semantic coherence — a paragraph stays together if it fits within the chunk size limit. The LangChain library's RecursiveCharacterTextSplitter is the canonical implementation.

Semantic Chunking uses an embedding model to detect topic shifts within a document and places chunk boundaries at those transition points. This produces chunks that are thematically cohesive, regardless of length. The trade-off is computational cost — you must embed every sentence to detect topic boundaries — and inconsistent chunk sizes.

Document-Aware Chunking uses the document's inherent structure — headings, sections, bullet points, tables — to define chunk boundaries. A well-structured policy document with H2 headings for each policy area is naturally pre-chunked. This approach requires parsing the document format (HTML, Markdown, PDF) and understanding its structure.

Chunk Overlap

Regardless of chunking strategy, chunk overlap is critical. Overlap means that the last N characters of chunk K appear as the first N characters of chunk K+1. This ensures that information spanning a chunk boundary is captured in at least one complete chunk.

A typical overlap of 10-20 percent of chunk size works well. For 500-character chunks, an overlap of 50-100 characters ensures continuity across boundaries.
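A fixed-size chunker with overlap is only a few lines of Python. The sliding-window step below is one simple character-based implementation, shown for illustration:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap: each chunk starts
    (chunk_size - overlap) characters after the previous one, so the
    last `overlap` characters of chunk K reappear at the start of K+1."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A synthetic 1,200-character "document" so the boundaries are easy to check
doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(doc, chunk_size=500, overlap=100)
print(len(chunks))  # 3 chunks: starts at 0, 400, 800
```

Production splitters (such as LangChain's RecursiveCharacterTextSplitter) layer boundary awareness on top of this basic window, preferring to cut at paragraph and sentence breaks rather than at an arbitrary character offset.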

Try It: Take a 2,000-word document (a company policy, a product FAQ, a research report) and chunk it three ways: (1) fixed 500-character chunks with no overlap, (2) fixed 500-character chunks with 100-character overlap, and (3) recursive splitting on paragraph boundaries. For each approach, ask yourself: could a reader understand each chunk in isolation? Does each chunk contain a complete thought? The approach that produces the most self-contained chunks will produce the best retrieval results.

Metadata: The Secret Weapon

Each chunk should carry metadata — additional information beyond the text itself. Metadata enables filtered retrieval (search only policy documents, not marketing materials), temporal awareness (prefer recent documents), and source attribution (cite the specific document and section).

Common metadata fields include:

  • Source document: file name, URL, document title
  • Section: heading, page number, section ID
  • Date: creation date, last modified date, effective date
  • Category: policy, FAQ, product guide, compliance
  • Version: document version number
  • Author / owner: who is responsible for this content
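In code, metadata is simply stored alongside each chunk and used to narrow the search space before (or after) similarity scoring. The field names and documents below mirror the list above but are illustrative, not a fixed schema:

```python
# Each chunk record carries its text plus metadata for filtering and citation.
chunks = [
    {"text": "Holiday purchases: 45-day return window.",
     "source": "return_policy_v4.2.pdf", "category": "policy",
     "last_reviewed": "2025-11-01"},
    {"text": "Free shipping on orders over $50!",
     "source": "spring_promo.html", "category": "marketing",
     "last_reviewed": "2024-01-15"},
]

def filter_chunks(chunks, category):
    """Restrict retrieval to one category before similarity search runs."""
    return [c for c in chunks if c["category"] == category]

policy_only = filter_chunks(chunks, "policy")
print([c["source"] for c in policy_only])  # ['return_policy_v4.2.pdf']
```

Most vector databases support this kind of metadata filtering natively, so the filter executes inside the index rather than in application code.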

Athena Update: When Ravi's team built Athena's knowledge base, they included a last_reviewed date in every chunk's metadata. This allowed the system to flag responses based on documents that had not been reviewed in more than 90 days — a simple but effective governance mechanism that addressed the stale document problem before it became a customer-facing issue.


Retrieval Strategies: Finding the Right Documents

With documents chunked, embedded, and stored, the next challenge is retrieval: given a user query, find the chunks most likely to contain the answer. Retrieval quality is the single largest determinant of RAG system quality. A perfect LLM cannot generate a correct answer from irrelevant context.

Similarity Search (Dense Retrieval)

The default retrieval strategy in most RAG systems is vector similarity search: embed the query, find the K nearest vectors in the database, return the corresponding chunks. This is semantic search — it finds documents that are about the same thing as the query, even if they use different words.

Strengths: Handles synonyms, paraphrases, and conceptual queries naturally. A query about "sending items back" will retrieve chunks about "return policies" even though the words differ.

Weaknesses: Can struggle with exact keyword matching. If a user asks about "SKU-4892A," a semantic search might retrieve chunks about product returns in general rather than the specific product, because the embedding captures the general topic but not the specific identifier.

Keyword Search (Sparse Retrieval)

Traditional keyword search — BM25, TF-IDF — matches queries to documents based on shared terms. It excels at exact matching: product IDs, legal citations, technical codes, proper nouns.

Strengths: Precise for exact terms, fast, well-understood.

Weaknesses: No understanding of meaning. "Return policy" and "sending merchandise back" are completely different queries to a keyword system.

Hybrid Search: The Best of Both Worlds

Hybrid search combines semantic (dense) and keyword (sparse) retrieval, typically by running both searches in parallel and merging the results. This approach captures both the meaning-based matches from embeddings and the precision of keyword matching.

The most common fusion strategy is Reciprocal Rank Fusion (RRF): for each document, compute a score based on its rank in each search result list, then sort by the combined score. Documents that appear high in both lists are ranked highest.

$$\text{RRF}(d) = \sum_{i} \frac{1}{k + \text{rank}_i(d)}$$

where $k$ is a constant (typically 60) and $\text{rank}_i(d)$ is document $d$'s position in result list $i$.
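RRF is straightforward to implement. Here is a minimal sketch, with made-up chunk ids and two hypothetical result lists (one from dense retrieval, one from keyword search):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over the lists it appears in (rank is 1-based). Documents ranked
    high in multiple lists rise to the top of the fused ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_A", "chunk_B", "chunk_C"]   # dense retrieval ranking
keyword  = ["chunk_B", "chunk_D", "chunk_A"]   # BM25 ranking
fused = rrf_fuse([semantic, keyword])
print(fused)  # ['chunk_B', 'chunk_A', 'chunk_D', 'chunk_C']
```

chunk_B wins because it appears near the top of both lists, even though neither list ranked it first; that is the behavior RRF is designed to reward.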

Business Insight: For enterprise RAG systems, hybrid search is almost always superior to pure vector search. Business documents contain a mix of natural language (where semantic search excels) and specific identifiers — product codes, policy numbers, employee IDs, legal citations (where keyword search excels). Hybrid search handles both gracefully.

Re-Ranking

Initial retrieval — whether semantic, keyword, or hybrid — returns a candidate set of chunks. Re-ranking applies a more sophisticated (and more expensive) model to re-order those candidates by relevance. The initial retrieval casts a wide net; re-ranking applies fine-grained judgment.

Re-ranking models like Cohere Rerank or cross-encoder models from the Sentence Transformers library score each (query, chunk) pair independently, producing a more accurate relevance ranking than the initial embedding similarity alone.

The trade-off is latency: re-ranking adds 50-200ms per query. For applications where accuracy is more important than speed (internal knowledge bases, compliance tools), re-ranking is well worth the cost. For high-throughput, low-latency applications, it may not be.

Multi-Query Retrieval

A single user query may be ambiguous or under-specified. Multi-query retrieval addresses this by generating multiple reformulations of the original query and retrieving documents for each variation. The results are merged and deduplicated.

For example, the query "What's Athena's holiday policy?" might be reformulated as:

  1. "Athena Retail Group holiday return policy"
  2. "How long do customers have to return holiday purchases?"
  3. "Holiday season purchase return window Athena"

Each reformulation retrieves slightly different chunks, and the union provides more comprehensive coverage than any single query.
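Merging and deduplicating the per-reformulation results can be as simple as a first-occurrence union. The chunk ids below are illustrative:

```python
def merge_results(result_lists):
    """Union the retrievals from each query reformulation, keeping the
    first occurrence of each chunk id (order of first appearance)."""
    seen, merged = set(), []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

per_query = [
    ["policy_45day", "holiday_faq"],       # reformulation 1
    ["holiday_faq", "returns_overview"],   # reformulation 2
    ["policy_45day", "promo_terms"],       # reformulation 3
]
merged = merge_results(per_query)
print(merged)
# ['policy_45day', 'holiday_faq', 'returns_overview', 'promo_terms']
```

Production systems often follow this union with RRF or a re-ranker so the merged set is ordered by relevance rather than by which reformulation happened to run first.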


AI Agents: Models That Act

RAG solves the knowledge grounding problem. But real-world business workflows often require more than question-answering. They require action — looking up data, making calculations, calling APIs, sending notifications, updating databases. This is the domain of AI agents.

Definition: An AI agent is a system that uses a language model as its reasoning engine to plan and execute multi-step tasks. Unlike a chatbot (which generates text in response to a prompt), an agent can observe its environment, reason about what actions to take, execute those actions using tools, observe the results, and decide what to do next.

The Agent Loop

An agent operates in a loop:

  1. Observe: Receive a task or query from the user
  2. Think: Reason about what steps are needed (often using chain-of-thought, as covered in Chapter 20)
  3. Act: Select and execute a tool (API call, database query, calculation, web search)
  4. Observe: Examine the tool's output
  5. Repeat: Decide whether the task is complete or additional steps are needed
  6. Respond: Deliver the final answer to the user

This observe-think-act loop — sometimes called the ReAct pattern (Reasoning + Acting) — allows agents to handle tasks that no single prompt could solve. Consider a business scenario: "What was our top-selling product last quarter, and how does its margin compare to the category average?" An agent would:

  1. Query the sales database for last quarter's top product
  2. Query the margin database for that product's margin
  3. Query the margin database for the category average
  4. Calculate the comparison
  5. Generate a natural language summary

No single LLM prompt can do this. But an agent equipped with database query tools and a calculator can.
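A heavily simplified sketch of that scenario: the tools are stubbed with invented names and figures, and the plan is hard-coded, whereas a real agent would let the LLM choose each step from the tool descriptions.

```python
# Stub tools standing in for real database queries (hypothetical data).
TOOLS = {
    "top_product":     lambda quarter: "TrailRunner Jacket",
    "product_margin":  lambda product: 0.42,
    "category_margin": lambda category: 0.35,
}

def run_agent(task: str) -> str:
    product = TOOLS["top_product"]("Q4")         # act: query sales DB
    margin = TOOLS["product_margin"](product)    # act: query margin DB
    avg = TOOLS["category_margin"]("outerwear")  # act: query margin DB
    diff = margin - avg                          # think: compute comparison
    return (f"{product} was the top seller; its margin ({margin:.0%}) "
            f"is {diff:.0%} above the category average ({avg:.0%}).")

answer = run_agent("Compare top product margin to category average")
print(answer)
```

Even in this toy form, the structure is visible: each tool call produces an observation that feeds the next step, and only the final step generates language for the user.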

Agents in Business Context

The agent paradigm is transforming enterprise software. Here are real patterns emerging in 2025-2026:

Customer Service Agents handle multi-step customer requests: checking order status (API call), looking up return eligibility (database query), initiating a return (transaction), and sending confirmation (email API). These go beyond chatbots by actually doing things, not just saying things.

Research Agents gather and synthesize information from multiple sources: querying internal databases, searching the web, reading documents, and producing structured reports. A financial analyst agent might pull quarterly earnings data, compare it to analyst estimates, summarize relevant news, and draft a briefing memo.

Workflow Automation Agents orchestrate multi-step business processes: receiving an expense report, extracting line items, checking against policy, flagging exceptions, routing for approval, and updating the accounting system.

Caution

Agent autonomy is a spectrum, not a switch. Today's production agents typically operate with significant guardrails: predefined tool sets, confirmation steps for high-stakes actions, fallback to human handoff when confidence is low. Fully autonomous agents that make consequential decisions without human oversight remain rare in enterprise settings — and for good reason. The accountability question ("Who is responsible when the agent makes a mistake?") is unresolved. We will explore this in Chapter 27 (AI Governance Frameworks).


Tool Use and Function Calling

Agents are only as capable as the tools they can access. Tool use (also called function calling) is the mechanism by which language models interact with external systems.

How Function Calling Works

Modern LLMs from OpenAI, Anthropic, Google, and others support structured function calling. Instead of generating free-form text, the model generates a structured request to call a specific function with specific arguments. The application executes the function and returns the result to the model, which incorporates it into its response.

Here is the conceptual flow:

  1. Developer defines available tools with names, descriptions, and parameter schemas
  2. User sends a query ("What's the return status of order #78432?")
  3. Model decides to call a tool rather than generating text
  4. Model outputs a structured function call: check_order_status(order_id="78432")
  5. Application executes the function against the real system
  6. Result is returned to the model: {"status": "Returned", "refund_amount": 89.99, "date": "2026-02-28"}
  7. Model generates a natural language response incorporating the result

# Example: Defining tools for an LLM agent
tools = [
    {
        "name": "check_order_status",
        "description": "Look up the current status of a customer order by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The unique order identifier (e.g., '78432')"
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "get_return_policy",
        "description": "Retrieve the return policy for a specific product category.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "description": "Product category (e.g., 'apparel', 'electronics')"
                },
                "purchase_type": {
                    "type": "string",
                    "enum": ["standard", "holiday", "clearance"],
                    "description": "Type of purchase"
                }
            },
            "required": ["category"]
        }
    },
    {
        "name": "calculate_refund",
        "description": "Calculate the refund amount for a return, applying any applicable fees.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "reason": {
                    "type": "string",
                    "enum": ["defective", "wrong_item", "changed_mind", "other"]
                }
            },
            "required": ["order_id", "reason"]
        }
    }
]

Code Explanation: This Python dictionary defines three tools that an LLM agent can call. Each tool has a name, a human-readable description (which the model uses to decide when to call the tool), and a JSON Schema defining its parameters. The model reads these definitions and autonomously decides which tool to call based on the user's query. Note that the developer defines the interface; the actual implementation of each function (the code that queries the database, applies business logic, etc.) is separate.
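
The application side of step 5 — executing the model's structured call — can be sketched as a simple dispatch table. The check_order_status implementation and its in-memory order data below are hypothetical stand-ins for a real database query:

```python
import json

# Hypothetical implementation behind the tool interface defined above.
def check_order_status(order_id):
    orders = {"78432": {"status": "Returned", "refund_amount": 89.99,
                        "date": "2026-02-28"}}
    return orders.get(order_id, {"error": "Order not found"})

# Map tool names (from the schema) to their implementations.
TOOL_REGISTRY = {"check_order_status": check_order_status}

def execute_tool_call(name, arguments):
    """Dispatch a model-generated function call to real code.

    Most provider APIs deliver `arguments` as a JSON string.
    """
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown tool: {name}"})
    result = func(**json.loads(arguments))
    return json.dumps(result)  # Sent back to the model as the tool result

print(execute_tool_call("check_order_status", '{"order_id": "78432"}'))
# → {"status": "Returned", "refund_amount": 89.99, "date": "2026-02-28"}
```

Note that the unknown-tool and unknown-order cases both return structured errors rather than raising — the model can then tell the user what went wrong.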

Tool Design Principles

Effective tool design follows principles that will feel familiar from API design and software engineering:

  1. Single responsibility. Each tool does one thing well. check_order_status checks status; calculate_refund calculates refunds. Don't build a mega-tool that does everything.

  2. Clear descriptions. The model relies on the tool description to decide when to use it. Vague descriptions lead to incorrect tool selection. "Look up the current status of a customer order by order ID" is clear. "Handle order stuff" is not.

  3. Constrained parameters. Use enums, required fields, and type constraints to limit what the model can send. This reduces errors and improves reliability.

  4. Safe by default. Read operations (checking status, looking up policies) should be freely available. Write operations (initiating refunds, updating records) should require confirmation or operate in a sandbox.

  5. Graceful error handling. Tools should return informative error messages, not crash. If an order ID is invalid, the tool should return {"error": "Order not found"} so the model can inform the user, not throw an unhandled exception.
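
Principles 3 and 5 can be enforced mechanically: validate the model's arguments against the parameter schema before touching any real system, and return errors as data rather than raising. A minimal sketch (it checks required fields and enums only, not types), reusing the calculate_refund schema shape from above:

```python
def validate_arguments(schema, arguments):
    """Return None if arguments satisfy the schema, else an error dict."""
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in arguments:
            return {"error": f"Missing required parameter: {name}"}
    for name, value in arguments.items():
        if name not in props:
            return {"error": f"Unexpected parameter: {name}"}
        allowed = props[name].get("enum")
        if allowed and value not in allowed:
            return {"error": f"{name} must be one of {allowed}"}
    return None  # Arguments are acceptable

refund_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string",
                   "enum": ["defective", "wrong_item", "changed_mind", "other"]},
    },
    "required": ["order_id", "reason"],
}

# An out-of-enum value is rejected before any refund logic runs.
print(validate_arguments(refund_schema, {"order_id": "78432", "reason": "expired"}))
```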


Workflow Orchestration: Frameworks for Building AI Systems

Building a RAG pipeline or an agent system from scratch requires connecting many components: document loaders, text splitters, embedding models, vector databases, LLMs, prompt templates, output parsers, and evaluation tools. Orchestration frameworks provide pre-built abstractions for these components, reducing boilerplate and accelerating development.

LangChain

LangChain, created by Harrison Chase in late 2022, is the most widely adopted orchestration framework for LLM applications. It provides:

  • Document loaders for PDFs, web pages, databases, APIs, and dozens of other sources
  • Text splitters implementing various chunking strategies
  • Embedding wrappers for OpenAI, Cohere, HuggingFace, and other providers
  • Vector store integrations for ChromaDB, Pinecone, Weaviate, Milvus, and others
  • Chain abstractions for composing multi-step LLM workflows
  • Agent frameworks with tool-use capabilities
  • LangSmith for monitoring, tracing, and evaluating LLM applications

LangChain's strength is breadth: it supports virtually every LLM, embedding model, and vector database on the market. Its weakness, historically, has been abstraction instability — the API changed frequently in its early days, and some abstractions added complexity without proportionate value.

LlamaIndex

LlamaIndex (formerly GPT Index) is an orchestration framework specifically focused on connecting LLMs to data sources. While LangChain is a general-purpose framework that includes RAG as one of many capabilities, LlamaIndex is purpose-built for RAG and excels at:

  • Advanced indexing strategies: tree indices, keyword table indices, knowledge graph indices
  • Query routing: automatically selecting the best retrieval strategy based on query type
  • Response synthesis: sophisticated methods for combining information from multiple retrieved chunks
  • Structured data integration: querying SQL databases, pandas DataFrames, and APIs alongside unstructured text

When to Use a Framework vs. Build Custom

Business Insight: The framework vs. custom decision mirrors the broader build-vs-buy theme of this textbook. Use a framework when: (1) you are prototyping and need speed, (2) your use case fits common patterns, (3) your team values ecosystem compatibility. Build custom when: (1) you need maximum control over every pipeline stage, (2) framework abstractions add latency or complexity you cannot afford, (3) your use case is sufficiently unique that framework conventions hinder rather than help. Many production systems start with a framework for prototyping and gradually replace framework components with custom code as requirements become clearer.


Evaluation for RAG: Measuring What Matters

RAG systems introduce evaluation challenges beyond traditional ML metrics. A RAG response depends on retrieval quality and generation quality — both must be measured.

The RAG Evaluation Framework

Evaluation happens at three levels:

1. Retrieval Quality — Did we find the right documents?

  • Precision@K: Of the K chunks retrieved, how many were actually relevant?
  • Recall@K: Of all relevant chunks in the knowledge base, how many were retrieved?
  • Mean Reciprocal Rank (MRR): How high in the result list does the first relevant chunk appear?
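
These three retrieval metrics are straightforward to compute from ranked result IDs. A minimal sketch, where `retrieved` is the ordered list of chunk IDs a search returned and `relevant` is a human-labeled ground-truth set (both illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-K retrieved chunks that are relevant."""
    top = retrieved[:k]
    return sum(1 for c in top if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top K."""
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for c in top if c in relevant) / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average 1/rank of the first relevant result across queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(precision_at_k(retrieved, relevant, 5))         # 2 of 5 retrieved → 0.4
print(recall_at_k(retrieved, relevant, 5))            # 2 of 3 relevant found
print(mean_reciprocal_rank([retrieved], [relevant]))  # first hit at rank 2 → 0.5
```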

2. Generation Quality — Did the model answer correctly from the context?

  • Faithfulness: Does the generated answer accurately reflect the retrieved context? (Did the model stay grounded, or did it add information not in the retrieved chunks?)
  • Answer relevance: Does the generated answer actually address the user's question?
  • Completeness: Does the answer include all the important information from the retrieved context?

3. End-to-End Quality — Is the user's question answered correctly?

  • Correctness: Is the final answer factually correct? (Requires ground-truth labels)
  • Helpfulness: Would a human evaluator rate the answer as useful?
  • Citation accuracy: Do the cited sources actually support the claims in the answer?

Definition: Faithfulness in RAG evaluation measures whether the generated answer is supported by the retrieved context. A faithful answer only makes claims that are present in the retrieved documents. An unfaithful answer goes beyond the context — adding information from the model's training data (which may be incorrect) or hallucinating details not present in any source.
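
To make the definition concrete, here is a deliberately naive lexical faithfulness check: the fraction of answer sentences whose content words all appear in the retrieved context. Production frameworks use an LLM to verify each claim; this heuristic only illustrates the idea, applied to the chapter's 45-day-vs-60-day policy example:

```python
import re

def naive_faithfulness(answer, context):
    """Fraction of answer sentences fully covered by context vocabulary.

    A crude lexical proxy for faithfulness — real evaluators verify
    claims semantically, not word by word.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if words and words <= context_words:
            supported += 1
    return supported / len(sentences)

context = ("Holiday promotional purchases are eligible for return "
           "within 45 days of purchase date.")
print(naive_faithfulness(
    "Holiday purchases are eligible for return within 45 days.", context))  # → 1.0
print(naive_faithfulness(
    "Holiday purchases are eligible for return within 60 days.", context))  # → 0.0
```

The unfaithful answer — the chatbot's invented 60-day window — fails the check because "60" appears nowhere in the retrieved policy text.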

RAGAS: A Framework for RAG Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework that automates RAG evaluation using LLM-as-a-judge techniques. It computes metrics including:

  • Faithfulness: Using an LLM to check whether each claim in the answer is supported by the context
  • Answer relevance: Using an LLM to assess whether the answer addresses the question
  • Context precision: Whether the retrieved chunks are relevant to the question
  • Context recall: Whether the retrieved context covers all aspects of the ground-truth answer

RAGAS is valuable because it enables automated evaluation at scale — you can evaluate thousands of query-response pairs without manual review.

Caution

LLM-as-a-judge evaluation has known limitations. The evaluating LLM may have its own biases, may not catch subtle errors, and may not agree with human judges in edge cases. Use automated evaluation for rapid iteration and broad quality monitoring, but validate critical findings with human review.


Building Athena's RAG System: A Complete Implementation

Now let us build what Ravi's team built — a RAG pipeline for Athena's customer service knowledge base. We will implement the complete pipeline step by step: document loading, chunking, embedding, storage, retrieval, generation, and evaluation.

Athena Update: Athena's customer service team handles approximately 15,000 inquiries per week across phone, email, and chat. Agents currently search a SharePoint site with 5,000+ policy documents to find answers — a process that averages 4.5 minutes per lookup. Ravi's goal: build a RAG-powered "policy co-pilot" that retrieves the relevant policy section in under 3 seconds and drafts a response the agent can review and send. The target: reduce average handle time by 30 percent while improving accuracy to 90 percent on policy questions.

Step 1: Pipeline Setup and Data Models

"""
Athena Policy Knowledge Base — RAG Pipeline
============================================
A complete Retrieval-Augmented Generation system for
Athena Retail Group's customer service knowledge base.

This pipeline:
1. Loads and preprocesses policy documents
2. Chunks documents with recursive splitting
3. Generates embeddings with sentence-transformers
4. Stores embeddings in ChromaDB
5. Retrieves relevant context for user queries
6. Generates grounded answers with source citations
7. Evaluates response quality

Requirements:
    pip install chromadb sentence-transformers openai tiktoken
"""

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional
import hashlib
import json
import re
import textwrap


# ── Data Models ──────────────────────────────────────────────────

@dataclass
class Document:
    """Represents a source document in Athena's knowledge base."""
    content: str
    metadata: dict = field(default_factory=dict)
    doc_id: str = ""

    def __post_init__(self):
        if not self.doc_id:
            self.doc_id = hashlib.md5(
                self.content[:500].encode()
            ).hexdigest()


@dataclass
class Chunk:
    """A chunk of text derived from a source document."""
    content: str
    metadata: dict = field(default_factory=dict)
    chunk_id: str = ""
    embedding: list = field(default_factory=list)

    def __post_init__(self):
        if not self.chunk_id:
            self.chunk_id = hashlib.md5(
                self.content.encode()
            ).hexdigest()[:12]


@dataclass
class RetrievalResult:
    """A retrieved chunk with its similarity score."""
    chunk: Chunk
    score: float
    rank: int


@dataclass
class RAGResponse:
    """The final response from the RAG pipeline."""
    answer: str
    sources: list = field(default_factory=list)
    query: str = ""
    retrieval_scores: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

Code Explanation: We define four data classes that model the RAG pipeline's data flow. Document represents a raw source document with its metadata. Chunk represents a smaller piece derived from a document — the unit of storage and retrieval. RetrievalResult pairs a chunk with its similarity score and rank. RAGResponse bundles the final answer with its source citations and diagnostic information. These clean data structures make the pipeline's flow explicit and testable.
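
A side benefit of the content-hash IDs in __post_init__: re-ingesting identical content produces the same ID, so a store keyed by ID can deduplicate for free. A standalone illustration of that property, mirroring the Document hashing logic:

```python
import hashlib

def doc_id(content):
    """Content-derived ID, as in Document.__post_init__ above."""
    return hashlib.md5(content[:500].encode()).hexdigest()

a = doc_id("Holiday promotional purchases are eligible for return within 45 days.")
b = doc_id("Holiday promotional purchases are eligible for return within 45 days.")
c = doc_id("Standard return policies (30 days) apply to all non-promotional items.")
print(a == b, a == c)  # → True False
```

One caveat: because only the first 500 characters are hashed, two long documents with identical openings would collide — acceptable for a prototype, but worth hashing the full content in production.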

Step 2: Document Preprocessing

class DocumentPreprocessor:
    """
    Cleans and normalizes documents before chunking.

    Handles common issues in enterprise document collections:
    whitespace inconsistencies, encoding artifacts, boilerplate
    headers/footers, and metadata extraction from structured text.
    """

    @staticmethod
    def clean_text(text: str) -> str:
        """Normalize whitespace and remove common artifacts."""
        # Normalize unicode characters
        text = text.replace("\u2019", "'").replace("\u2018", "'")
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        text = text.replace("\u2013", "-").replace("\u2014", "--")

        # Collapse multiple blank lines into two newlines
        text = re.sub(r"\n{3,}", "\n\n", text)

        # Remove trailing whitespace per line
        text = "\n".join(line.rstrip() for line in text.split("\n"))

        # Strip leading/trailing whitespace
        text = text.strip()

        return text

    @staticmethod
    def extract_title(text: str) -> str:
        """Extract document title from first heading or line."""
        lines = text.strip().split("\n")
        for line in lines:
            line = line.strip()
            if line.startswith("# "):
                return line.lstrip("# ").strip()
            if line and not line.startswith("---"):
                return line[:100]
        return "Untitled"

    @staticmethod
    def estimate_reading_level(text: str) -> str:
        """Simple heuristic for document complexity."""
        words = text.split()
        if not words:
            return "unknown"
        avg_word_length = sum(len(w) for w in words) / len(words)
        if avg_word_length > 6:
            return "technical"
        elif avg_word_length > 5:
            return "professional"
        else:
            return "general"

    def preprocess(self, doc: Document) -> Document:
        """Apply all preprocessing steps to a document."""
        cleaned_content = self.clean_text(doc.content)
        title = self.extract_title(cleaned_content)

        doc.content = cleaned_content
        doc.metadata["title"] = title
        doc.metadata["word_count"] = len(cleaned_content.split())
        doc.metadata["reading_level"] = self.estimate_reading_level(
            cleaned_content
        )
        doc.metadata["preprocessed_at"] = datetime.now().isoformat()

        return doc

Step 3: Chunking with Recursive Character Splitting

class RecursiveChunker:
    """
    Splits documents into chunks using recursive character splitting.

    Attempts to split on natural boundaries (paragraphs, then
    sentences, then words) while respecting a maximum chunk size.
    Includes configurable overlap to prevent information loss
    at chunk boundaries.

    This is the chunking strategy Tom found most effective for
    Athena's policy documents — a mix of structured headings,
    paragraph prose, and bulleted lists.
    """

    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 100,
        separators: list = None,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or [
            "\n\n",   # Paragraph breaks (highest priority)
            "\n",     # Line breaks
            ". ",     # Sentence endings
            ", ",     # Clause boundaries
            " ",      # Word boundaries (last resort)
        ]

    def _split_text(
        self, text: str, separators: list
    ) -> list[str]:
        """Recursively split text using a hierarchy of separators."""
        if not separators:
            # Base case: no separators left, split by character
            return [text[i:i + self.chunk_size]
                    for i in range(0, len(text), self.chunk_size)]

        separator = separators[0]
        remaining_separators = separators[1:]

        # Split on the current separator
        splits = text.split(separator)

        chunks = []
        current_chunk = ""

        for split in splits:
            # If adding this split would exceed chunk_size
            candidate = (
                current_chunk + separator + split
                if current_chunk else split
            )

            if len(candidate) <= self.chunk_size:
                current_chunk = candidate
            else:
                # Save current chunk if it has content
                if current_chunk:
                    chunks.append(current_chunk.strip())

                # If this individual split exceeds chunk_size,
                # recurse with finer separators
                if len(split) > self.chunk_size:
                    sub_chunks = self._split_text(
                        split, remaining_separators
                    )
                    chunks.extend(sub_chunks)
                    current_chunk = ""
                else:
                    current_chunk = split

        # Don't forget the last chunk
        if current_chunk:
            chunks.append(current_chunk.strip())

        return chunks

    def _add_overlap(self, chunks: list[str]) -> list[str]:
        """Add overlap between consecutive chunks."""
        if self.chunk_overlap == 0 or len(chunks) <= 1:
            return chunks

        overlapped = [chunks[0]]

        for i in range(1, len(chunks)):
            # Take the last `chunk_overlap` characters from
            # the previous chunk as a prefix
            prev_text = chunks[i - 1]
            overlap_text = prev_text[-self.chunk_overlap:]

            # Find a clean break point (start of a word)
            space_idx = overlap_text.find(" ")
            if space_idx != -1:
                overlap_text = overlap_text[space_idx + 1:]

            overlapped.append(overlap_text + " " + chunks[i])

        return overlapped

    def chunk_document(self, doc: Document) -> list[Chunk]:
        """Split a document into overlapping chunks with metadata."""
        raw_chunks = self._split_text(doc.content, self.separators)
        overlapped_chunks = self._add_overlap(raw_chunks)

        chunks = []
        for i, text in enumerate(overlapped_chunks):
            if not text.strip():
                continue

            chunk_metadata = {
                **doc.metadata,
                "chunk_index": i,
                "total_chunks": len(overlapped_chunks),
                "source_doc_id": doc.doc_id,
                "chunk_char_count": len(text),
            }

            chunks.append(Chunk(
                content=text.strip(),
                metadata=chunk_metadata,
            ))

        return chunks

Code Explanation: The RecursiveChunker implements the recursive character splitting strategy Tom found most effective. It tries to split on paragraph breaks first, preserving the most natural document structure. If a section is still too large, it falls back to line breaks, then sentences, then words. The _add_overlap method ensures that information spanning chunk boundaries appears in at least one complete chunk. The metadata propagation from document to chunk ensures that every chunk carries its source attribution — essential for citation in the final response.

Step 4: Embedding and Vector Store

class SimpleEmbedder:
    """
    Generates embeddings for text using sentence-transformers.

    In production, you would use a more sophisticated embedding
    model (OpenAI text-embedding-3-large, Cohere embed-v3, etc.).
    This implementation uses the open-source all-MiniLM-L6-v2
    model, which runs locally without API keys — ideal for
    prototyping and learning.

    For the MBA classroom, we also provide a mock embedder
    that simulates embedding behavior without requiring
    model downloads.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model_name = model_name
        self._model = None

    def _load_model(self):
        """Lazy-load the embedding model."""
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer
                self._model = SentenceTransformer(self.model_name)
            except ImportError:
                print(
                    "sentence-transformers not installed. "
                    "Using mock embeddings for demonstration."
                )
                self._model = "mock"

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        """Generate embeddings for a list of texts."""
        self._load_model()

        if self._model == "mock":
            return self._mock_embed(texts)

        embeddings = self._model.encode(
            texts,
            show_progress_bar=False,
            normalize_embeddings=True,
        )
        return embeddings.tolist()

    def embed_query(self, query: str) -> list[float]:
        """Generate embedding for a single query."""
        return self.embed_texts([query])[0]

    @staticmethod
    def _mock_embed(texts: list[str]) -> list[list[float]]:
        """
        Generate deterministic mock embeddings for demonstration.
        Uses simple hashing to produce consistent vectors.
        """
        import math

        embeddings = []
        dim = 384  # Match MiniLM dimensions

        for text in texts:
            # Create a deterministic but content-sensitive vector
            hash_val = int(hashlib.md5(text.encode()).hexdigest(), 16)
            vector = []
            for i in range(dim):
                # Generate pseudo-random but deterministic values
                val = math.sin(hash_val * (i + 1)) * 0.5
                vector.append(val)
            # Normalize
            magnitude = math.sqrt(sum(v ** 2 for v in vector))
            if magnitude > 0:
                vector = [v / magnitude for v in vector]
            embeddings.append(vector)

        return embeddings


class VectorStore:
    """
    A simple in-memory vector store for prototyping.

    For production, replace with ChromaDB, Pinecone, or Weaviate.
    This implementation demonstrates the core operations
    (add, search, delete) without external dependencies.
    """

    def __init__(self, embedder: SimpleEmbedder = None):
        self.embedder = embedder or SimpleEmbedder()
        self.chunks: list[Chunk] = []
        self.embeddings: list[list[float]] = []

    def add_chunks(self, chunks: list[Chunk]):
        """Add chunks to the vector store, computing embeddings."""
        texts = [chunk.content for chunk in chunks]
        new_embeddings = self.embedder.embed_texts(texts)

        for chunk, embedding in zip(chunks, new_embeddings):
            chunk.embedding = embedding
            self.chunks.append(chunk)
            self.embeddings.append(embedding)

        print(
            f"Added {len(chunks)} chunks to vector store. "
            f"Total: {len(self.chunks)} chunks."
        )

    def search(
        self,
        query: str,
        top_k: int = 5,
        filter_metadata: dict = None,
    ) -> list[RetrievalResult]:
        """
        Find the most similar chunks to a query.

        Args:
            query: The search query text.
            top_k: Number of results to return.
            filter_metadata: Optional metadata filter
                (e.g., {"category": "returns"}).

        Returns:
            List of RetrievalResult objects sorted by similarity.
        """
        query_embedding = self.embedder.embed_query(query)

        # Compute cosine similarity with all stored chunks
        scores = []
        for i, (chunk, embedding) in enumerate(
            zip(self.chunks, self.embeddings)
        ):
            # Apply metadata filter if specified
            if filter_metadata:
                match = all(
                    chunk.metadata.get(k) == v
                    for k, v in filter_metadata.items()
                )
                if not match:
                    continue

            similarity = self._cosine_similarity(
                query_embedding, embedding
            )
            scores.append((i, similarity))

        # Sort by similarity (descending) and take top_k
        scores.sort(key=lambda x: x[1], reverse=True)
        top_scores = scores[:top_k]

        results = []
        for rank, (idx, score) in enumerate(top_scores):
            results.append(RetrievalResult(
                chunk=self.chunks[idx],
                score=score,
                rank=rank + 1,
            ))

        return results

    @staticmethod
    def _cosine_similarity(
        vec_a: list[float], vec_b: list[float]
    ) -> float:
        """Compute cosine similarity between two vectors."""
        dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
        magnitude_a = sum(a ** 2 for a in vec_a) ** 0.5
        magnitude_b = sum(b ** 2 for b in vec_b) ** 0.5

        if magnitude_a == 0 or magnitude_b == 0:
            return 0.0

        return dot_product / (magnitude_a * magnitude_b)

    def get_stats(self) -> dict:
        """Return statistics about the vector store."""
        if not self.chunks:
            return {"total_chunks": 0}

        categories = {}
        for chunk in self.chunks:
            cat = chunk.metadata.get("category", "unknown")
            categories[cat] = categories.get(cat, 0) + 1

        return {
            "total_chunks": len(self.chunks),
            "categories": categories,
            "avg_chunk_length": sum(
                len(c.content) for c in self.chunks
            ) // len(self.chunks),
        }

Step 5: RAG Prompt Construction and Generation

class RAGPipeline:
    """
    The complete RAG pipeline: retrieve context, construct prompt,
    generate answer, and return response with citations.

    This is the orchestrator that connects all pipeline stages.
    """

    SYSTEM_PROMPT = textwrap.dedent("""\
        You are Athena's Policy Assistant, a helpful and accurate
        customer service tool. Your job is to answer questions
        about Athena Retail Group's policies and procedures.

        IMPORTANT RULES:
        1. Answer ONLY based on the provided context documents.
        2. If the context does not contain enough information to
           answer the question, say "I don't have enough
           information to answer that question" — do NOT guess.
        3. Always cite your sources using [Source: document_title]
           format.
        4. If the context contains conflicting information,
           acknowledge the conflict and cite both sources.
        5. Be concise but complete. Include specific numbers,
           dates, and conditions from the policy documents.
    """)

    QUERY_TEMPLATE = textwrap.dedent("""\
        CONTEXT DOCUMENTS:
        {context}

        ---

        USER QUESTION: {query}

        Provide a clear, accurate answer based solely on the
        context documents above. Cite your sources.
    """)

    def __init__(
        self,
        vector_store: VectorStore,
        top_k: int = 5,
        llm_client=None,
    ):
        self.vector_store = vector_store
        self.top_k = top_k
        self.llm_client = llm_client

    def _format_context(
        self, results: list[RetrievalResult]
    ) -> str:
        """Format retrieved chunks into a context string."""
        context_parts = []
        for result in results:
            title = result.chunk.metadata.get(
                "title", "Unknown Document"
            )
            category = result.chunk.metadata.get(
                "category", "general"
            )
            last_updated = result.chunk.metadata.get(
                "last_updated", "unknown"
            )

            context_parts.append(
                f"[Document: {title}]\n"
                f"[Category: {category} | "
                f"Last Updated: {last_updated} | "
                f"Relevance Score: {result.score:.3f}]\n\n"
                f"{result.chunk.content}"
            )

        return "\n\n---\n\n".join(context_parts)

    def _construct_prompt(
        self, query: str, context: str
    ) -> list[dict]:
        """Build the chat messages for the LLM."""
        return [
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {
                "role": "user",
                "content": self.QUERY_TEMPLATE.format(
                    context=context, query=query
                ),
            },
        ]

    def query(
        self,
        user_query: str,
        filter_metadata: dict = None,
    ) -> RAGResponse:
        """
        Execute the full RAG pipeline for a user query.

        Steps:
        1. Retrieve relevant chunks from the vector store
        2. Format retrieved context
        3. Construct the augmented prompt
        4. Generate response (or return prompt for review)
        5. Package response with source citations

        Args:
            user_query: The user's question.
            filter_metadata: Optional metadata filter for
                retrieval.

        Returns:
            RAGResponse with answer, sources, and diagnostics.
        """
        # Step 1: Retrieve
        results = self.vector_store.search(
            query=user_query,
            top_k=self.top_k,
            filter_metadata=filter_metadata,
        )

        if not results:
            return RAGResponse(
                answer="No relevant documents found for your query.",
                query=user_query,
            )

        # Step 2: Format context
        context = self._format_context(results)

        # Step 3: Construct prompt
        messages = self._construct_prompt(user_query, context)

        # Step 4: Generate response
        if self.llm_client:
            answer = self._call_llm(messages)
        else:
            # When no LLM client is configured, return a
            # structured summary for demonstration purposes
            answer = self._generate_demo_response(
                user_query, results
            )

        # Step 5: Package response
        sources = [
            {
                "title": r.chunk.metadata.get("title", "Unknown"),
                "category": r.chunk.metadata.get(
                    "category", "unknown"
                ),
                "score": round(r.score, 4),
                "chunk_preview": r.chunk.content[:150] + "...",
                "last_updated": r.chunk.metadata.get(
                    "last_updated", "unknown"
                ),
            }
            for r in results
        ]

        return RAGResponse(
            answer=answer,
            sources=sources,
            query=user_query,
            retrieval_scores=[r.score for r in results],
            metadata={
                "num_chunks_retrieved": len(results),
                "top_score": results[0].score if results else 0,
                "prompt_messages": messages,
            },
        )

    def _call_llm(self, messages: list[dict]) -> str:
        """Call the LLM API to generate a response."""
        try:
            response = self.llm_client.chat.completions.create(
                model="gpt-4",
                messages=messages,
                temperature=0.1,  # Low temperature for factual QA
                max_tokens=1000,
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"LLM generation failed: {e}"

    @staticmethod
    def _generate_demo_response(
        query: str, results: list[RetrievalResult]
    ) -> str:
        """Generate a demonstration response without an LLM."""
        top_result = results[0]
        title = top_result.chunk.metadata.get(
            "title", "Unknown Document"
        )

        return (
            f"Based on the retrieved documents, here is the "
            f"relevant information:\n\n"
            f"From '{title}' "
            f"(relevance: {top_result.score:.3f}):\n\n"
            f"{top_result.chunk.content}\n\n"
            f"[Source: {title}]\n\n"
            f"Note: In production, an LLM would synthesize a "
            f"natural language answer from the {len(results)} "
            f"retrieved document chunks."
        )

Code Explanation: The RAGPipeline class is the orchestrator. Its query method executes the full pipeline: retrieve, format context, construct prompt, generate, and package the response. The system prompt is carefully crafted to keep the model grounded — it instructs the model to answer only from the provided context and to say "I don't know" rather than guess. The _format_context method includes metadata (title, category, last updated date, relevance score) in the context, giving the LLM information to assess source quality. When no LLM client is configured, the _generate_demo_response method returns the top retrieved chunk verbatim with its source citation — useful for testing retrieval quality independently of generation quality.

Step 6: Evaluation

class RAGEvaluator:
    """
    Evaluates RAG pipeline quality across retrieval and generation.

    Provides metrics for:
    - Retrieval relevance (do we find the right documents?)
    - Source freshness (are retrieved documents current?)
    - Response coverage (does the answer address the query?)
    """

    @staticmethod
    def evaluate_retrieval(
        results: list[RetrievalResult],
        relevant_doc_ids: list[str] | None = None,
    ) -> dict:
        """
        Evaluate retrieval quality.

        Args:
            results: Retrieved results from the vector store.
            relevant_doc_ids: Known relevant document IDs
                (for precision/recall calculation). Optional.

        Returns:
            Dictionary of retrieval quality metrics.
        """
        if not results:
            return {"error": "No results to evaluate"}

        scores = [r.score for r in results]
        metrics = {
            "num_results": len(results),
            "top_score": round(max(scores), 4),
            "avg_score": round(sum(scores) / len(scores), 4),
            "min_score": round(min(scores), 4),
            "score_spread": round(max(scores) - min(scores), 4),
        }

        # If we have ground truth, compute precision and recall
        if relevant_doc_ids:
            retrieved_ids = [
                r.chunk.metadata.get("source_doc_id", "")
                for r in results
            ]
            relevant_retrieved = [
                rid for rid in retrieved_ids
                if rid in relevant_doc_ids
            ]
            metrics["precision_at_k"] = round(
                len(relevant_retrieved) / len(results), 4
            )
            metrics["recall_at_k"] = round(
                len(relevant_retrieved) / len(relevant_doc_ids), 4
            )

        return metrics

    @staticmethod
    def evaluate_freshness(
        results: list[RetrievalResult],
        staleness_days: int = 90,
    ) -> dict:
        """
        Check if retrieved documents are current.

        Flags documents that haven't been updated within
        the staleness threshold — a critical governance
        metric for Athena's knowledge base.
        """
        now = datetime.now()
        freshness_report = {
            "total_sources": len(results),
            "fresh": 0,
            "stale": 0,
            "unknown": 0,
            "stale_sources": [],
        }

        for result in results:
            last_updated = result.chunk.metadata.get(
                "last_updated"
            )
            if not last_updated:
                freshness_report["unknown"] += 1
                continue

            try:
                update_date = datetime.fromisoformat(last_updated)
                age_days = (now - update_date).days

                if age_days > staleness_days:
                    freshness_report["stale"] += 1
                    freshness_report["stale_sources"].append({
                        "title": result.chunk.metadata.get(
                            "title", "Unknown"
                        ),
                        "last_updated": last_updated,
                        "age_days": age_days,
                    })
                else:
                    freshness_report["fresh"] += 1
            except (ValueError, TypeError):
                freshness_report["unknown"] += 1

        return freshness_report

    @staticmethod
    def evaluate_response_quality(
        response: RAGResponse,
    ) -> dict:
        """
        Basic response quality checks (heuristic-based).

        In production, you would use LLM-as-a-judge or
        the RAGAS framework for more sophisticated evaluation.
        """
        answer = response.answer
        quality = {
            "answer_length": len(answer),
            "has_citation": "[source:" in answer.lower(),
            "has_uncertainty_acknowledgment": any(
                phrase in answer.lower()
                for phrase in [
                    "i don't have enough",
                    "not enough information",
                    "unable to determine",
                    "not clear from",
                ]
            ),
            "num_sources_cited": answer.lower().count("[source:"),
            "num_sources_retrieved": len(response.sources),
        }

        # Quality score: simple heuristic
        score = 0.0
        if quality["has_citation"]:
            score += 0.3
        if quality["answer_length"] > 50:
            score += 0.2
        if quality["answer_length"] < 2000:
            score += 0.1
        if len(response.sources) >= 2:
            score += 0.2
        if (
            response.retrieval_scores
            and response.retrieval_scores[0] > 0.5
        ):
            score += 0.2

        quality["heuristic_quality_score"] = round(score, 2)

        return quality

Step 7: Putting It All Together

def build_athena_knowledge_base() -> RAGPipeline:
    """
    Build Athena's customer service RAG pipeline with
    sample policy documents.

    In production, documents would be loaded from SharePoint,
    Confluence, or a document management system. Here we use
    sample documents that represent typical Athena policies.
    """

    # ── Sample Athena Policy Documents ───────────────────────

    documents = [
        Document(
            content=textwrap.dedent("""\
                # Athena Retail Group — Return Policy

                ## Standard Returns
                Customers may return items within 30 days of
                purchase for a full refund. Items must be in
                original condition with tags attached. A receipt
                or proof of purchase is required.

                ## Holiday Returns
                For items purchased during the holiday promotional
                period (November 15 through December 31), the
                return window is extended to 45 days from the
                date of purchase. This policy applies to both
                in-store and online purchases.

                ## Final Sale Items
                Items marked as Final Sale are not eligible for
                return or exchange. This includes clearance items
                marked with a yellow tag and items purchased with
                a Final Sale promotional code.

                ## Online Returns
                Online purchases may be returned by mail (prepaid
                label provided) or at any Athena retail location.
                Refunds for mailed returns are processed within
                5-7 business days of receipt.

                Effective Date: January 1, 2026
                Last Reviewed: February 15, 2026
                Policy Owner: Customer Experience Team
            """),
            metadata={
                "category": "returns",
                "last_updated": "2026-02-15",
                "policy_version": "4.2",
                "department": "customer_experience",
            },
        ),
        Document(
            content=textwrap.dedent("""\
                # Athena Retail Group — Shipping Policy

                ## Standard Shipping
                Standard shipping is available for all orders
                and typically arrives within 5-7 business days.
                Standard shipping is free for orders over $75.
                Orders under $75 incur a flat shipping fee of
                $6.95.

                ## Expedited Shipping
                Two-day expedited shipping is available for
                $12.95. Next-day shipping is available for
                $19.95 and must be placed before 2:00 PM EST
                for same-day processing.

                ## International Shipping
                Athena ships to Canada and the United Kingdom.
                International shipping rates are calculated at
                checkout based on package weight and destination.
                Delivery typically takes 10-15 business days.
                Customs duties and taxes are the responsibility
                of the customer.

                ## Order Tracking
                All orders include tracking information, sent
                via email within 24 hours of shipment.
                Customers can also track orders through their
                Athena account dashboard.

                Effective Date: March 1, 2026
                Last Reviewed: March 1, 2026
                Policy Owner: Logistics Team
            """),
            metadata={
                "category": "shipping",
                "last_updated": "2026-03-01",
                "policy_version": "3.1",
                "department": "logistics",
            },
        ),
        Document(
            content=textwrap.dedent("""\
                # Athena Retail Group — Loyalty Program (Athena Rewards)

                ## Program Overview
                Athena Rewards is a free loyalty program open to
                all customers. Members earn 1 point per dollar
                spent. Points are redeemable for discounts on
                future purchases.

                ## Tier Structure
                - **Bronze (0-499 points):** Base earning rate,
                  birthday discount (10% off).
                - **Silver (500-1,499 points):** 1.25x earning
                  rate, free standard shipping, early access
                  to sales.
                - **Gold (1,500+ points):** 1.5x earning rate,
                  free expedited shipping, exclusive Gold
                  member events, dedicated customer service line.

                ## Point Redemption
                Points may be redeemed at a rate of 100 points
                = $5 discount. Minimum redemption is 100 points.
                Points expire 12 months after the last qualifying
                purchase.

                ## Points on Returns
                When an item is returned, the points earned on
                that purchase are deducted from the member's
                balance. If the member's balance is insufficient,
                the balance will go negative and must be restored
                through future purchases.

                Effective Date: January 1, 2026
                Last Reviewed: January 15, 2026
                Policy Owner: Marketing Team
            """),
            metadata={
                "category": "loyalty",
                "last_updated": "2026-01-15",
                "policy_version": "2.0",
                "department": "marketing",
            },
        ),
        Document(
            content=textwrap.dedent("""\
                # Athena Retail Group — Price Match Guarantee

                ## Eligibility
                Athena will match the price of an identical item
                sold by a qualifying competitor. The item must
                be the same brand, model, size, and color.
                The competitor's price must be current and
                verifiable (print ad, website screenshot, or
                live website).

                ## Qualifying Competitors
                Price matching applies to the following
                competitors: Nordstrom, Macy's, Target,
                and Amazon (sold and shipped by Amazon only,
                not third-party sellers).

                ## Exclusions
                Price matching does not apply to: clearance
                items, doorbusters, lightning deals, coupon
                prices, bundled offers, or items sold by
                third-party marketplace sellers. Price matching
                is not available during Athena's Black Friday
                and Cyber Monday promotional periods.

                ## Process
                Customers may request a price match at the
                time of purchase (in-store or online) or within
                14 days of purchase with proof of the
                competitor's lower price.

                Effective Date: June 1, 2025
                Last Reviewed: November 30, 2025
                Policy Owner: Merchandising Team
            """),
            metadata={
                "category": "pricing",
                "last_updated": "2025-11-30",
                "policy_version": "1.3",
                "department": "merchandising",
            },
        ),
        Document(
            content=textwrap.dedent("""\
                # Athena Retail Group — Gift Card Policy

                ## Purchase
                Athena gift cards are available in denominations
                of $25, $50, $100, and $200. Custom amounts
                between $10 and $500 are available online.
                Gift cards can be purchased in-store or at
                athena.com/giftcards.

                ## Redemption
                Gift cards are redeemable at any Athena retail
                location and online at athena.com. Gift cards
                cannot be used to purchase other gift cards.
                Multiple gift cards may be applied to a single
                transaction.

                ## Expiration
                Athena gift cards do not expire. No inactivity
                or service fees will be charged.

                ## Lost or Stolen Cards
                Lost or stolen gift cards can be replaced with
                proof of purchase (receipt or order confirmation).
                Athena is not responsible for lost or stolen cards
                used prior to reporting.

                Effective Date: January 1, 2025
                Last Reviewed: December 1, 2025
                Policy Owner: Finance Team
            """),
            metadata={
                "category": "gift_cards",
                "last_updated": "2025-12-01",
                "policy_version": "2.1",
                "department": "finance",
            },
        ),
    ]

    # ── Build the Pipeline ───────────────────────────────────

    # Initialize components
    preprocessor = DocumentPreprocessor()
    chunker = RecursiveChunker(
        chunk_size=500,
        chunk_overlap=100,
    )
    embedder = SimpleEmbedder()
    vector_store = VectorStore(embedder=embedder)

    # Process and index documents
    all_chunks = []
    for doc in documents:
        processed_doc = preprocessor.preprocess(doc)
        chunks = chunker.chunk_document(processed_doc)
        all_chunks.extend(chunks)
        print(
            f"  Processed '{processed_doc.metadata['title']}': "
            f"{len(chunks)} chunks"
        )

    # Add all chunks to the vector store
    vector_store.add_chunks(all_chunks)

    # Build the RAG pipeline
    pipeline = RAGPipeline(
        vector_store=vector_store,
        top_k=3,
    )

    return pipeline


def demo_athena_rag():
    """
    Demonstrate the Athena RAG pipeline with sample queries.
    """
    print("=" * 60)
    print("ATHENA POLICY CO-PILOT — RAG Pipeline Demo")
    print("=" * 60)

    # Build the knowledge base
    print("\n── Building Knowledge Base ──\n")
    pipeline = build_athena_knowledge_base()

    # Print vector store stats
    stats = pipeline.vector_store.get_stats()
    print(f"\nVector Store Stats: {json.dumps(stats, indent=2)}")

    # ── Sample Queries ───────────────────────────────────────

    queries = [
        "What is the return window for holiday purchases?",
        "How much does expedited shipping cost?",
        "Do gift cards expire?",
        "Will Athena match Amazon's price on a jacket?",
        "How do loyalty points work with returns?",
    ]

    evaluator = RAGEvaluator()

    for i, query in enumerate(queries, 1):
        print(f"\n{'─' * 60}")
        print(f"Query {i}: {query}")
        print(f"{'─' * 60}")

        # Execute RAG pipeline
        response = pipeline.query(query)

        # Display answer
        print(f"\nAnswer:\n{response.answer}")

        # Display sources
        print(f"\nSources ({len(response.sources)}):")
        for src in response.sources:
            print(
                f"  - {src['title']} "
                f"(score: {src['score']}, "
                f"updated: {src['last_updated']})"
            )

        # Evaluate retrieval quality
        retrieval_metrics = evaluator.evaluate_retrieval(
            [
                RetrievalResult(
                    chunk=Chunk(
                        content=src["chunk_preview"],
                        metadata=src,
                    ),
                    score=src["score"],
                    rank=idx + 1,
                )
                for idx, src in enumerate(response.sources)
            ]
        )
        print(f"\nRetrieval Metrics: {json.dumps(retrieval_metrics, indent=2)}")

        # Evaluate freshness
        freshness = evaluator.evaluate_freshness(
            [
                RetrievalResult(
                    chunk=Chunk(
                        content="",
                        metadata={
                            "title": src["title"],
                            "last_updated": src["last_updated"],
                        },
                    ),
                    score=src["score"],
                    rank=idx + 1,
                )
                for idx, src in enumerate(response.sources)
            ]
        )
        if freshness["stale_sources"]:
            print(
                f"\n⚠ STALENESS WARNING: {len(freshness['stale_sources'])} "
                f"source(s) may be outdated:"
            )
            for stale in freshness["stale_sources"]:
                print(
                    f"    '{stale['title']}' — last updated "
                    f"{stale['age_days']} days ago"
                )

    # ── Summary Statistics ───────────────────────────────────

    print(f"\n{'=' * 60}")
    print("PIPELINE SUMMARY")
    print(f"{'=' * 60}")
    print("Documents indexed: 5")
    print(f"Total chunks: {stats['total_chunks']}")
    print(f"Avg chunk length: {stats['avg_chunk_length']} chars")
    print(f"Queries processed: {len(queries)}")
    print(
        f"\nNext steps: Connect an LLM client (OpenAI, Anthropic) "
        f"to generate natural language answers from retrieved context."
    )


# ── Run Demo ─────────────────────────────────────────────────

if __name__ == "__main__":
    demo_athena_rag()

Code Explanation: The build_athena_knowledge_base function creates the complete pipeline from sample policy documents. It preprocesses each document (cleaning text, extracting metadata), chunks them using recursive splitting (500-character chunks with 100-character overlap), embeds the chunks, and stores them in the vector store. The demo_athena_rag function runs five sample queries through the pipeline and evaluates retrieval quality and source freshness for each. The freshness evaluator flags any document that hasn't been reviewed in 90+ days — this is the governance mechanism that addresses Ravi's concern about outdated documents being faithfully retrieved and cited.


Production Considerations

Building a working RAG prototype is one thing. Deploying it in production — where it handles thousands of queries per day, with real customers, real consequences, and real costs — is quite another. Here are the considerations Ravi's team navigated when moving Athena's policy co-pilot from prototype to production.

Latency

Users expect sub-second responses. A RAG pipeline adds latency at every stage:

| Stage | Typical Latency | Optimization |
|---|---|---|
| Query embedding | 20-50 ms | Use smaller embedding model; batch if possible |
| Vector search | 5-50 ms | HNSW index; filter by metadata to reduce search space |
| Re-ranking | 50-200 ms | Limit re-ranking to top-20 candidates |
| LLM generation | 500-3,000 ms | Stream responses; use faster models for simple queries |
| Total | 575-3,300 ms | Aim for < 2 seconds end-to-end |

LLM generation is the bottleneck. Strategies to reduce generation latency include: using faster models (GPT-4o-mini instead of GPT-4 for routine queries), streaming responses (displaying text as it is generated), and caching frequent queries.
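
Per-stage budgets like those above can be enforced in code. Here is a minimal sketch of a stage timer; the stage names and budget values are illustrative assumptions, not Athena's production figures:

```python
import time
from contextlib import contextmanager

# Illustrative per-stage latency budgets in milliseconds (assumptions)
BUDGET_MS = {
    "embed": 50,
    "search": 50,
    "rerank": 200,
    "generate": 3000,
}
TOTAL_BUDGET_MS = 2000  # end-to-end target from the table above


class LatencyTracker:
    """Records elapsed time per pipeline stage and flags overruns."""

    def __init__(self):
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

    def report(self) -> dict:
        total = sum(self.timings_ms.values())
        return {
            "total_ms": round(total, 1),
            "within_budget": total <= TOTAL_BUDGET_MS,
            "over_budget_stages": [
                name for name, ms in self.timings_ms.items()
                if name in BUDGET_MS and ms > BUDGET_MS[name]
            ],
        }


tracker = LatencyTracker()
with tracker.stage("embed"):
    time.sleep(0.01)   # stand-in for embedding the query
with tracker.stage("search"):
    time.sleep(0.005)  # stand-in for vector search
print(tracker.report())
```

Wrapping each pipeline stage in `tracker.stage(...)` makes latency regressions visible per stage, which is where optimization effort should be targeted.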

Caching

Many RAG queries are repetitive. "What is the return policy?" is asked hundreds of times. Caching strategies include:

Exact match caching: Cache the response for identical queries. Simple and effective for FAQ-like workloads.

Semantic caching: Cache responses for semantically similar queries. "What's the return policy?" and "How do I return an item?" should hit the same cache entry. Implement by embedding queries and finding near-matches in a cache index.

Chunk-level caching: Cache embeddings for frequently retrieved chunks. This eliminates redundant embedding computation during retrieval.

Cost Optimization

RAG costs have three components:

  1. Embedding costs: $0.02-$0.13 per million tokens (OpenAI pricing as of 2026). Indexing 5,000 documents is inexpensive. Re-embedding on every document update adds up.

  2. LLM generation costs: $1-$30 per million tokens depending on the model. This is the dominant cost. Using smaller models for routine queries and reserving larger models for complex queries reduces costs by 60-80 percent.

  3. Infrastructure costs: Vector database hosting, compute for embedding and re-ranking, monitoring and logging. Managed services (Pinecone, Weaviate Cloud) charge based on data volume and query throughput.

Business Insight: Athena's RAG system processes about 1,200 queries per day. At an average of 500 input tokens (query + context) and 200 output tokens per query, the LLM generation cost using GPT-4o-mini is approximately $0.50 per day — roughly $15 per month. The time savings for customer service agents (a 35 percent reduction in average handle time, at an average agent cost of $22/hour) produce monthly savings exceeding $40,000. The ROI is not a close call: the technology cost is trivial compared to the business value. Most enterprise RAG deployments follow this pattern.
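
The back-of-envelope arithmetic behind estimates like the one above is worth making explicit. A sketch; the per-million-token prices are illustrative placeholders, since actual rates vary by model and vendor:

```python
# Illustrative prices per million tokens (placeholders; real rates
# vary by model and vendor)
PRICE_PER_M_INPUT = 0.15   # USD per million input tokens
PRICE_PER_M_OUTPUT = 0.60  # USD per million output tokens


def daily_llm_cost(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
) -> float:
    """Estimate daily LLM generation cost in USD."""
    input_tokens = queries_per_day * input_tokens_per_query
    output_tokens = queries_per_day * output_tokens_per_query
    cost = (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )
    return round(cost, 2)


# A workload of 1,200 queries/day, ~500 input / 200 output tokens each
print(daily_llm_cost(1200, 500, 200))  # → 0.23
```

Swapping in the current published rate for whichever model handles each query tier turns this into a quick budgeting tool.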

Scaling

As the knowledge base grows — from 5,000 documents to 50,000, or from 1,200 queries per day to 12,000 — several components require attention:

  • Vector database scaling: Move from in-memory (ChromaDB) to managed cloud (Pinecone, Weaviate Cloud) for datasets exceeding 1 million chunks. See Chapter 23 for cloud infrastructure decisions.
  • Embedding pipeline: Batch embedding jobs for large document ingestion; incremental updates for ongoing changes.
  • Load balancing: Distribute query traffic across multiple LLM API keys or model endpoints to avoid rate limits.
  • Monitoring: Track retrieval quality metrics over time, alert on degradation, audit for bias in retrieval results.
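
Incremental embedding updates depend on knowing which documents actually changed. A minimal sketch using content hashes; the document IDs and store layout are illustrative:

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_needing_reindex(
    current_docs: dict[str, str],    # doc_id -> current content
    indexed_hashes: dict[str, str],  # doc_id -> hash at last index time
) -> list[str]:
    """Return doc_ids that are new or whose content changed.

    Only these documents need re-chunking and re-embedding;
    unchanged documents keep their existing vectors.
    """
    changed = []
    for doc_id, content in current_docs.items():
        if indexed_hashes.get(doc_id) != content_hash(content):
            changed.append(doc_id)
    return changed


indexed = {"returns_v4.2": content_hash("45-day holiday returns")}
current = {
    "returns_v4.2": "45-day holiday returns",   # unchanged
    "shipping_v3.1": "free shipping over $75",  # new document
}
print(docs_needing_reindex(current, indexed))  # → ['shipping_v3.1']
```

This is what keeps re-embedding costs linear in the change rate rather than in the size of the knowledge base.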

Knowledge Base Governance

The problem NK identified in the opening scene — a chatbot confidently citing incorrect information — has a governance solution, not just a technical one.

Athena Update: After launching the policy co-pilot, Ravi established a Knowledge Base Governance Process with three components:

  1. Document ownership: Every document in the knowledge base has a named owner responsible for accuracy and currency.
  2. Freshness SLA: Documents must be reviewed by their owner at least every 90 days. The system automatically flags queries that rely on documents past their review date.
  3. Change management: When policies change, the knowledge base update is part of the change management checklist — not an afterthought.

These governance processes reduced the "stale document" error rate from 8 percent to under 1 percent within three months. NK later cited this as an example of how AI governance is not about restricting AI but about making AI trustworthy.


Document Freshness: The Stale Knowledge Problem

Ravi leans against the podium and tells the class what happened six weeks after launch.

"Everything was working beautifully. Agents loved the co-pilot. Handle times dropped. Accuracy on policy questions hit 90 percent. Then we got a complaint. A customer was told they could price-match against Best Buy. Our price-match policy covers Nordstrom, Macy's, Target, and Amazon. Not Best Buy. Never Best Buy."

Tom frowns. "Was the model hallucinating?"

"No. That's the thing. The model was being perfectly faithful. It retrieved a document from November 2024 — an old version of the price-match policy that included Best Buy. We updated the policy in June 2025 to remove Best Buy, but the old document was still sitting in the knowledge base. The RAG system faithfully retrieved it and faithfully cited it."

Professor Okonkwo nods. "This is the governance challenge of RAG. The system only knows what you put in its knowledge base. If you put wrong information in, it will confidently deliver wrong information out — with citations. RAG does not solve the problem of data quality. It concentrates it."

NK writes in her notebook: RAG is only as good as the documents it reads. Garbage in, garbage out — but with citations.

This is why Athena's knowledge base governance process matters as much as its retrieval algorithm. Technical quality without data governance is a liability wearing a lab coat.


From Policy Co-Pilot to Customer-Facing Chatbot

With the internal co-pilot proving its value, NK proposes the next step during her internship review with Ravi.

"The agents are using the co-pilot to look up policies and draft responses. Why not give customers direct access? Cut out the middleman."

Ravi considers this carefully. "Because the middleman is a human being who exercises judgment. A customer asks about returning a damaged item they bought eight months ago. The policy says 30 days. But the agent might say, 'Let me see what I can do for you,' and authorize a goodwill exception. The co-pilot can't make that call."

"So we keep the human in the loop for edge cases," NK says. "But for the 80 percent of questions that are straightforward — 'What are your hours?' 'Is this in stock?' 'How do I track my order?' — the chatbot handles it directly. It only escalates when it's not confident or when the situation requires judgment."

Ravi smiles. "Write that up as a proposal. Include the architecture, the escalation logic, and the risks. If it's solid, we'll pilot it in Q3."

Athena Update: NK's proposal becomes the foundation for Athena's customer-facing chatbot — a project we will follow in subsequent chapters. The architecture combines RAG (for policy grounding) with the escalation patterns and AI governance frameworks covered in Chapters 27-29. The key design decision: the chatbot is always transparent about being an AI, always offers the option to speak with a human, and never makes promises about policy exceptions it cannot authorize.


Chapter Summary

This chapter moved beyond individual prompts to complete AI-powered systems. The progression from hallucinating chatbot to grounded RAG pipeline to production-ready knowledge system illustrates a pattern that applies far beyond customer service:

Identify the failure mode (hallucination, in this case) → Choose the right architecture (RAG) → Engineer each pipeline stage (chunking, embedding, retrieval, generation) → Evaluate rigorously (retrieval quality, faithfulness, freshness) → Govern the data (ownership, freshness SLAs, change management) → Scale deliberately (caching, cost optimization, monitoring).

RAG is not the only architecture for AI-powered workflows, but it is currently the most important one for enterprise applications. Every major cloud provider, every LLM platform, and every enterprise AI vendor either offers RAG capabilities or is building them. Understanding how RAG works — and more importantly, understanding how it fails — is essential knowledge for any business leader deploying AI in production.

The agents and tool-use patterns introduced in this chapter will become increasingly central to enterprise AI. As language models become more capable at reasoning and planning, the range of tasks they can handle autonomously will expand. But the governance question — who is accountable when an agent makes a mistake? — will remain the critical constraint on deployment speed. We will explore this question in depth in Chapters 27 and 29.

Tom packs up his notebook and turns to NK. "So the model was never wrong about the 60-day return policy. It was wrong about Athena's return policy. There's a difference."

NK nods. "And RAG closes that gap. The model doesn't need to know Athena's policy. It just needs to read the document."

"The right document," Tom adds. "The current document."

"That's the governance part," NK says. "The engineering is easy. The governance is hard."

Professor Okonkwo, overhearing, smiles. It is the most MBA thing NK has ever said — and she is entirely correct.


In Chapter 22, we explore no-code and low-code AI platforms that make AI accessible without engineering expertise. In Chapter 23, we examine the cloud AI services and APIs — including managed RAG infrastructure — that power production AI systems at scale. For data privacy considerations in RAG systems (particularly when the knowledge base contains sensitive information), see Chapter 29.