Capstone Project 1: Build a Production RAG System with Guardrails

Project Overview

In this capstone project, you will design, build, and deploy a complete Retrieval-Augmented Generation (RAG) system that answers questions from a domain-specific knowledge base. Unlike a simple proof-of-concept, this system must include production-grade features: document ingestion pipelines, hybrid retrieval, guardrails for safety and accuracy, a REST API, observability, and evaluation infrastructure.

This project synthesizes concepts from Chapters 10 (Embeddings), 11 (Transformers), 15 (Text Generation), 19 (Prompt Engineering), 20 (Information Retrieval), 26 (RAG Systems), 28 (Data Engineering), 30 (API Design), 31 (Evaluation), 32 (Guardrails), and 34 (Model Serving).

Estimated Time: 60-80 hours over 4-6 weeks.

Team Size: 1-3 people.


System Architecture

+------------------------------------------------------------------+
|                        CLIENT LAYER                               |
|   [ Web UI ]    [ CLI Tool ]    [ API Client / Postman ]          |
+----------|--------------|----------------|------------------------+
           |              |                |
           v              v                v
+------------------------------------------------------------------+
|                      REST API (FastAPI)                            |
|   /query    /ingest    /feedback    /health    /metrics            |
+----------|--------------|----------------|------------------------+
           |              |                |
           v              v                v
+------------------------------------------------------------------+
|                    ORCHESTRATION LAYER                             |
|                                                                   |
|  +----------------+  +------------------+  +-------------------+  |
|  | INPUT          |  | RETRIEVAL        |  | GENERATION        |  |
|  | GUARDRAILS     |  | PIPELINE         |  | PIPELINE          |  |
|  |                |  |                  |  |                   |  |
|  | - Query valid. |  | - Query rewrite  |  | - Prompt assembly |  |
|  | - PII detect   |  | - Dense retrieval|  | - LLM inference   |  |
|  | - Topic filter |  | - Sparse (BM25)  |  | - Citation gen    |  |
|  | - Rate limit   |  | - Re-ranking     |  | - Streaming resp  |  |
|  +----------------+  | - Fusion         |  +-------------------+  |
|                       +------------------+                        |
|                              |                                    |
|                              v                                    |
|                    +-------------------+                           |
|                    | OUTPUT GUARDRAILS |                           |
|                    | - Faithfulness    |                           |
|                    | - Toxicity filter |                           |
|                    | - Hallucination   |                           |
|                    |   detection       |                           |
|                    | - Format valid.   |                           |
|                    +-------------------+                           |
+------------------------------------------------------------------+
           |                                        |
           v                                        v
+------------------------+           +----------------------------+
| DOCUMENT STORE         |           | OBSERVABILITY              |
|                        |           |                            |
| +--------------------+ |           | - Request logging          |
| | Vector DB          | |           | - Latency tracking         |
| | (Qdrant/ChromaDB)  | |           | - Token usage              |
| +--------------------+ |           | - Retrieval quality        |
| +--------------------+ |           | - User feedback            |
| | BM25 Index         | |           | - Error rates              |
| | (Elasticsearch)    | |           | - Cost tracking            |
| +--------------------+ |           +----------------------------+
| +--------------------+ |
| | Document Metadata  | |
| | (PostgreSQL)       | |
| +--------------------+ |
+------------------------+
           ^
           |
+------------------------+
| INGESTION PIPELINE     |
|                        |
| - PDF/HTML/MD parsing  |
| - Chunking + overlap   |
| - Embedding generation |
| - Metadata extraction  |
| - Deduplication        |
+------------------------+

Milestone 1: Document Ingestion Pipeline (Week 1)

Objectives

Build a robust pipeline that processes documents from multiple formats and stores them in both a vector database and a sparse index.

Requirements

1.1 Document Parsing
  • Support at least three document formats: PDF, HTML, and Markdown.
  • Use appropriate parsing libraries (e.g., pymupdf or pdfplumber for PDF, beautifulsoup4 for HTML).
  • Extract and preserve document structure: titles, headings, paragraphs, tables, and lists.
  • Handle encoding issues, malformed documents, and edge cases gracefully, with logging.

1.2 Chunking Strategy
  • Implement at least two chunking strategies:
      - Fixed-size chunking with configurable overlap (e.g., 512 tokens with a 64-token overlap).
      - Semantic chunking that respects document structure (splitting at heading boundaries and paragraph breaks).
  • Each chunk must carry metadata: source document ID, chunk index, section heading, page number (if applicable), and ingestion timestamp.
  • Make chunk size and overlap configurable via environment variables or a configuration file.
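
For the fixed-size strategy, a minimal sketch of token-window chunking with overlap is shown below; the whitespace split stands in for a real tokenizer, and the Chunk dataclass carries only a subset of the required metadata.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    chunk_index: int

def fixed_size_chunks(text: str, doc_id: str, chunk_size: int = 512, overlap: int = 64) -> list[Chunk]:
    # Illustrative whitespace "tokenizer"; swap in the embedding model's tokenizer for real token counts.
    tokens = text.split()
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, max(len(tokens), 1), step)):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(Chunk(text=" ".join(window), doc_id=doc_id, chunk_index=index))
        if start + chunk_size >= len(tokens):
            break
    return chunks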

1.3 Embedding Generation
  • Generate dense embeddings using a sentence-transformer model (e.g., all-MiniLM-L6-v2 for prototyping; bge-large-en-v1.5 or gte-large for production).
  • Implement batched embedding generation with progress tracking.
  • Store embeddings in a vector database (Qdrant, ChromaDB, or Weaviate).
  • Build a parallel BM25 index (using Elasticsearch, OpenSearch, or a Python-based BM25 library such as rank-bm25).
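
A minimal sketch of batched embedding plus storage in Qdrant, assuming the sentence-transformers and qdrant-client packages; the collection name, payload fields, and sequential point IDs are illustrative (stable IDs derived from doc_id and chunk_index would be needed for idempotent re-ingestion).

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # produces 384-dimensional embeddings
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_chunks(texts: list[str], payloads: list[dict]) -> None:
    # encode() batches internally; show_progress_bar provides the required progress tracking.
    vectors = model.encode(texts, batch_size=64, show_progress_bar=True)
    points = [
        PointStruct(id=i, vector=vec.tolist(), payload=payload)   # use stable IDs in real code
        for i, (vec, payload) in enumerate(zip(vectors, payloads))
    ]
    client.upsert(collection_name="docs", points=points)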

1.4 Metadata Storage
  • Store document-level metadata in a relational database (SQLite for development, PostgreSQL for production).
  • Track: document ID, filename, source URL, ingestion time, number of chunks, format, and file hash (for deduplication).
  • Implement idempotent ingestion: re-ingesting the same document should update rather than duplicate.
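
One way to make ingestion idempotent is to key documents on a content hash; the SQLite sketch below is illustrative (table and column names are not prescribed).

import hashlib
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents (
    doc_id TEXT PRIMARY KEY, filename TEXT, file_hash TEXT UNIQUE,
    ingested_at TEXT DEFAULT CURRENT_TIMESTAMP, num_chunks INTEGER)""")

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def register_document(doc_id: str, path: str, num_chunks: int) -> bool:
    """Insert or update a document row; return False if identical content was already ingested."""
    digest = file_sha256(path)
    if conn.execute("SELECT 1 FROM documents WHERE file_hash = ?", (digest,)).fetchone():
        return False                      # same content already ingested -- skip re-embedding
    conn.execute(
        "INSERT INTO documents (doc_id, filename, file_hash, num_chunks) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(doc_id) DO UPDATE SET filename=excluded.filename, "
        "file_hash=excluded.file_hash, num_chunks=excluded.num_chunks",
        (doc_id, path, digest, num_chunks),
    )
    conn.commit()
    return True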

Deliverables

  • A working ingestion pipeline invocable via CLI: python ingest.py --source ./docs/
  • Unit tests for parsing, chunking, and embedding generation.
  • A configuration file (config.yaml) with all tunable parameters.
  • A brief document describing chunking strategy decisions and their rationale.

Milestone 2: Retrieval Pipeline (Week 2)

Objectives

Build a hybrid retrieval pipeline that combines dense and sparse retrieval with re-ranking.

Requirements

2.1 Query Processing
  • Implement query rewriting using an LLM: given a potentially ambiguous user query, generate an improved search query.
  • Implement query expansion: generate 2-3 variant queries to increase recall.
  • Support conversational context: for follow-up questions, resolve coreferences using the chat history.

2.2 Hybrid Retrieval
  • Implement dense retrieval using the vector database (cosine similarity or dot product).
  • Implement sparse retrieval using BM25.
  • Combine results using Reciprocal Rank Fusion (RRF): score(d) = sum over retrieval methods r of 1 / (k + rank_r(d)), where k is typically 60.
  • Retrieve a configurable top-k (default: 20 before re-ranking, 5 after).
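
A minimal RRF sketch over ranked ID lists, matching the formula above; each retriever is assumed to return document IDs in rank order.

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60, top_k: int = 20) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over retrievers of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: fuse dense and BM25 result lists
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])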

2.3 Re-Ranking
  • Implement a cross-encoder re-ranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) that scores each (query, passage) pair.
  • The re-ranker should take the top-20 fused results and output the top 5.
  • Log the retrieval latency breakdown: dense search time, sparse search time, fusion time, and re-ranking time.
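
A minimal re-ranking sketch using the sentence-transformers CrossEncoder wrapper with the checkpoint named above; passages are assumed to be plain strings.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, passage) pair, then keep the highest-scoring passages.
    scores = reranker.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]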

2.4 Retrieval Evaluation
  • Implement retrieval evaluation using a manually curated set of at least 25 query-document relevance pairs.
  • Compute metrics: Recall@5, Recall@10, MRR@10, and NDCG@10.
  • Compare dense-only, sparse-only, hybrid without re-ranking, and hybrid with re-ranking.
  • Present the results in a table showing the impact of each component.
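
A minimal sketch of Recall@k and MRR@k; NDCG@10 can be added analogously or taken from an existing evaluation library. The runs/qrels structure is an assumption, not a required format.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: dict[str, list[str]], qrels: dict[str, set[str]], k: int = 10) -> dict[str, float]:
    """Average Recall@k and MRR@k over all queries; runs and qrels share the same query keys."""
    n = len(qrels)
    return {
        f"recall@{k}": sum(recall_at_k(runs[q], rel, k) for q, rel in qrels.items()) / n,
        f"mrr@{k}": sum(mrr_at_k(runs[q], rel, k) for q, rel in qrels.items()) / n,
    }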

Deliverables

  • A retrieval module with a clean interface: retrieve(query: str, top_k: int, chat_history: list) -> list[RetrievedChunk].
  • Evaluation results table comparing retrieval variants.
  • Latency benchmarks for the retrieval pipeline.

Milestone 3: Generation Pipeline (Week 3)

Objectives

Build the generation component that synthesizes answers from retrieved context, with proper citation and streaming support.

Requirements

3.1 Prompt Engineering
  • Design a system prompt that instructs the LLM to:
      - Answer based only on the provided context.
      - Cite sources using bracketed references (e.g., [1], [2]).
      - State "I don't have enough information to answer this question" when the context is insufficient.
      - Maintain a professional, helpful tone.
  • Implement prompt templates using Jinja2 or a similar templating engine.
  • Support configurable prompt variants for A/B testing.
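
A minimal Jinja2 sketch showing one way to render the instructions and retrieved chunks into a prompt; the template text and the chunk objects are illustrative, not a prescribed format.

from collections import namedtuple
from jinja2 import Template

RAG_PROMPT = Template(
    "Answer the question using ONLY the context below. "
    "Cite sources as [1], [2], ... If the context is insufficient, say "
    "\"I don't have enough information to answer this question.\"\n\n"
    "{% for chunk in chunks %}[{{ loop.index }}] {{ chunk.text }}\n{% endfor %}\n"
    "Question: {{ question }}\nAnswer:"
)

# Placeholder chunk objects; in the real pipeline these come from the retrieval module.
ExampleChunk = namedtuple("ExampleChunk", "text")
retrieved_chunks = [ExampleChunk("Refunds are issued within 30 days of purchase.")]
prompt = RAG_PROMPT.render(question="What is the refund policy?", chunks=retrieved_chunks)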

3.2 LLM Integration
  • Support at least two LLM backends:
      - A cloud API (OpenAI GPT-4, Anthropic Claude, or equivalent).
      - A local model served via vLLM or Ollama (e.g., Llama 3.1 8B, Mistral 7B).
  • Implement a clean abstraction layer so backends can be swapped via configuration.
  • Support streaming responses (server-sent events).
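
One possible shape for the abstraction layer is a small Protocol that every backend implements. The sketch below uses the openai Python client for the cloud backend; an Ollama or vLLM backend would implement the same stream() method, and a factory driven by your configuration would select between them.

from typing import Iterator, Protocol

class LLMBackend(Protocol):
    def stream(self, system: str, user: str) -> Iterator[str]:
        """Yield answer text incrementally."""
        ...

class OpenAIBackend:
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI
        self.client = OpenAI()             # reads OPENAI_API_KEY from the environment
        self.model = model

    def stream(self, system: str, user: str) -> Iterator[str]:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
            stream=True,
        )
        for chunk in response:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta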

3.3 Citation Generation
  • Each answer must include inline citations referencing specific retrieved chunks.
  • After the answer, include a "Sources" section listing the cited documents with their metadata (title, page number, etc.).
  • Implement citation verification: check that cited chunk IDs actually exist in the retrieved results.

3.4 Context Window Management
  • Implement intelligent context truncation: when retrieved chunks exceed the model's context window, prioritize the higher-ranked chunks.
  • Track and log token usage for each request (prompt tokens, completion tokens, total tokens).
  • Estimate the cost per request.
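
A minimal sketch of rank-ordered truncation under a token budget; count_tokens is a placeholder for the tokenizer of your chosen model (e.g., tiktoken for OpenAI models), and the whitespace count below is used only for illustration.

def truncate_context(chunks: list[str], count_tokens, budget: int) -> list[str]:
    """Keep chunks in rank order until the token budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:                  # chunks are assumed to be sorted best-first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

# Illustrative usage with placeholder data and a crude whitespace-based token count.
ranked_chunks = ["highest-ranked chunk text ...", "next-ranked chunk text ..."]
selected = truncate_context(ranked_chunks, count_tokens=lambda s: len(s.split()), budget=3000)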

Deliverables

  • A generation module: generate(query: str, context: list[RetrievedChunk], chat_history: list) -> GeneratedAnswer.
  • Streaming endpoint that yields answer tokens incrementally.
  • Token usage tracking and cost estimation.

Milestone 4: Guardrails and Safety (Week 4)

Objectives

Implement comprehensive input and output guardrails to ensure safe, accurate, and on-topic responses.

Requirements

4.1 Input Guardrails
  • Query validation: reject empty queries, queries exceeding a maximum length, and queries that consist solely of special characters.
  • PII detection: detect and optionally redact personally identifiable information (email addresses, phone numbers, SSNs) from user queries before they are sent to the LLM. Use regex patterns and/or a NER model.
  • Topic filtering: implement a classifier (rule-based or ML-based) that detects and rejects queries outside the system's domain scope.
  • Injection detection: detect prompt-injection attempts (e.g., "ignore your instructions and...") using pattern matching and/or a specialized classifier.
  • Rate limiting: implement per-user rate limits (e.g., 10 requests per minute).
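
A minimal regex-only sketch of PII detection and redaction for the three PII types listed above; production patterns would be more robust and are typically paired with a NER model.

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(query: str) -> tuple[str, list[str]]:
    """Replace detected PII with typed placeholders and report which types were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(query):
            found.append(label)
            query = pattern.sub(f"[{label}]", query)
    return query, found

redacted, hits = redact_pii("Email me at jane@example.com about ticket 42.")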

4.2 Output Guardrails
  • Faithfulness check: verify that the generated answer is supported by the retrieved context. Approaches include:
      - NLI-based: use a Natural Language Inference model to check entailment between context and answer.
      - LLM-based: make a separate LLM call to verify faithfulness.
  • Toxicity filtering: use a toxicity classifier (e.g., unitary/toxic-bert) to detect and block toxic outputs.
  • Hallucination detection: flag answers in which the model makes specific claims (names, numbers, dates) that do not appear in the retrieved context.
  • Format validation: ensure the output conforms to the expected format (includes citations, does not exceed the maximum length).
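
A minimal sketch of the NLI-based option using the Hugging Face text-classification pipeline; the checkpoint and its lowercase label names (entailment/neutral/contradiction) are assumptions to verify against the model card, and the confidence threshold is arbitrary. Long contexts should be checked sentence-by-sentence in practice.

from transformers import pipeline

# Cross-encoder NLI model; premise = retrieved context, hypothesis = generated answer.
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def is_faithful(context: str, answer: str, threshold: float = 0.7) -> bool:
    # Label names assumed from the model card; adjust if the checkpoint uses different labels.
    result = nli([{"text": context, "text_pair": answer}])[0]
    return result["label"] == "entailment" and result["score"] >= threshold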

4.3 Fallback Behavior
  • When a guardrail is triggered, return a graceful, informative message rather than a generic error.
  • Log all guardrail activations with the triggering input/output for later review.
  • Implement a configurable guardrail strictness level (strict, moderate, permissive).

Deliverables

  • Input and output guardrail modules with clear interfaces.
  • A test suite of at least 30 adversarial inputs (prompt injections, PII-containing queries, off-topic queries, toxic prompts).
  • Documentation of each guardrail's approach, expected false positive rate, and failure modes.

Milestone 5: REST API and Monitoring (Week 5)

Objectives

Wrap the entire system in a production-quality REST API with monitoring and observability.

Requirements

5.1 API Design
  • Build the API using FastAPI with the following endpoints:
      - POST /query -- Submit a question and receive an answer with sources.
      - POST /query/stream -- Submit a question and receive a streaming response (SSE).
      - POST /ingest -- Upload a document for ingestion.
      - POST /feedback -- Submit user feedback (thumbs up/down, correction) for a previous response.
      - GET /health -- Health check endpoint.
      - GET /metrics -- Prometheus-compatible metrics endpoint.
  • Implement proper request/response schemas using Pydantic models.
  • Include OpenAPI documentation (automatically generated by FastAPI).
  • Support API key authentication.
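
A minimal FastAPI sketch of the /query endpoint with Pydantic schemas and header-based API key authentication; the schema fields, header name, environment variable, and answer_question() call are placeholders for your own modules.

import os
from fastapi import Depends, FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG Capstone API")

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]

def require_api_key(x_api_key: str = Header(...)) -> None:
    if x_api_key != os.environ.get("RAG_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/query", response_model=QueryResponse, dependencies=[Depends(require_api_key)])
def query(req: QueryRequest) -> QueryResponse:
    answer, sources = answer_question(req.question, top_k=req.top_k)   # your orchestration layer
    return QueryResponse(answer=answer, sources=sources)

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}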

5.2 Observability
  • Log every request with: request ID, timestamp, query, retrieval latency, generation latency, total latency, token count, guardrail outcomes, and the model used.
  • Implement structured logging (JSON format) using Python's logging module or structlog.
  • Track the following metrics:
      - Request rate (requests per minute).
      - Latency percentiles (p50, p95, p99).
      - Error rate by type (guardrail triggers, LLM errors, retrieval failures).
      - Token usage and estimated cost.
      - Feedback scores over time.
  • Optionally integrate with Prometheus and Grafana for dashboards.
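
A minimal structlog configuration that emits one JSON line per request; the event name and field names mirror the logging requirements above but are otherwise arbitrary.

import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
log = structlog.get_logger()

log.info(
    "query_completed",
    request_id="req-123",
    query="What is the refund policy?",
    retrieval_ms=84,
    generation_ms=1420,
    prompt_tokens=1510,
    completion_tokens=212,
    guardrails={"input": "pass", "faithfulness": "pass"},
    model="gpt-4o-mini",
)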

5.3 Containerization
  • Provide a Dockerfile for the application.
  • Provide a docker-compose.yml that launches the API, the vector database, and any other dependencies.
  • Include environment-variable configuration for all secrets and model paths.

Deliverables

  • A fully functional FastAPI application.
  • Docker configuration for one-command deployment.
  • API documentation (auto-generated via Swagger/OpenAPI).
  • A monitoring dashboard or at least structured log output demonstrating the metrics being tracked.

Milestone 6: Evaluation and Documentation (Week 6)

Objectives

Rigorously evaluate the end-to-end system and produce comprehensive documentation.

Requirements

6.1 End-to-End Evaluation
  • Create a test set of at least 50 questions with ground-truth answers and source documents.
  • Evaluate with the following metrics:
      - Answer quality: human evaluation on a 1-5 scale for correctness, completeness, and clarity, plus automated metrics (ROUGE-L, BERTScore) against reference answers.
      - Faithfulness: fraction of answers that are fully supported by the retrieved context (manual review plus an automated NLI check).
      - Citation accuracy: fraction of citations that correctly point to relevant source chunks.
      - Retrieval quality: Recall@5 and MRR@10 on the test set.
      - Latency: p50 and p95 end-to-end latency.
      - Guardrail effectiveness: precision and recall of the guardrails on the adversarial test set.

6.2 Ablation Study
  • Measure the impact of each component by disabling it and re-evaluating:
      - Without query rewriting.
      - Without re-ranking.
      - Without output guardrails.
      - Dense-only vs. hybrid retrieval.
  • Present the results in a comparison table.

6.3 Documentation
  • System architecture document with diagrams.
  • API reference (auto-generated, plus any additional notes).
  • Deployment guide (local development, Docker, and cloud deployment instructions).
  • Known limitations and future work.

Deliverables

  • Evaluation report with all metrics, tables, and analysis.
  • Ablation study results.
  • Complete documentation package.

Grading Rubric

Component weights and criteria:

  • Document Ingestion (15%): Handles multiple formats, robust chunking, metadata tracking, deduplication, idempotent operation.
  • Retrieval Pipeline (20%): Hybrid retrieval implemented correctly, re-ranking improves results, query rewriting works, retrieval metrics reported.
  • Generation Pipeline (15%): Accurate answers with proper citations, streaming support, multiple backend support, context management.
  • Guardrails (20%): Comprehensive input/output guardrails, adversarial testing, graceful fallback, configurable strictness.
  • API and Monitoring (15%): Clean API design, proper authentication, structured logging, metrics tracking, containerized deployment.
  • Evaluation and Documentation (15%): Thorough evaluation with a meaningful test set, ablation study, clear documentation, honest assessment of limitations.

Grade Thresholds

  • A (90-100%): All milestones completed with high quality. System handles edge cases gracefully. Evaluation is thorough with insightful analysis. Code is clean, well-tested, and well-documented.
  • B (80-89%): All milestones completed. Minor gaps in edge case handling or evaluation depth. Good code quality with adequate testing.
  • C (70-79%): Core functionality works but some milestones have significant gaps. Limited evaluation or testing. Documentation is incomplete.
  • D (60-69%): Basic RAG pipeline works but lacks guardrails, monitoring, or proper evaluation. Code quality issues.
  • F (<60%): System is incomplete or non-functional. Major components missing.

Technical Recommendations

  • Framework: FastAPI + Uvicorn
  • Vector DB: Qdrant (Docker) or ChromaDB (embedded)
  • Sparse Index: Elasticsearch or rank-bm25 (simpler)
  • Embedding Model: BAAI/bge-large-en-v1.5 or Alibaba-NLP/gte-large-en-v1.5
  • Re-ranker: cross-encoder/ms-marco-MiniLM-L-6-v2
  • LLM: OpenAI GPT-4o-mini (cloud), Llama 3.1 8B via Ollama (local)
  • Database: PostgreSQL (via Docker) or SQLite (development)
  • Monitoring: structlog + Prometheus client library
  • Testing: pytest + pytest-asyncio

Getting Started

# Clone the starter template (if provided) or create project structure
mkdir -p rag-capstone/{src,tests,docs,data,configs}
cd rag-capstone
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install fastapi uvicorn qdrant-client sentence-transformers \
    transformers torch pydantic structlog pytest pytest-asyncio httpx \
    pdfplumber beautifulsoup4 jinja2 rank-bm25 prometheus-client

Advice

  • Start with the simplest version of each component and iterate. A working end-to-end system with basic components is far more valuable than a half-finished system with advanced components.
  • Write tests as you go, not at the end.
  • Use configuration files and environment variables from the start. Hardcoded values are the enemy of iterative improvement.
  • Keep a development log documenting decisions, experiments, and results. This will form the basis of your final documentation.