Capstone Project 3: End-to-End Multimodal AI Application
Project Overview
In this capstone project, you will design and build an end-to-end multimodal AI application that processes, understands, and generates content across multiple modalities (text, images, and optionally audio). The application will combine vision-language models, retrieval systems, and agent capabilities into a cohesive product that solves a real-world problem.
This project synthesizes concepts from Chapters 8 (CNNs), 11 (Transformers), 15 (Text Generation), 19 (Prompt Engineering), 20 (Information Retrieval), 22 (Vision Models), 23 (Multimodal Models), 24 (Audio and Speech), 26 (RAG Systems), 27 (AI Agents), 30 (API Design), 32 (Guardrails), 34 (Model Serving), and 37 (Responsible AI).
Estimated Time: 70-90 hours over 5-7 weeks.
Team Size: 2-4 people (recommended due to breadth).
Application Selection
Choose one of the following application tracks (or propose your own with instructor approval):
Track A: Intelligent Document Analyst
An application that processes complex documents containing text, tables, figures, and charts, and answers questions about them. Think financial reports, scientific papers, or technical manuals with diagrams.
Track B: Visual Knowledge Base Assistant
An application that maintains a searchable knowledge base of images and text (e.g., product catalog, medical imaging database, architectural portfolio) and answers natural language queries that may involve visual reasoning.
Track C: Creative Multimodal Content Studio
An application that assists in creating content that spans modalities: generating image descriptions, creating images from text specifications, producing multimedia summaries, and maintaining visual-textual consistency across a project.
Track D: Accessibility and Understanding Assistant
An application that makes visual content accessible: describing images and scenes in detail, extracting text from images (OCR), answering questions about visual content, and optionally generating audio descriptions.
System Architecture
+------------------------------------------------------------------+
| USER INTERFACE |
| |
| +------------------------------------------------------------+ |
| | Web Application (Streamlit / Gradio / React) | |
| | | |
| | [ Text Input ] [ Image Upload ] [ File Upload ] | |
| | [ Audio Input (optional) ] [ Chat History ] | |
| | | |
| | [ Response Area: Text + Images + Citations ] | |
| +------------------------------------------------------------+ |
+----------|----------------------------|--------------------------+
| |
v v
+------------------------------------------------------------------+
| API GATEWAY (FastAPI) |
| |
| POST /chat POST /analyze POST /search |
| POST /upload GET /health GET /metrics |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| ORCHESTRATION / AGENT LAYER |
| |
| +------------------------------------------------------------+ |
| | Agent Controller | |
| | | |
| | - Intent classification (text-only / visual / multimodal) | |
| | - Tool selection and routing | |
| | - Multi-step reasoning (ReAct or Plan-and-Execute) | |
| | - Conversation state management | |
| +------------------------------------------------------------+ |
| | | | | |
| v v v v |
| +-------------+ +-------------+ +------------+ +-------------+ |
| | Vision | | Text | | Retrieval | | Generation | |
| | Pipeline | | Pipeline | | Pipeline | | Pipeline | |
| | | | | | | | | |
| | - Image | | - NER | | - Text | | - Text gen | |
| | captioning | | - Summary | | search | | - Image gen | |
| | - VQA | | - Classify | | - Image | | (optional)| |
| | - OCR | | - Extract | | search | | - Structured| |
| | - Object det | | | | - Hybrid | | output | |
| +-------------+ +-------------+ +------------+ +-------------+ |
+------------------------------------------------------------------+
| |
v v
+---------------------------+ +----------------------------+
| MULTIMODAL KNOWLEDGE BASE | | SAFETY AND GUARDRAILS |
| | | |
| +---------------------+ | | - Input validation |
| | Vector DB | | | - Content moderation |
| | (text + image embs) | | | (text + image) |
| +---------------------+ | | - Output verification |
| +---------------------+ | | - PII detection |
| | Document Store | | | - Bias monitoring |
| | (files, metadata) | | +----------------------------+
| +---------------------+ |
| +---------------------+ |
| | Image Store | |
| | (thumbnails, URLs) | |
| +---------------------+ |
+---------------------------+
Milestone 1: Multimodal Input Processing (Week 1-2)
Objectives
Build pipelines that can process and understand inputs from multiple modalities: text, images, and documents containing both.
Requirements
1.1 Image Understanding Pipeline
- Implement image captioning using a vision-language model. Recommended models:
  - LLaVA 1.6 (open source, strong performance).
  - InternVL2 (open source, strong on benchmarks).
  - GPT-4o / Claude via API (highest quality, closed source).
- Implement visual question answering (VQA): given an image and a text question, generate an answer.
- Implement OCR for extracting text from images:
  - Use a dedicated OCR engine (Tesseract, EasyOCR, or PaddleOCR) for document images.
  - Use the vision-language model for scene text and handwritten content.
- Handle image preprocessing: resizing, format conversion, and EXIF orientation correction (see the sketch after this list).
- Support image formats: JPEG, PNG, WebP, TIFF, BMP.
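A minimal preprocessing sketch using Pillow; the 2048-pixel size cap and the RGB conversion are placeholder choices, not project requirements:

```python
from pathlib import Path
from PIL import Image, ImageOps

MAX_SIDE = 2048  # placeholder cap; tune for your vision-language model's input resolution

def preprocess_image(path: str | Path) -> Image.Image:
    """Load an image, fix EXIF orientation, normalize the mode, and downscale if needed."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)   # apply EXIF orientation so the model sees the image upright
    img = img.convert("RGB")             # drop alpha/palette modes for consistent model input
    if max(img.size) > MAX_SIDE:         # bound latency and memory for very large uploads
        img.thumbnail((MAX_SIDE, MAX_SIDE), Image.Resampling.LANCZOS)
    return img
```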
1.2 Document Processing Pipeline
- Process complex documents (PDFs, Word documents) that contain:
  - Running text paragraphs.
  - Tables (extract to structured format).
  - Figures and charts (extract images, generate captions).
  - Headers and structural elements.
- For each extracted element, maintain:
  - The content (text or image).
  - The type (paragraph, table, figure, heading).
  - The page number and bounding box (where applicable).
  - The relationship to surrounding content (e.g., a figure's caption, a table's title).
- Use appropriate tools: pymupdf or pdfplumber for PDF parsing, python-docx for Word documents, and camelot or tabula-py for table extraction (see the sketch below).
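A minimal sketch of PDF element extraction with pymupdf (imported as fitz). The DocumentElement dataclass is an assumed shape, and table extraction, heading detection, and caption linking are deliberately left out:

```python
from dataclasses import dataclass
import fitz  # PyMuPDF

@dataclass
class DocumentElement:        # assumed shape; extend with bounding boxes, captions, etc.
    content: str | bytes
    element_type: str         # only "paragraph" or "figure" in this sketch
    page: int

def extract_elements(pdf_path: str) -> list[DocumentElement]:
    """Pull text blocks and embedded images from a PDF, page by page."""
    elements: list[DocumentElement] = []
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc, start=1):
        # Text blocks come back as (x0, y0, x1, y1, text, block_no, block_type) tuples
        for block in page.get_text("blocks"):
            text = block[4].strip()
            if text:
                elements.append(DocumentElement(text, "paragraph", page_num))
        # Embedded images: the first entry of each tuple is the image xref
        for img in page.get_images(full=True):
            pix = fitz.Pixmap(doc, img[0])
            if pix.n - pix.alpha > 3:        # convert CMYK and similar color spaces to RGB
                pix = fitz.Pixmap(fitz.csRGB, pix)
            elements.append(DocumentElement(pix.tobytes("png"), "figure", page_num))
    doc.close()
    return elements
```

Table extraction (camelot) and heading heuristics (e.g., font sizes from page.get_text("dict")) would layer on top of this.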
1.3 Multimodal Embedding
- Generate embeddings for both text and images in a shared embedding space.
- Use a model such as CLIP (openai/clip-vit-large-patch14) or SigLIP for joint text-image embeddings.
- For text-only content, also generate text embeddings using a sentence transformer (for higher-quality text retrieval).
- Store all embeddings in a vector database with metadata indicating the modality (see the embedding sketch below).
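A sketch of joint text-image embeddings using the transformers CLIP classes with the model recommended above; unit-normalizing the vectors (so cosine similarity becomes a dot product) is an assumption, not a requirement:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def embed_text(text: str) -> np.ndarray:
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize for cosine similarity
    return feats[0].cpu().numpy()

@torch.no_grad()
def embed_image(image: Image.Image) -> np.ndarray:
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats[0].cpu().numpy()
```

Both functions return vectors in the same 768-dimensional space, so text-to-image retrieval is a nearest-neighbor search over stored image vectors using a text query vector.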
1.4 Input Validation
- Validate all inputs (see the image-validation sketch after this list):
  - Images: check file format, size limits (e.g., max 20 MB), resolution limits, and detect corrupt files.
  - Text: length limits, encoding validation.
  - Documents: format validation, page count limits, malware scanning (optional).
- Return informative error messages for invalid inputs.
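A hedged validation sketch for image uploads; the limits mirror the suggestions above, and Pillow's verify() serves as a cheap corruption check (the file must be re-opened afterwards before use):

```python
from pathlib import Path
from PIL import Image, UnidentifiedImageError

ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP", "TIFF", "BMP"}
MAX_BYTES = 20 * 1024 * 1024     # 20 MB, matching the suggested limit
MAX_PIXELS = 40_000_000          # placeholder resolution cap

def validate_image(path: str | Path) -> tuple[bool, str]:
    """Return (ok, message); the message explains the rejection when ok is False."""
    path = Path(path)
    if path.stat().st_size > MAX_BYTES:
        return False, "Image exceeds the 20 MB size limit."
    try:
        with Image.open(path) as img:
            img.verify()                         # raises on many truncated/corrupt files
        with Image.open(path) as img:            # re-open: verify() leaves the file unusable
            if img.format not in ALLOWED_FORMATS:
                return False, f"Unsupported image format: {img.format}."
            if img.width * img.height > MAX_PIXELS:
                return False, "Image resolution exceeds the configured limit."
    except (UnidentifiedImageError, OSError):
        return False, "File is not a readable image."
    return True, "ok"
```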
Deliverables
- Working image understanding pipeline: analyze_image(image_path, question=None) -> AnalysisResult.
- Working document processing pipeline: process_document(file_path) -> list[DocumentElement].
- Multimodal embedding generation: embed(content, modality) -> np.ndarray.
- Test suite with at least 20 test images and 5 test documents of varying complexity.
Milestone 2: Multimodal Retrieval System (Week 2-3)
Objectives
Build a retrieval system that can search across text and image content using natural language queries, image queries, or combined queries.
Requirements
2.1 Knowledge Base Ingestion
- Build an ingestion pipeline that processes a collection of documents and images into the multimodal knowledge base (see the sketch after this list).
- For each ingested item, store:
  - The original content (or a reference to it).
  - Text embedding(s).
  - Image embedding(s), if the content contains images.
  - CLIP embedding (for cross-modal retrieval).
  - Metadata: source, type, timestamp, page number, section, tags.
- Ingest at least 200 documents or 500 items for your chosen application track.
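One way to store multiple embeddings per item is Qdrant's named vectors, so a single point carries both a text embedding and a CLIP embedding. The collection name, vector sizes, and payload fields below are assumptions to adapt:

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "knowledge_base"    # assumed collection name

client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config={
        "text": VectorParams(size=1024, distance=Distance.COSINE),  # bge-large-en-v1.5
        "clip": VectorParams(size=768, distance=Distance.COSINE),   # CLIP ViT-L/14
    },
)

def ingest_item(text_vec: list[float], clip_vec: list[float], payload: dict) -> None:
    """Upsert one knowledge-base item; the payload carries modality, source, page, tags, etc."""
    client.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=str(uuid.uuid4()),
                            vector={"text": text_vec, "clip": clip_vec},
                            payload=payload)],
    )
```

Cross-modal search then targets one named vector at a time, e.g. client.search(COLLECTION, query_vector=("clip", query_vec), limit=top_k) for text-to-image retrieval.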
2.2 Cross-Modal Retrieval
- Implement the following retrieval modes:
  - Text-to-text: standard text search (dense + BM25 hybrid).
  - Text-to-image: find relevant images given a text query (using CLIP embeddings).
  - Image-to-text: find relevant text given an image query (using CLIP embeddings).
  - Image-to-image: find visually similar images (using CLIP or a vision encoder).
- Implement a unified search interface that automatically determines the appropriate retrieval mode(s) based on the query.
2.3 Multi-Modal Fusion
- When a query involves both text and images (e.g., "Find documents similar to this one" with an uploaded document containing text and figures), combine retrieval results across modalities.
- Implement a fusion strategy:
  - Reciprocal Rank Fusion across modalities (see the sketch after this list).
  - Or a weighted combination where the weights depend on the query type.
- Retrieved results should include both text chunks and relevant images, with appropriate metadata.
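Reciprocal Rank Fusion is only a few lines; k=60 is the commonly used constant, and the string IDs stand in for whatever identifiers your retrieval pipelines return:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of item IDs into a single ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in, so items
    ranked highly by multiple modalities rise to the top of the fused list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense text, BM25, and text-to-image (CLIP) rankings
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # dense text retrieval
    ["doc1", "doc4", "doc3"],   # BM25
    ["img2", "doc3", "img5"],   # text-to-image via CLIP
])
```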
2.4 Retrieval Evaluation
- Create a test set of at least 30 multimodal queries with relevance judgments.
- Include queries that require:
  - Text-only retrieval (10 queries).
  - Image retrieval from a text description (10 queries).
  - Cross-modal retrieval, where relevant information spans both text and images (10 queries).
- Report Recall@5, MRR@10, and NDCG@10 for each retrieval mode (see the metric sketch after this list).
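The three reported metrics are simple enough to implement directly; this sketch assumes binary relevance judgments keyed by item ID:

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Average each metric over the queries of a given mode to produce the per-mode numbers for the report.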
Deliverables
- Multimodal knowledge base with ingested content.
- Unified retrieval interface: retrieve(query_text=None, query_image=None, top_k=10) -> list[RetrievedItem].
- Retrieval evaluation results by modality.
- Documentation of the fusion strategy and its rationale.
Milestone 3: Agent Capabilities (Week 3-4)
Objectives
Build an agent layer that can reason about multimodal inputs, use tools, and execute multi-step tasks.
Requirements
3.1 Intent Classification and Routing
- Implement an intent classifier that categorizes user requests into types:
  - Question answering: the user asks a question about uploaded or stored content.
  - Search: the user wants to find specific content in the knowledge base.
  - Analysis: the user wants a detailed analysis of an uploaded image or document.
  - Comparison: the user wants to compare two or more items.
  - Generation: the user wants to create new content based on existing content.
  - Conversation: general chat or clarification.
- Routing can be rule-based, LLM-based (using the LLM to classify intent), or handled by a lightweight classifier.
3.2 Tool Integration
- Define and implement at least four of the following tools that the agent can invoke:
  - search_knowledge_base(query, modality_filter, top_k) -- Retrieve from the knowledge base.
  - analyze_image(image, question) -- Get detailed analysis of an image.
  - extract_text(image_or_document) -- OCR and text extraction.
  - compare_items(item_a, item_b, criteria) -- Compare two items along specified criteria.
  - summarize(content, style, max_length) -- Summarize text or document content.
  - (Optional) generate_image(prompt, style) -- Generate an image using a diffusion model or API.
- Each tool must have a clear schema (name, description, parameters, return type) that the agent can reason about (see the example schema after this list).
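Tool schemas can be plain dictionaries in the JSON-Schema style that most tool-calling APIs expect. The example below describes search_knowledge_base from the list above; the parameter details and the registry layout are illustrative choices:

```python
def search_knowledge_base(query: str, modality_filter: str = "any", top_k: int = 10) -> list[dict]:
    """Stub: call the Milestone 2 retrieval interface here."""
    raise NotImplementedError

SEARCH_TOOL_SCHEMA = {
    "name": "search_knowledge_base",
    "description": "Search the multimodal knowledge base and return the most relevant text chunks and images.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query."},
            "modality_filter": {"type": "string", "enum": ["text", "image", "any"],
                                "description": "Restrict results to one modality."},
            "top_k": {"type": "integer", "default": 10, "minimum": 1, "maximum": 50},
        },
        "required": ["query"],
    },
}

# Registry the agent controller can iterate over: schema for the LLM, fn for execution
TOOLS = {"search_knowledge_base": {"schema": SEARCH_TOOL_SCHEMA, "fn": search_knowledge_base}}
```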
3.3 Multi-Step Reasoning
- Implement a ReAct-style agent loop (see the sketch after this list):
  1. Think: the LLM reasons about what to do next.
  2. Act: the LLM selects a tool and provides arguments.
  3. Observe: the tool result is returned to the LLM.
  4. Repeat until the task is complete or a maximum number of steps is reached.
- The agent should be able to handle tasks requiring 2-5 tool calls. For example:
  - "Compare the performance metrics in the charts on page 3 and page 7 of this report" requires extracting both charts, analyzing each, and then comparing them.
  - "Find all products similar to this image and summarize their key features" requires image search, text retrieval, and summarization.
- Implement a maximum step limit (e.g., 10) with graceful termination.
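A stripped-down ReAct loop. The call_llm helper, its message format, and the decision dictionary are placeholders for whichever LLM client and prompt format you adopt; the TOOLS registry is the one sketched in 3.2:

```python
import json

MAX_STEPS = 10

def call_llm(messages: list[dict], tool_schemas: list[dict]) -> dict:
    """Stub: send the conversation plus tool schemas to your LLM and parse its reply into
    {"type": "tool_call", "tool_name": ..., "arguments": {...}} or
    {"type": "final_answer", "content": ...}."""
    raise NotImplementedError

def run_agent(user_message: str, tools: dict) -> str:
    """ReAct-style loop: think, act (call a tool), observe, repeat until done or out of steps."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_STEPS):
        decision = call_llm(messages, [t["schema"] for t in tools.values()])
        if decision["type"] == "final_answer":               # the model decided no more tools are needed
            return decision["content"]
        tool = tools[decision["tool_name"]]
        try:
            observation = tool["fn"](**decision["arguments"])      # act
        except Exception as exc:                                    # surface tool failures to the model
            observation = f"Tool error: {exc}"
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "tool", "content": str(observation)})   # observe
    return "I could not complete this task within the step limit."
```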
3.4 Conversation Management
- Maintain conversation state across multiple turns.
- Support multimodal conversation history: the agent should remember and reference previously uploaded images and documents within the session.
- Implement context window management: summarize or truncate older conversation history when approaching the model's context limit (see the sketch after this list).
- Support session persistence (store conversations in a database for later retrieval).
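A simple truncation policy that keeps the most recent turns under a rough token budget; the four-characters-per-token estimate is a crude heuristic, so swap in a real tokenizer if you need accuracy:

```python
def truncate_history(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Keep the most recent messages whose estimated token count fits the budget."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):                        # walk backwards from the newest turn
        est_tokens = len(msg.get("content", "")) // 4     # rough characters-to-tokens estimate
        if used + est_tokens > max_tokens:
            break
        kept.append(msg)
        used += est_tokens
    return list(reversed(kept))
```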
3.5 Error Handling and Recovery
- When a tool call fails, the agent should:
  - Log the error.
  - Attempt an alternative approach (different tool, different parameters).
  - Inform the user if recovery is not possible.
- When the agent cannot answer a question, it should clearly state this rather than fabricating an answer.
Deliverables
- Agent controller with tool integration and multi-step reasoning.
- At least 10 example multi-turn conversations demonstrating agent capabilities across different intents.
- Documentation of the tool schemas and agent prompts.
- Error handling test cases (at least 5 scenarios).
Milestone 4: Safety, Guardrails, and Content Moderation (Week 4-5)
Objectives
Implement comprehensive safety measures for a multimodal system, addressing risks unique to each modality.
Requirements
4.1 Text Guardrails
- Implement the following, reusing and extending techniques from Capstone 1 where applicable:
  - Input validation and sanitization.
  - Prompt injection detection.
  - PII detection and redaction (see the sketch after this list).
  - Toxicity filtering on outputs.
  - Topic boundary enforcement (keep responses within the application's domain).
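Regex-based redaction only catches the most mechanical PII (emails, phone-like numbers, US SSN formats); treat this sketch as a floor and consider a dedicated library such as Presidio for stronger coverage:

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with typed placeholders and return the PII types found."""
    found: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[REDACTED_{label}]", text)
    return text, found
```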
4.2 Image Guardrails
- Implement content moderation for uploaded images:
  - NSFW detection using a pre-trained classifier (e.g., Falconsai/nsfw_image_detection or the safety_checker from Stable Diffusion); see the sketch after this list.
  - Violence/gore detection.
  - Configurable strictness levels.
- For generated image descriptions:
  - Verify descriptions are accurate and not hallucinated.
  - Ensure descriptions do not contain harmful stereotypes or biases.
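The Falconsai/nsfw_image_detection checkpoint named above can be loaded through the transformers image-classification pipeline. The 0.8 threshold and the exact label string checked here are assumptions to verify against the model card:

```python
from PIL import Image
from transformers import pipeline

nsfw_classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def is_image_allowed(image: Image.Image, threshold: float = 0.8) -> bool:
    """Reject images the classifier labels as NSFW above the configured threshold."""
    results = nsfw_classifier(image)        # e.g. [{"label": "nsfw", "score": 0.97}, ...]
    for result in results:
        if result["label"].lower() == "nsfw" and result["score"] >= threshold:
            return False
    return True
```

Exposing the threshold through configuration gives you the required strictness levels.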
4.3 Multimodal Safety
- Address cross-modal attacks:
  - Images with hidden text designed to manipulate the vision-language model (visual prompt injection).
  - Documents with malicious content embedded in images.
- Implement output consistency checks: if the system references an image, verify the textual description matches the visual content.
4.4 Bias Monitoring
- Implement basic bias monitoring for the vision-language pipeline:
  - Track whether image descriptions exhibit demographic biases (e.g., making assumptions about gender or ethnicity from appearance).
  - Log demographic-related terms in generated descriptions for periodic review.
- Document known biases in the models you use and mitigation strategies.
4.5 Privacy
- Implement data retention policies: configurable auto-deletion of uploaded content after a specified period.
- Ensure uploaded images and documents are not sent to external APIs without user consent (provide a local-only mode).
- Log access to stored content for audit purposes.
Deliverables
- Text and image guardrail modules with configurable strictness.
- Adversarial test suite: at least 15 test cases for text attacks, 10 for image attacks, and 5 for cross-modal attacks.
- Bias monitoring report (even if preliminary).
- Privacy policy documentation for the application.
Milestone 5: Deployment and User Interface (Week 5-6)
Objectives
Deploy the complete application with a user-friendly interface and production-grade infrastructure.
Requirements
5.1 User Interface
- Build a web-based user interface using Streamlit, Gradio, or React.
- The UI must support:
  - Text input (chat-style).
  - Image upload (drag-and-drop or file picker).
  - Document upload (PDF, Word).
  - Display of multimodal responses: text with inline images, tables, and citations.
  - Conversation history with the ability to reference previous messages.
  - Loading indicators and streaming text display.
  - Feedback mechanism (thumbs up/down on responses).
5.2 API Design
- Implement a complete REST API (FastAPI) with the following endpoints:
  - POST /chat -- Send a message (text and/or images) and receive a response.
  - POST /upload -- Upload a document or image to the knowledge base.
  - POST /search -- Search the knowledge base.
  - GET /session/{session_id} -- Retrieve conversation history.
  - DELETE /session/{session_id} -- Delete a session and its data.
  - GET /health -- Health check.
  - GET /metrics -- Application metrics.
- All endpoints should have proper Pydantic schemas, error handling, and authentication (see the /chat sketch after this list).
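A hedged sketch of the /chat endpoint with Pydantic request/response models. The field names, the auth dependency, and the run_agent/TOOLS call (from the Milestone 3 sketch) are placeholders for your own implementations:

```python
from fastapi import Depends, FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="Multimodal Assistant API")

class ChatRequest(BaseModel):
    session_id: str
    message: str = Field(min_length=1, max_length=8000)
    image_ids: list[str] = []            # references to previously uploaded images

class ChatResponse(BaseModel):
    answer: str
    citations: list[str] = []
    tool_calls: list[str] = []

def require_api_key() -> None:
    """Stub auth dependency: validate an API key header and raise HTTPException(401) on failure."""

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest, _: None = Depends(require_api_key)) -> ChatResponse:
    try:
        answer = run_agent(request.message, tools=TOOLS)   # agent loop from Milestone 3
    except Exception:
        raise HTTPException(status_code=500, detail="The agent failed to produce a response.")
    return ChatResponse(answer=answer)
```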
5.3 Infrastructure
- Provide a Docker Compose configuration that launches all services:
  - Application server (FastAPI + agent logic).
  - Vector database (Qdrant or ChromaDB).
  - Metadata database (PostgreSQL or SQLite).
  - (Optional) Local model server (Ollama or vLLM).
- Document all environment variables and configuration options.
- Implement graceful startup: wait for dependencies (database, model server) before accepting requests (see the sketch after this list).
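Graceful startup can be a simple poll-until-healthy loop run before the API starts serving; the URL, the health-check path, and the timeouts below are placeholders for your own services:

```python
import time
import httpx

def wait_for_service(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> None:
    """Block until the dependency answers with HTTP 200, or raise after the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if httpx.get(url, timeout=5.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass                          # dependency not reachable yet; keep polling
        time.sleep(interval_s)
    raise RuntimeError(f"Dependency at {url} did not become healthy within {timeout_s}s")

# Example (hypothetical service URL): wait_for_service("http://qdrant:6333/readyz")
```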
5.4 Performance
- Benchmark end-to-end latency for common operations:
  - Text-only question: target < 5 seconds.
  - Image analysis question: target < 10 seconds.
  - Document ingestion (10-page PDF): target < 60 seconds.
  - Knowledge base search: target < 2 seconds.
- Identify and document performance bottlenecks.
- Implement caching where appropriate (e.g., an embedding cache for repeated queries; see the sketch after this list).
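A minimal in-process embedding cache keyed by a content hash; for multi-worker deployments you would likely swap this for Redis or another shared cache:

```python
import hashlib
import numpy as np

_embedding_cache: dict[str, np.ndarray] = {}

def cached_embed(text: str, embed_fn) -> np.ndarray:
    """Return the cached embedding when the exact same text has been embedded before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)   # e.g. the embed_text sketch from Milestone 1
    return _embedding_cache[key]
```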
5.5 Monitoring
- Implement structured logging for all operations.
- Track and report:
  - Request volume by type.
  - Latency breakdown by pipeline stage (vision, retrieval, generation).
  - Error rates by type and severity.
  - Model usage and cost estimates.
  - Storage usage (knowledge base size).
  - User feedback statistics.
Deliverables
- Working web application accessible via browser.
- Docker Compose configuration for one-command deployment.
- Performance benchmark report.
- API documentation.
Milestone 6: Evaluation, Documentation, and Presentation (Week 6-7)
Objectives
Conduct thorough evaluation and produce professional documentation.
Requirements
6.1 End-to-End Evaluation
- Create a comprehensive test suite of at least 50 scenarios covering:
  - Text-only questions about ingested content (15 scenarios).
  - Questions about specific images (10 scenarios).
  - Questions requiring cross-modal reasoning (10 scenarios).
  - Multi-step tasks requiring agent capabilities (10 scenarios).
  - Edge cases and error scenarios (5 scenarios).
- For each scenario, define:
  - Input (text, image, or both).
  - Expected behavior (correct answer, appropriate tool usage, proper error handling).
  - Evaluation criteria (correctness, relevance, citation accuracy, response quality).
- Report results in a structured evaluation table.
6.2 Human Evaluation
- Recruit 2-3 evaluators (can be teammates or volunteers).
- Have each evaluator interact with the system on 20 tasks (covering all scenario types).
- Collect ratings (1-5) for:
  - Correctness: Is the response factually accurate?
  - Relevance: Does the response address the user's actual question?
  - Multimodal integration: Does the system effectively use information from both text and images?
  - Usability: Is the system easy to use and the interface intuitive?
  - Trust: Would the user trust this system for their use case?
- Compute average scores and inter-rater agreement.
6.3 Comparative Analysis
- Compare your system's performance to:
  - A text-only baseline (same system without image understanding).
  - A no-agent baseline (direct LLM call without tool use).
  - (If budget allows) A commercial API baseline (GPT-4o or Claude with vision).
- Present comparisons in tables and/or charts.
6.4 Model Card and System Documentation
- Produce a model card for the overall system including:
  - Intended use cases and users.
  - Input types and limitations.
  - Known failure modes (with examples).
  - Bias and fairness considerations.
  - Environmental impact estimate.
- Produce a system documentation package:
  - Architecture document with diagrams.
  - Deployment guide.
  - User guide.
  - API reference.
6.5 Final Report
The report should be 10-15 pages and include:
- Introduction (1 page): Problem statement, chosen application track, motivation.
- System Design (2 pages): Architecture, technology choices, design decisions with rationale.
- Multimodal Processing (2 pages): Image understanding, document processing, multimodal embedding approach.
- Retrieval and Agent (2 pages): Cross-modal retrieval, agent design, tool integration, multi-step reasoning.
- Safety and Guardrails (1-2 pages): Approach, adversarial testing results, bias monitoring.
- Evaluation (2-3 pages): All metrics, human evaluation, comparative analysis.
- Lessons Learned (1 page): What worked, what was harder than expected, what you would do differently.
- Limitations and Future Work (1 page): Honest assessment and concrete next steps.
6.6 Presentation and Demo
- Prepare a 20-minute presentation covering the key aspects of the project.
- Include a live demo showing:
  - A text-only interaction.
  - An image understanding interaction.
  - A multi-step task requiring agent capabilities.
  - A cross-modal search.
  - A guardrail activation.
- Be prepared for Q&A.
Deliverables
- Evaluation results (automated and human).
- Model card and system documentation.
- Final report (PDF).
- Presentation slides and live demo (or recorded video).
- Complete code repository with README.
Grading Rubric
| Component | Weight | Criteria |
|---|---|---|
| Multimodal Input Processing | 15% | Image understanding pipeline works correctly, document processing handles multiple formats, multimodal embeddings generated, input validation implemented. |
| Retrieval System | 15% | Cross-modal retrieval works (text-to-image, image-to-text), fusion strategy implemented, retrieval evaluation with metrics. |
| Agent Capabilities | 20% | Agent can handle multi-step tasks, tools are well-designed, conversation management works, error handling is graceful. |
| Safety and Guardrails | 15% | Text and image guardrails implemented, adversarial testing performed, privacy considerations addressed, bias monitoring present. |
| Deployment and UI | 15% | User interface is functional and intuitive, API is well-designed, containerized deployment works, performance benchmarks reported. |
| Evaluation and Documentation | 20% | Comprehensive evaluation with human ratings, comparative analysis, thorough documentation, honest limitations discussion, professional report. |
Grade Thresholds
- A (90-100%): The system demonstrates genuine multimodal understanding across all components. Agent successfully handles complex multi-step tasks. Evaluation is thorough and insightful. Safety measures are comprehensive. Documentation is publication-quality.
- B (80-89%): All milestones completed with good quality. The system handles standard use cases well but may struggle with edge cases. Evaluation covers all required aspects but could go deeper.
- C (70-79%): Core multimodal functionality works but some milestones have gaps. Agent capabilities are limited. Evaluation is incomplete. Documentation is adequate but not thorough.
- D (60-69%): Basic functionality exists but significant milestones are incomplete. System is fragile and handles only simple cases.
- F (<60%): System is non-functional or fails to demonstrate meaningful multimodal capabilities.
Technical Recommendations
Recommended Technology Stack
| Component | Recommended | Alternative |
|---|---|---|
| Vision-Language Model | LLaVA 1.6 (7B) via Ollama | GPT-4o API, Claude API |
| Text LLM | Llama 3.1 8B via Ollama | GPT-4o-mini API |
| Image Embeddings | CLIP ViT-L/14 | SigLIP, OpenCLIP |
| Text Embeddings | bge-large-en-v1.5 | gte-large, e5-large |
| OCR | PaddleOCR or EasyOCR | Tesseract, DocTR |
| Vector DB | Qdrant | ChromaDB, Weaviate |
| Web UI | Gradio | Streamlit, React |
| API | FastAPI | Flask |
| Agent Framework | Custom (recommended for learning) | LangChain, LlamaIndex |
Compute Requirements
- Minimum: 1 GPU with 12GB VRAM (e.g., RTX 3060) for running CLIP + small LLaVA. Use API-based LLM for text generation.
- Recommended: 1 GPU with 24GB VRAM (e.g., RTX 3090/4090) for running LLaVA 7B + local LLM.
- Cloud option: Use API-based models (GPT-4o, Claude) for both vision and text, requiring no GPU but incurring API costs.
Getting Started
mkdir -p multimodal-capstone/{src,tests,data,configs,docs}
cd multimodal-capstone
python -m venv .venv
source .venv/bin/activate
# Core dependencies
pip install torch torchvision transformers accelerate
pip install fastapi uvicorn gradio
pip install qdrant-client sentence-transformers
pip install open-clip-torch # For CLIP embeddings
pip install Pillow pymupdf python-docx "camelot-py[cv]"
pip install pydantic structlog pytest httpx
# Optional: local model serving
pip install ollama # Python client for Ollama
Advice
- Start with the simplest multimodal interaction and build up. Get a basic "upload image, ask question, get answer" loop working before adding retrieval, agents, and guardrails.
- Use API-based models initially (GPT-4o, Claude) to validate your architecture, then swap in local models if needed for cost or privacy reasons.
- The agent layer is the hardest part. Budget extra time for it. Start with a simple two-tool agent before adding more tools and complexity.
- Multimodal evaluation is challenging. There is no single metric that captures multimodal system quality. Invest in building a good test suite and conducting honest human evaluation.
- Keep a project journal. Document design decisions, failed experiments, and pivots. This will be invaluable for your final report and will make the difference between a good project and a great one.
- Divide work by pipeline, not by milestone. In a team, have one person own the vision pipeline, one own retrieval, one own the agent, and one own deployment/UI. Each person should understand the full system but be responsible for their component's quality.