(n_out, n_in) - **a**^[l-1] has shape **(64, 16)** --- (n_in, batch_size) - **b**^[l] has shape **(32, 1)** --- (n_out, 1), broadcast across the 16 columns - **z**^[l] has shape **(32, 16)** --- (n_out, batch_size) → Quiz: Neural Networks from Scratch
1. ML/AI Research Scientist
Focus: Advance the state of the art through novel algorithms and architectures. - Skills: Deep mathematical foundations, experimental design, scientific writing. - Trajectory: PhD → research lab → principal researcher → research director. → Chapter 40: The Future of AI Engineering
1. Talent
The local job market (a mid-sized city in the southeastern United States) had limited AI/ML talent. - Competing for experienced ML engineers against major tech companies and well-funded startups was difficult. - Existing engineering staff had strong software skills but limited ML experience. → Case Study 2: Building an AI Engineering Team from Scratch
1.1 Data Collection
Collect raw domain text from at least two different sources. - Document each source: URL, license/terms of use, approximate size, collection method. - Collect at least 50,000 raw text passages (before filtering). → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.1 Document Parsing
Support at least three document formats: PDF, HTML, and Markdown. - Use appropriate parsing libraries (e.g., `pymupdf` or `pdfplumber` for PDF, `beautifulsoup4` for HTML). - Extract and preserve document structure: titles, headings, paragraphs, tables, and lists. - Handle encoding issues, malformed → Capstone Project 1: Build a Production RAG System with Guardrails
1.1 Image Understanding Pipeline
Implement image captioning using a vision-language model. Recommended models: - LLaVA 1.6 (open source, strong performance). - InternVL2 (open source, strong on benchmarks). - GPT-4o / Claude via API (highest quality, closed source). - Implement visual question answering (VQA): given an image and a → Capstone Project 3: End-to-End Multimodal AI Application
1.2 Chunking Strategy
Implement at least two chunking strategies: - Fixed-size chunking with configurable overlap (e.g., 512 tokens with 64-token overlap). - Semantic chunking that respects document structure (split at heading boundaries, paragraph breaks). - Each chunk must carry metadata: source document ID, chunk inde → Capstone Project 1: Build a Production RAG System with Guardrails
1.2 Data Cleaning and Filtering
Remove duplicates (exact and near-duplicate using MinHash or similar). - Filter for quality using at least two heuristics: - Minimum length (e.g., at least 50 words). - Language detection (ensure all text is in the target language). - Perplexity filtering: use a reference language model to remove te → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.2 Document Processing Pipeline
Process complex documents (PDFs, Word documents) that contain: - Running text paragraphs. - Tables (extract to structured format). - Figures and charts (extract images, generate captions). - Headers and structural elements. - For each extracted element, maintain: - The content (text or image). - The → Capstone Project 3: End-to-End Multimodal AI Application
1.3 Embedding Generation
Generate dense embeddings using a sentence transformer model (e.g., `all-MiniLM-L6-v2` for prototyping, `bge-large-en-v1.5` or `gte-large` for production). - Implement batched embedding generation with progress tracking. - Store embeddings in a vector database (Qdrant, ChromaDB, or Weaviate). - Buil → Capstone Project 1: Build a Production RAG System with Guardrails
1.3 Instruction Dataset Creation
Create an instruction-tuning dataset in the conversational format expected by your chosen base model. Each example should include: - A system message (optional, defining the domain expert role). - A user instruction/question. - An assistant response. - Create examples using a combination of: - **Exi → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.3 Multimodal Embedding
Generate embeddings for both text and images in a shared embedding space. - Use a model such as CLIP (`openai/clip-vit-large-patch14`) or SigLIP for joint text-image embeddings. - For text-only content, also generate text embeddings using a sentence transformer (for higher-quality text retrieval). - → Capstone Project 3: End-to-End Multimodal AI Application
1.4 Data Format
Store data in JSONL format with the following schema: ```json { "messages": [ {"role": "system", "content": "You are an expert in [domain]..."}, {"role": "user", "content": "What is..."}, {"role": "assistant", "content": "Based on..."} ], "source": "pubmed_qa", "quality_score": 0.92 } ``` - Create a → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.4 Input Validation
Validate all inputs: - Images: Check file format, size limits (e.g., max 20MB), resolution limits, corrupt file detection. - Text: Length limits, encoding validation. - Documents: Format validation, page count limits, malware scanning (optional). - Return informative error messages for invalid input → Capstone Project 3: End-to-End Multimodal AI Application
1.4 Metadata Storage
Store document-level metadata in a relational database (SQLite for development, PostgreSQL for production). - Track: document ID, filename, source URL, ingestion time, number of chunks, format, file hash (for deduplication). - Implement idempotent ingestion: re-ingesting the same document should upd → Capstone Project 1: Build a Production RAG System with Guardrails
2. Infrastructure
No GPU compute infrastructure existed. - The data warehouse was a traditional SQL Server-based system optimized for reporting, not for ML training workloads. - No experiment tracking, model registry, or model serving infrastructure was in place. → Case Study 2: Building an AI Engineering Team from Scratch
2. ML/AI Engineer
Focus: Build and deploy production ML systems. - Skills: Software engineering, MLOps, system design, debugging at scale. - Trajectory: SDE → ML engineer → senior/staff ML engineer → engineering manager. → Chapter 40: The Future of AI Engineering
2.1 Base Model Selection
Choose an open-source base model. Recommended options: - **Llama 3.1 8B Instruct**: Strong baseline, well-documented. - **Mistral 7B Instruct v0.3**: Efficient architecture with sliding window attention. - **Gemma 2 9B Instruct**: Strong performance per parameter. - Document your selection rationale → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.1 Knowledge Base Ingestion
Build an ingestion pipeline that processes a collection of documents and images into the multimodal knowledge base. - For each ingested item, store: - The original content (or a reference to it). - Text embedding(s). - Image embedding(s), if the content contains images. - CLIP embedding (for cross-m → Capstone Project 3: End-to-End Multimodal AI Application
2.1 Query Processing
Implement query rewriting using an LLM: given a potentially ambiguous user query, generate an improved search query. - Implement query expansion: generate 2-3 variant queries to increase recall. - Support conversation context: for follow-up questions, resolve coreferences using chat history. → Capstone Project 1: Build a Production RAG System with Guardrails
2.2 Cross-Modal Retrieval
Implement the following retrieval modes: - **Text-to-text**: Standard text search (dense + BM25 hybrid). - **Text-to-image**: Find relevant images given a text query (using CLIP embeddings). - **Image-to-text**: Find relevant text given an image query (using CLIP embeddings). - **Image-to-image**: F → Capstone Project 3: End-to-End Multimodal AI Application
2.2 Hybrid Retrieval
Implement dense retrieval using the vector database (cosine similarity or dot product). - Implement sparse retrieval using BM25. - Combine results using Reciprocal Rank Fusion (RRF): `score(d) = sum_r 1 / (k + rank_r(d))` where k is typically 60 and the sum is over all retrieval methods. - Retrieve → Capstone Project 1: Build a Production RAG System with Guardrails
2.2 LoRA Configuration
Implement fine-tuning using the HuggingFace `peft` and `trl` libraries. - Configure LoRA with the following as starting points (then experiment): - Rank (r): 8, 16, 32, 64. - Alpha: 16, 32 (typically 2x rank). - Target modules: attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and optio → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.3 Multi-Modal Fusion
When a query involves both text and images (e.g., "Find documents similar to this one" with an uploaded document containing text and figures), combine retrieval results across modalities. - Implement a fusion strategy: - Reciprocal Rank Fusion across modalities. - Or a weighted combination where wei → Capstone Project 3: End-to-End Multimodal AI Application
2.3 Re-Ranking
Implement a cross-encoder re-ranker (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) that scores each (query, passage) pair. - The re-ranker should take the top-20 fused results and output the top-5. - Log retrieval latency breakdown: dense search time, sparse search time, fusion time, re-ranking time → Capstone Project 1: Build a Production RAG System with Guardrails
2.3 Training Configuration
Optimizer: AdamW with weight decay 0.01. - Learning rate: Sweep over {5e-6, 1e-5, 2e-5, 5e-5} with cosine schedule. - Warmup: 10% of total steps. - Batch size: Effective batch size of 16-64 (use gradient accumulation as needed). - Maximum sequence length: 2048 tokens (or 4096 if resources allow). - → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.4 Experiment Tracking
Track all experiments using Weights & Biases (wandb) or MLflow. - For each run, log: hyperparameters, training loss curve, validation loss curve, learning rate schedule, GPU utilization, peak memory, training time. - Run at least 6 experiments varying: - LoRA rank (at least 3 values). - Learning rat → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.4 Retrieval Evaluation
Implement retrieval evaluation using a manually curated set of at least 25 query-document relevance pairs. - Compute metrics: Recall@5, Recall@10, MRR@10, NDCG@10. - Compare: dense-only, sparse-only, hybrid without re-ranking, and hybrid with re-ranking. - Present results in a table showing the impa → Capstone Project 1: Build a Production RAG System with Guardrails
2.5 Training Best Practices
Implement early stopping based on validation loss. - Save checkpoints at regular intervals. - Use gradient clipping (max norm 1.0). - Monitor for catastrophic forgetting by periodically evaluating on a general-purpose benchmark (e.g., a subset of MMLU). → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Focus: Build the platforms, frameworks, and tooling that ML teams depend on. - Skills: Distributed systems, GPU programming, compiler optimization, cloud architecture. - Trajectory: Systems engineer → AI infra engineer → architect → VP of engineering. → Chapter 40: The Future of AI Engineering
3. Organizational Readiness
Business stakeholders had vague expectations ("We need AI") without specific, well-defined use cases. - Regulatory compliance (banking regulators require model risk management, explainability, and audit trails for models used in lending decisions) imposed constraints that most AI tutorials and blog → Case Study 2: Building an AI Engineering Team from Scratch
3.1 Automated Evaluation
Evaluate on the held-out test set using: - **Perplexity**: Compare base model vs. fine-tuned model on domain text. - **Generation quality**: ROUGE-1, ROUGE-L, and BERTScore against reference answers. - **Exact match / F1**: For extractive QA-style questions. - **Domain terminology accuracy**: Check → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.1 Intent Classification and Routing
Implement an intent classifier that categorizes user requests into types: - **Question answering**: User asks a question about uploaded or stored content. - **Search**: User wants to find specific content in the knowledge base. - **Analysis**: User wants detailed analysis of an uploaded image or doc → Capstone Project 3: End-to-End Multimodal AI Application
3.1 Prompt Engineering
Design a system prompt that instructs the LLM to: - Answer based only on the provided context. - Cite sources using bracketed references (e.g., [1], [2]). - State "I don't have enough information to answer this question" when the context is insufficient. - Maintain a professional, helpful tone. - Im → Capstone Project 1: Build a Production RAG System with Guardrails
3.2 Domain-Specific Benchmark
Create or adapt a domain-specific benchmark with at least 100 questions spanning: - Factual recall (e.g., "What is the standard treatment for X?"). - Reasoning (e.g., "Given these symptoms, what is the most likely diagnosis?"). - Summarization (e.g., "Summarize the key findings of this report."). - → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.2 LLM Integration
Support at least two LLM backends: - A cloud API (OpenAI GPT-4, Anthropic Claude, or equivalent). - A local model served via vLLM or Ollama (e.g., Llama 3.1 8B, Mistral 7B). - Implement a clean abstraction layer so backends can be swapped via configuration. - Support streaming responses (server-sent → Capstone Project 1: Build a Production RAG System with Guardrails
3.2 Tool Integration
Define and implement at least four tools that the agent can invoke: - `search_knowledge_base(query, modality_filter, top_k)` -- Retrieve from the knowledge base. - `analyze_image(image, question)` -- Get detailed analysis of an image. - `extract_text(image_or_document)` -- OCR and text extraction. - → Capstone Project 3: End-to-End Multimodal AI Application
3.3 Citation Generation
Each answer must include inline citations referencing specific retrieved chunks. - After the answer, include a "Sources" section listing the cited documents with their metadata (title, page number, etc.). - Implement citation verification: check that cited chunk IDs actually exist in the retrieved r → Capstone Project 1: Build a Production RAG System with Guardrails
3.3 Human Evaluation
Conduct human evaluation on at least 50 test examples. - For each example, a human rater (you, a teammate, or a recruited evaluator) scores responses from the base model and the fine-tuned model on: - **Correctness** (1-5): Is the information factually accurate? - **Helpfulness** (1-5): Does the res → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.3 Multi-Step Reasoning
Implement a ReAct-style agent loop: 1. **Think**: The LLM reasons about what to do next. 2. **Act**: The LLM selects a tool and provides arguments. 3. **Observe**: The tool result is returned to the LLM. 4. Repeat until the task is complete or a maximum number of steps is reached. - The agent should → Capstone Project 3: End-to-End Multimodal AI Application
3.4 Context Window Management
Implement intelligent context truncation: when retrieved chunks exceed the model's context window, prioritize higher-ranked chunks. - Track and log token usage for each request (prompt tokens, completion tokens, total tokens). - Estimate cost per request. → Capstone Project 1: Build a Production RAG System with Guardrails
3.4 Conversation Management
Maintain conversation state across multiple turns. - Support multimodal conversation history: the agent should remember and reference previously uploaded images and documents within the session. - Implement context window management: summarize or truncate older conversation history when approaching → Capstone Project 3: End-to-End Multimodal AI Application
3.4 Safety and Robustness Evaluation
Test for: - Hallucination rate: Fraction of responses containing fabricated information. - Refusal appropriateness: Does the model refuse to answer questions outside its expertise? - Adversarial robustness: Test with intentionally misleading or adversarial prompts. - General capability retention: Ev → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.5 Error Analysis
Manually categorize errors from the test set into types: - Factual errors (wrong information). - Incomplete answers (missing key details). - Hallucinations (fabricated information). - Format errors (wrong structure or style). - Refusal errors (refuses a legitimate question). - For each error type, p → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.5 Error Handling and Recovery
When a tool call fails, the agent should: - Log the error. - Attempt an alternative approach (different tool, different parameters). - Inform the user if recovery is not possible. - When the agent cannot answer a question, it should clearly state this rather than fabricating an answer. → Capstone Project 3: End-to-End Multimodal AI Application
3c. Client-Side Optimization
**Speculative prefetch**: The IDE plugin prefetched completions as the user typed, predicting likely pause points. - **Debouncing**: Requests were only sent after a 50ms typing pause to avoid unnecessary calls during fast typing. - **Caching**: Recent completions were cached client-side and reused i → Case Study 2: Optimizing Inference Latency for Production
4. AI Product Manager
Focus: Translate AI capabilities into user-facing products. - Skills: Product sense, technical literacy, user research, business strategy. - Trajectory: PM → AI PM → director of AI products → VP of product. → Chapter 40: The Future of AI Engineering
**Query validation**: Reject empty queries, queries exceeding maximum length, and queries that consist solely of special characters. - **PII detection**: Detect and optionally redact personally identifiable information (email addresses, phone numbers, SSNs) from user queries before they are sent to → Capstone Project 1: Build a Production RAG System with Guardrails
4.1 Merge and Export
Merge the LoRA adapter weights into the base model to create a standalone model. - Save the merged model in HuggingFace format. - Verify that the merged model produces identical outputs to the adapter-based model. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.1 Text Guardrails
Implement the following (reusing and extending techniques from Capstone 1 where applicable): - Input validation and sanitization. - Prompt injection detection. - PII detection and redaction. - Toxicity filtering on outputs. - Topic boundary enforcement (keep responses within the application's domain → Capstone Project 3: End-to-End Multimodal AI Application
4.2 Image Guardrails
Implement content moderation for uploaded images: - NSFW detection using a pre-trained classifier (e.g., `Falconsai/nsfw_image_detection` or the `safety_checker` from Stable Diffusion). - Violence/gore detection. - Configurable strictness levels. - For generated image descriptions: - Verify descript → Capstone Project 3: End-to-End Multimodal AI Application
4.2 Output Guardrails
**Faithfulness check**: Implement a check that verifies the generated answer is supported by the retrieved context. Approaches include: - NLI-based: Use a Natural Language Inference model to check entailment between context and answer. - LLM-based: Ask a separate LLM call to verify faithfulness. - * → Capstone Project 1: Build a Production RAG System with Guardrails
4.2 Quantization Methods
Apply at least two of the following quantization methods: - **GPTQ** (4-bit, 128 group size): Use the `auto-gptq` library with a calibration dataset of 128-256 examples from your training data. - **AWQ** (4-bit): Use the `autoawq` library. Compare with GPTQ on quality and speed. - **GGUF** (for llam → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.3 Fallback Behavior
When guardrails are triggered, return a graceful, informative error message rather than a generic error. - Log all guardrail activations with the triggering input/output for later review. - Implement a configurable guardrail strictness level (strict, moderate, permissive). → Capstone Project 1: Build a Production RAG System with Guardrails
4.3 Multimodal Safety
Address cross-modal attacks: - Images with hidden text designed to manipulate the vision-language model (visual prompt injection). - Documents with malicious content embedded in images. - Implement output consistency checks: if the system references an image, verify the textual description matches t → Capstone Project 3: End-to-End Multimodal AI Application
4.3 Quality-Speed Tradeoff Analysis
Create a plot showing: x-axis = inference speed (tokens/sec), y-axis = quality metric (e.g., average score), with each point labeled by quantization method and bit width. - Identify the best quantization method for your use case (the one offering the best quality-speed tradeoff). - Document any qual → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.4 Bias Monitoring
Implement basic bias monitoring for the vision-language pipeline: - Track whether image descriptions exhibit demographic biases (e.g., making assumptions about gender or ethnicity from appearance). - Log demographic-related terms in generated descriptions for periodic review. - Document known biases → Capstone Project 3: End-to-End Multimodal AI Application
4.5 Privacy
Implement data retention policies: configurable auto-deletion of uploaded content after a specified period. - Ensure uploaded images and documents are not sent to external APIs without user consent (provide a local-only mode). - Log access to stored content for audit purposes. → Capstone Project 3: End-to-End Multimodal AI Application
5. AI Safety and Alignment Researcher
Focus: Ensure AI systems behave as intended and do not cause harm. - Skills: Formal methods, game theory, interpretability, ethics, policy. - Trajectory: Research assistant → researcher → research lead → head of safety. → Chapter 40: The Future of AI Engineering
5.1 API Design
Build the API using FastAPI with the following endpoints: - `POST /query` -- Submit a question and receive an answer with sources. - `POST /query/stream` -- Submit a question and receive a streaming response (SSE). - `POST /ingest` -- Upload a document for ingestion. - `POST /feedback` -- Submit use → Capstone Project 1: Build a Production RAG System with Guardrails
5.1 Serving Engine Setup
Deploy the model using vLLM (recommended) or HuggingFace Text Generation Inference (TGI). - Configure: - Tensor parallelism (if multiple GPUs are available). - Maximum model length. - GPU memory utilization target (e.g., 90%). - Maximum number of concurrent requests. - Verify the model loads and gen → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.1 User Interface
Build a web-based user interface using Streamlit, Gradio, or React. - The UI must support: - Text input (chat-style). - Image upload (drag-and-drop or file picker). - Document upload (PDF, Word). - Display of multimodal responses: text with inline images, tables, and citations. - Conversation histor → Capstone Project 3: End-to-End Multimodal AI Application
5.2 API Design
Implement a complete REST API (FastAPI) with the following endpoints: - `POST /chat` -- Send a message (text and/or images) and receive a response. - `POST /upload` -- Upload a document or image to the knowledge base. - `POST /search` -- Search the knowledge base. - `GET /session/{session_id}` -- Re → Capstone Project 3: End-to-End Multimodal AI Application
5.2 API Layer
Build a FastAPI wrapper around the serving engine with endpoints: - `POST /generate` -- Single completion request. - `POST /generate/stream` -- Streaming completion (SSE). - `POST /chat` -- Multi-turn chat completion (handles message formatting). - `GET /health` -- Health check (model loaded, GPU av → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.2 Observability
Log every request with: request ID, timestamp, query, retrieval latency, generation latency, total latency, token count, guardrail outcomes, and model used. - Implement structured logging (JSON format) using Python's `logging` module or `structlog`. - Track the following metrics: - Request rate (req → Capstone Project 1: Build a Production RAG System with Guardrails
5.3 Containerization
Provide a `Dockerfile` for the application. - Provide a `docker-compose.yml` that launches the API, vector database, and any other dependencies. - Include environment variable configuration for all secrets and model paths. → Capstone Project 1: Build a Production RAG System with Guardrails
5.3 Infrastructure
Provide Docker Compose configuration that launches all services: - Application server (FastAPI + agent logic). - Vector database (Qdrant or ChromaDB). - Metadata database (PostgreSQL or SQLite). - (Optional) Local model server (Ollama or vLLM). - Document all environment variables and configuration → Capstone Project 3: End-to-End Multimodal AI Application
5.3 Performance Optimization
Benchmark the serving setup: - Latency: Time to first token (TTFT) and inter-token latency (ITL) at various input lengths (128, 512, 1024, 2048 tokens). - Throughput: Maximum requests per second at different concurrency levels (1, 4, 8, 16, 32 concurrent requests). - Use a load-testing tool (e.g., ` → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.4 Monitoring and Logging
Log every request: timestamp, input length, output length, latency, tokens/second, model parameters (temperature, top_p). - Track GPU utilization and memory in real time. - Implement a simple dashboard or log summary script. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Create a `Dockerfile` for the complete serving stack. - Provide a `docker-compose.yml` if multiple services are involved. - Document GPU passthrough configuration for Docker. - Include a startup script that handles model download/loading. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.5 Monitoring
Implement structured logging for all operations. - Track and report: - Request volume by type. - Latency breakdown by pipeline stage (vision, retrieval, generation). - Error rates by type and severity. - Model usage and cost estimates. - Storage usage (knowledge base size). - User feedback statistic → Capstone Project 3: End-to-End Multimodal AI Application
6. AI Ethics and Policy Specialist
Focus: Navigate the regulatory, ethical, and societal dimensions of AI. - Skills: Law, policy analysis, stakeholder engagement, technical understanding. - Trajectory: Policy analyst → AI policy lead → chief AI ethics officer. → Chapter 40: The Future of AI Engineering
6.1 End-to-End Evaluation
Create a test set of at least 50 questions with ground-truth answers and source documents. - Evaluate with the following metrics: - **Answer quality**: Human evaluation on a 1-5 scale for correctness, completeness, and clarity. Also compute automated metrics (ROUGE-L, BERTScore against reference ans → Capstone Project 1: Build a Production RAG System with Guardrails
6.2 Ablation Study
Measure the impact of each component by disabling it and re-evaluating: - Without query rewriting. - Without re-ranking. - Without output guardrails. - Dense-only vs. hybrid retrieval. - Present results in a comparison table. → Capstone Project 1: Build a Production RAG System with Guardrails
6.2 Human Evaluation
Recruit 2-3 evaluators (can be teammates or volunteers). - Have each evaluator interact with the system on 20 tasks (covering all scenario types). - Collect ratings (1-5) for: - **Correctness**: Is the response factually accurate? - **Relevance**: Does the response address the user's actual question → Capstone Project 3: End-to-End Multimodal AI Application
6.2 Live Demo
Demonstrate the deployed model answering domain-specific questions. - Show side-by-side comparison with the base model. - Demonstrate the monitoring dashboard. - Be prepared to answer questions about design decisions, failure modes, and alternative approaches. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Compare your system's performance to: - A text-only baseline (same system without image understanding). - A no-agent baseline (direct LLM call without tool use). - (If budget allows) A commercial API baseline (GPT-4o or Claude with vision). - Present comparisons in tables and/or charts. → Capstone Project 3: End-to-End Multimodal AI Application
6.3 Documentation
System architecture document with diagrams. - API reference (auto-generated plus any additional notes). - Deployment guide (local development, Docker, and cloud deployment instructions). - Known limitations and future work. → Capstone Project 1: Build a Production RAG System with Guardrails
6.4 Model Card and System Documentation
Produce a model card for the overall system including: - Intended use cases and users. - Input types and limitations. - Known failure modes (with examples). - Bias and fairness considerations. - Environmental impact estimate. - Produce a system documentation package: - Architecture document with dia → Capstone Project 3: End-to-End Multimodal AI Application
6.6 Presentation and Demo
Prepare a 20-minute presentation covering the key aspects of the project. - Include a live demo showing: - A text-only interaction. - An image understanding interaction. - A multi-step task requiring agent capabilities. - A cross-modal search. - A guardrail activation. - Be prepared for Q&A. → Capstone Project 3: End-to-End Multimodal AI Application
**Paper reimplementation**: Implementing a paper from scratch forces understanding that reading alone cannot achieve. Start with older, well-documented papers and work toward recent ones. - **Kaggle competitions**: Provide structured problems with well-defined evaluation, exposing you to practical t → Chapter 40: The Future of AI Engineering
Safety benchmarks: TruthfulQA, HarmBench, BBQ - Capability benchmarks: MMLU, HumanEval, GSM8K (to detect capability regression) - User satisfaction: if in production, track thumbs up/down ratings → Chapter 25: Alignment: RLHF, DPO, and Beyond
AI engineering
is still taking shape. Unlike software engineering, which has had decades to formalize its practices, or data science, which crystallized into a recognized profession in the 2010s, AI engineering sits at a dynamic intersection of mathematics, computer science, systems design, and domain expertise. I → Chapter 1: The Landscape of AI Engineering
AI has evolved through distinct eras
symbolic AI, classical machine learning, deep learning, and the transformer/foundation model era --- each contributing ideas and techniques that remain relevant today. → Chapter 1: The Landscape of AI Engineering
AI is a constellation of subfields
machine learning, NLP, computer vision, robotics, speech processing, and generative AI --- that increasingly intersect and combine. → Chapter 1: The Landscape of AI Engineering
Albumentations
Description: Fast image augmentation library. Provides geometric and photometric transformations for creating augmented training samples. - Usage: `pip install albumentations` - Chapters: 8, 22. → Appendix D: Data Sources and Datasets
Anthropic API
Description: Access to Claude models. Used for generating high-quality synthetic data, especially for complex reasoning tasks. - Chapters: 15, 17, 30. → Appendix D: Data Sources and Datasets
Anthropic HH-RLHF
Description: Human preference data for helpfulness and harmlessness. Contains pairs of model responses with human preference labels. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
Architecture tips:
Use transposed convolutions with kernel size divisible by stride to avoid checkerboard artifacts. Better yet, use nearest-neighbor upsampling followed by a regular convolution. - Avoid max pooling in the discriminator; use strided convolutions instead. - Use spectral normalization in the discriminat → Chapter 17: Generative Adversarial Networks
area
the product of these two features. Creating `lot_area = lot_length * lot_width` gives the model direct access to this relationship without requiring it to learn a multiplicative interaction from additive terms. → Chapter 9: Feature Engineering and Data Pipelines
Atom features (per node):
Atom type: one-hot encoding of C, N, O, S, F, Cl, Br, I, P, Other (10 dims) - Degree: node degree / 4 (1 dim) - Formal charge: charge / 2 (1 dim) - Number of hydrogens: count / 4 (1 dim) - Is aromatic: binary (1 dim) - Is in ring: binary (1 dim) → Case Study 1: Molecular Property Prediction with GNNs
Atrous Spatial Pyramid Pooling (ASPP)
parallel dilated convolutions at multiple rates that capture multi-scale context. This addresses a fundamental challenge in segmentation: objects appear at many different scales, and the network must recognize both a small bird and a large building in the same image. → Chapter 14: Convolutional Neural Networks
Audio Classification:
**mean Average Precision (mAP)**: Standard for multi-label classification (AudioSet) - **Accuracy**: For single-label tasks → Chapter 29: Speech, Audio, and Music AI
audio tokenization
converting continuous audio into discrete tokens that can be modeled with standard language model architectures. This mirrors how BPE tokenization converts continuous text into discrete tokens for language models (as we saw in Chapter 3), creating a universal interface between raw signals and transf → Chapter 29: Speech, Audio, and Music AI
*Breadth without depth*: Following every new trend without mastering any. Depth creates career capital; breadth provides context. - *Tutorial paralysis*: Endlessly following tutorials without building original projects. Tutorials are training wheels; at some point, you must remove them. - *Hype-driv → Chapter 40: The Future of AI Engineering
AWS Open Data
URL: https://registry.opendata.aws - Description: Hundreds of datasets available directly on S3. Includes satellite imagery, genomics data, and Common Crawl. - Chapters: 28. → Appendix D: Data Sources and Datasets
B
Bahdanau/Luong cross-attention:
$\mathbf{Q}$: decoder hidden states - $\mathbf{K}$: encoder hidden states - $\mathbf{V}$: encoder hidden states (K = V in most formulations) → Chapter 18: The Attention Mechanism
Allows higher learning rates without divergence - Acts as a mild regularizer (due to batch noise) - Reduces sensitivity to weight initialization - Smooths the loss landscape → Chapter 12: Training Deep Networks
Benefits:
Improves model calibration (predicted probabilities better reflect true likelihoods). - Reduces overfitting, especially for datasets with noisy labels. - Encourages the model to learn more discriminative features for the penultimate layer. - Almost always helps for classification tasks. → Chapter 13: Regularization and Generalization
Maintain a personal "lab notebook" of experiments: mini-projects testing ideas, ablation studies on your own models, reproductions of published results. - When you learn a new technique, immediately apply it to a problem you care about. Abstract knowledge becomes concrete through application. - Keep → Chapter 40: The Future of AI Engineering
C
C4 (Colossal Clean Crawled Corpus)
Description: A cleaned version of Common Crawl used to train the T5 model. Approximately 750GB of text. - Access: `datasets.load_dataset("c4", "en")` - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Categorical Features:
[ ] Assess cardinality: use one-hot encoding for low-cardinality and target or frequency encoding for high-cardinality features. - [ ] Check for unseen categories in validation/test data and handle gracefully. - [ ] Consider ordinal encoding only when a natural ordering exists. → Chapter 9: Feature Engineering and Data Pipelines
Challenges encountered:
Patient data was stored in multiple incompatible formats across different hospital systems (HL7 v2, FHIR, proprietary CSV exports). - Data quality was inconsistent: missing values, coding errors, and temporal inconsistencies were common. - HIPAA compliance requirements imposed strict constraints on → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Challenges:
**Underflow:** Small gradient values become zero in fp16 (minimum positive value: ~6e-8). - **Overflow:** Large values exceed fp16 range (max: 65504). - **Loss of precision:** Accumulated rounding errors can affect training. → Chapter 35: Distributed Training and Scaling
Homogeneous numerical data -> NumPy array - Tabular data with mixed types -> pandas DataFrame - Sparse data (mostly zeros) -> scipy.sparse - Key-value pairs -> Python dict → Key Takeaways: Python for AI Engineering
CIFAR-10 / CIFAR-100
Description: 60,000 32x32 color images in 10 (or 100) classes. A standard dataset for prototyping and experimentation due to its small size. - License: MIT. - Chapters: 7, 8. - Access: `torchvision.datasets.CIFAR10(root="./data", download=True)` → Appendix D: Data Sources and Datasets
Clustering
partitioning data into meaningful groups 2. **Dimensionality reduction** --- projecting high-dimensional data into lower-dimensional spaces while preserving essential structure 3. **Anomaly detection** --- identifying data points that deviate significantly from the majority → Chapter 7: Unsupervised Learning and Dimensionality Reduction
CNN/DailyMail
Description: 300,000+ article-summary pairs from CNN and DailyMail news articles. The standard benchmark for abstractive summarization. - Access: `datasets.load_dataset("cnn_dailymail", "3.0.0")` - Chapters: 12, 15. → Appendix D: Data Sources and Datasets
COCO (Common Objects in Context)
URL: https://cocodataset.org - Description: 330,000+ images with 80 object categories. Includes annotations for object detection, segmentation, keypoints, and image captioning. - License: CC BY 4.0. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
URL: https://commoncrawl.org - Description: Petabytes of raw web data collected monthly since 2008. The basis for many pre-training datasets. - Size: Petabytes (raw); filtered subsets vary. - License: Open; content licensing varies per page. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Common Crawl Index
Description: Query interface to search Common Crawl archives by URL pattern or domain without downloading the full archive. - Chapters: 28. → Appendix D: Data Sources and Datasets
Common distributions
Bernoulli, Categorical, Gaussian, and others -- are the building blocks for modeling data in AI systems. 3. **Expectation and variance** summarize distributions and appear in loss functions, evaluation metrics, and optimization. 4. **Maximum likelihood estimation** finds parameters that maximize the → Chapter 4: Probability, Statistics, and Information Theory
Common Voice (Mozilla)
URL: https://commonvoice.mozilla.org - Description: The world's largest open multilingual voice dataset, with contributions in 100+ languages. Crowdsourced recordings with validated transcriptions. - Size: 20,000+ hours across all languages. - License: CC-0 (public domain). - Chapters: 24, 25. → Appendix D: Data Sources and Datasets
the idea that intelligence emerges from networks of simple processing units. The backpropagation algorithm, popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986, provided a practical method for training multi-layer neural networks. For the first time, researchers had a genera → Chapter 1: The Landscape of AI Engineering
Considerations:
Do not use label smoothing for knowledge distillation (the soft teacher labels already provide smoothing). - Common values: 0.1 for most tasks, 0.05 for tasks with very clean labels, 0.2 for noisy labels. - Label smoothing was a key component in the original Transformer paper (Chapter 14 will discus → Chapter 13: Regularization and Generalization
**NumPy**: The foundational library for numerical computing in Python. NumPy's n-dimensional arrays and vectorized operations form the bedrock upon which the entire Python AI ecosystem is built. You will use NumPy extensively throughout Chapters 1--10 of this book. - **SciPy**: Scientific computing → Chapter 1: The Landscape of AI Engineering
CTGAN (Conditional Tabular GAN)
Description: GAN-based approach for generating synthetic tabular data that preserves statistical properties of the original dataset. - Usage: `pip install ctgan` - Chapters: 6, 14. → Appendix D: Data Sources and Datasets
D
Data preprocessing:
Normalize images to $[-1, 1]$ (matching Tanh output in the generator) rather than $[0, 1]$. - Use data augmentation carefully: random horizontal flips are generally safe, but aggressive augmentations can confuse the discriminator. - For small datasets, consider differentiable augmentation (DiffAugme → Chapter 17: Generative Adversarial Networks
[ ] Examine data types, missing value patterns, and distributions for every feature. - [ ] Understand the domain context: what do the features represent physically? - [ ] Identify the target variable and check its distribution (balanced vs. imbalanced for classification, skewed vs. symmetric for reg → Chapter 9: Feature Engineering and Data Pipelines
Decoder:
Autoregressive text decoder with learned positional embeddings - Cross-attention to encoder outputs - Predicts tokens from a byte-level BPE vocabulary → Chapter 29: Speech, Audio, and Music AI
Neural network architectures (feedforward, CNN, RNN, transformer) - Training techniques (regularization, optimization, transfer learning) - Framework proficiency (PyTorch) → Chapter 1: The Landscape of AI Engineering
Deep learning frameworks:
**PyTorch**: The most widely used deep learning framework in both research and industry. Its dynamic computation graph and Pythonic design make it the preferred choice for most AI engineers. You will use PyTorch extensively in Chapters 11--22. - **TensorFlow / Keras**: Google's deep learning framewo → Chapter 1: The Landscape of AI Engineering
degradation problem
the surprising observation that simply adding more layers to a network can *decrease* accuracy, not because of overfitting, but because of optimization difficulties. → Chapter 14: Convolutional Neural Networks
Deployment strategy:
**Shadow mode** (Months 12--14): The ML models ran in parallel with the production rule-based system but did not surface predictions to clinicians. This allowed the team to monitor model behavior on live data without risk. - **Limited rollout** (Months 14--16): The hybrid system was deployed to 12 p → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
destroying the learned representations
a phenomenon sometimes called "catastrophic forgetting." The new classification head, initialized randomly, needs larger updates to learn meaningful weights quickly. Using **discriminative learning rates** (smaller for pretrained, larger for new layers) allows the pretrained features to adapt gently → Chapter 14: Quiz
Detection methods:
**Artifact detection**: Looking for subtle inconsistencies (lighting, reflections, lip sync, blinking patterns). First-generation deepfakes had obvious tells (blurring around face edges, inconsistent lighting), but modern generators have largely eliminated these. - **Forensic analysis**: Examining m → Chapter 39: AI Safety, Ethics, and Governance
Visual inspection of generated samples for lack of diversity. - Compute the number of distinct modes in generated data (e.g., using a pretrained classifier). For MNIST, classify 10,000 generated samples and count how many of the 10 digit classes are represented. - Monitor the discriminator's loss: i → Chapter 17: Generative Adversarial Networks
Disadvantages:
Computationally expensive for large datasets (must process all *N* examples before a single parameter update) - Cannot exploit redundancy in the data - More likely to get stuck in sharp local minima → Chapter 3: Calculus, Optimization, and Automatic Differentiation
Do the exercises
at minimum, all Part A (conceptual) and Part B (calculations) problems 5. **Study at least one case study** to see concepts applied to real scenarios 6. **Review key takeaways** before moving to the next chapter → How to Use This Book
the reasoning behind design choices matters as much as the code 5. **Evaluate rigorously** — every project includes evaluation criteria and benchmarks → Part IX: Capstone Projects
Chosen reward: $\hat{r}(y_w)$ should increase - Rejected reward: $\hat{r}(y_l)$ should decrease - Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase, but not explode - KL divergence from reference: should remain bounded (typically < 10 nats) - Accuracy: fraction of pairs correctly ordered → Chapter 25: Alignment: RLHF, DPO, and Beyond
During SFT:
Training loss (should decrease smoothly) - Validation loss (should decrease; divergence from training loss indicates overfitting) - Response quality samples (manual inspection of generated responses) → Chapter 25: Alignment: RLHF, DPO, and Beyond
E
embodied AI
AI systems that interact with the physical world through robotic bodies. The gap between simulated and physical environments (the "sim-to-real gap") is one of the central challenges in robotics. → Chapter 40: The Future of AI Engineering
emergent abilities
capabilities that appear only in models above a certain scale and are not predictable by extrapolating from smaller models. Wei et al. (2022) defined an emergent ability as one that is "not present in smaller models but is present in larger models," with performance transitioning sharply from near-c → Chapter 22: Scaling Laws and Large Language Models
Emerging compute paradigms:
**Optical computing**: Using light for matrix multiplication, potentially achieving orders-of-magnitude energy savings. Companies like Lightmatter and Luminous are developing optical interconnects and compute elements. - **Neuromorphic computing**: Chips inspired by biological neural networks (Intel → Chapter 40: The Future of AI Engineering
Encoder:
Input: 80-channel log-mel spectrogram computed from 30-second audio segments (padded or trimmed) - Two 1D convolution layers with GELU activation for initial feature processing - Sinusoidal positional encoding - Standard Transformer encoder blocks with pre-layer normalization → Chapter 29: Speech, Audio, and Music AI
Exercise 1.1
*Historical Epochs* Identify the four major epochs of AI described in this chapter. For each epoch, name one key achievement and one key limitation that motivated the transition to the next era. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.10
*Model Parameter Counting* A simple neural network has an input layer of size 784, one hidden layer of size 256, and an output layer of size 10. Calculate the total number of trainable parameters (weights and biases) in this network. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.11
*Foundation Model Scaling* GPT-2 (2019) had 1.5 billion parameters. GPT-3 (2020) had 175 billion parameters. Calculate the growth factor. If this growth factor continued for two more generations (released one year apart), estimate the parameter count for each. Discuss whether this growth rate is sus → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.12
*Data Requirements* A supervised classification model needs approximately 1,000 labeled examples per class to achieve acceptable performance. If you are building a document classifier with 50 categories and labeling each document takes an average of 2 minutes, estimate: a) The total number of labele → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.13
*Inference Latency Budget* A movie recommendation system must return results within 200ms of a user request. The system has four stages: feature retrieval (50ms), candidate generation (Xms), ranking model inference (80ms), and post-processing (20ms). What is the maximum time allowed for candidate ge → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.14
*Cost Comparison* A company runs inference for a text classification model. Option A uses a GPU server costing $3.00/hour that can handle 500 requests/second. Option B uses a CPU server costing $0.50/hour that can handle 50 requests/second. If the company processes 1 million requests per day, calcul → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.15
*AI Timeline Extension* Extend the `example-01-ai-timeline.py` code to include at least five additional milestones from the history of AI that were not included in the original timeline. Customize the visualization with different colors for different categories of milestone (theoretical, engineering → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.16
*Parameter Counter* Write a Python function using NumPy that takes a list of layer sizes (e.g., `[784, 256, 128, 10]`) and returns the total number of trainable parameters in a fully connected neural network with those layer dimensions. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.17
*Simple Linear Regression from Scratch* Using only NumPy, implement a simple linear regression model that fits a line $y = wx + b$ to a dataset. Your implementation should: a) Generate synthetic data: $y = 3x + 7 + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$ b) Implement gradient descent to le → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.18
*Ecosystem Audit* Write a Python script that audits your local AI/ML development environment. The script should: a) Check which of the following packages are installed: numpy, scipy, pandas, scikit-learn, matplotlib, torch, tensorflow, transformers b) Report the version of each installed package c) → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.19
*Decision Boundary Visualization* Using NumPy and matplotlib, create a visualization that shows the decision boundaries of three different classifiers on a 2D dataset: a) A simple threshold rule (symbolic/rule-based approach) b) A linear decision boundary (logistic regression-style) c) A non-linear → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.2
*Symbolic vs. Statistical* Explain the fundamental difference between symbolic AI and machine learning in your own words. Give a concrete example of a task where each approach might be preferred, and justify your choices. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.20
*Forward Pass Implementation* Implement a complete forward pass for a three-layer neural network using only NumPy. Your implementation should include: a) Weight initialization (random normal, scaled by layer size) b) ReLU activation for hidden layers c) Softmax activation for the output layer d) A f → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.21
*Stack Trade-offs* Consider a startup building an AI-powered customer service chatbot. They have a team of three engineers and a budget of $5,000/month for cloud infrastructure. Recommend a specific technology at each layer of the AI stack, justifying your choices based on the team size and budget c → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.22
*Build vs. Buy* For each of the following scenarios, argue whether the company should build a custom AI solution or use an off-the-shelf service (e.g., a cloud AI API). Consider cost, performance, data privacy, competitive advantage, and maintenance burden. a) A bank needs a fraud detection system f → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.23
*Failure Mode Analysis* Choose one of the following AI systems and conduct a failure mode analysis. For each failure mode, describe the potential impact and propose a mitigation strategy. a) An AI system that screens job applications b) An AI system that recommends medical treatments c) An AI system → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.24
*Paradigm Comparison* The chapter describes the evolution from symbolic AI to machine learning to deep learning to foundation models. For the task of **language translation**, describe how each paradigm would approach the problem. What are the strengths and weaknesses of each approach? Which approac → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.25
*Ethical Analysis* A healthcare company wants to build an AI system that predicts which patients are at high risk of developing a chronic disease, so that preventive interventions can be offered early. Analyze this scenario from the perspective of: a) **Fairness**: What biases might be present in th → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.26
*Technology Radar* Create a "technology radar" for AI engineering. Categorize the following technologies into one of four rings: Adopt (proven, use now), Trial (worth pursuing, not yet proven at scale), Assess (worth exploring, not yet proven), Hold (proceed with caution): - PyTorch - TensorFlow - J → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.27
*Historical Deep Dive* Choose one of the following historical AI systems and write a 500-word analysis of its approach, achievements, and limitations. Discuss how the techniques it used relate to modern AI engineering practices. a) MYCIN (medical diagnosis expert system) b) Deep Blue (chess-playing → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.28
*Industry Case Study* Select an industry (healthcare, finance, automotive, retail, or another of your choice) and research how AI engineering practices are applied in that sector. Your analysis should cover: a) Three specific AI applications currently deployed in the industry b) The key technical ch → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.29
*Open Source Investigation* Explore the GitHub repositories of two major AI/ML frameworks (e.g., PyTorch, scikit-learn, Hugging Face Transformers, LangChain). For each repository: a) How many contributors does it have? b) How frequently is it updated? c) What is its license? d) What does the project → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.3
*The Three Ingredients* The deep learning revolution required three ingredients: data, compute, and algorithms. Rank these three in order of importance for the *initial* breakthrough (circa 2012), and separately for the *current* era of foundation models. Justify your ranking in each case. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.30
*Future Forecast* Based on current trends in AI research and engineering, write a thoughtful forecast of where AI engineering will be in five years. Address: a) Which current techniques will remain dominant? b) What new capabilities might emerge? c) How will the role of the AI engineer change? d) Wh → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.4
*Foundation Model Paradigm Shift* Describe the paradigm shift from traditional ML to the foundation model approach. What are two advantages and two disadvantages of the foundation model paradigm from an AI engineer's perspective? → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.5
*Subfield Identification* For each of the following systems, identify which subfields of AI are involved and explain the role of each: a) A self-driving car navigating a city b) A chatbot that answers customer support questions c) A system that generates music in the style of a given artist d) A rob → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.6
*AI Stack Layers* Draw (or describe in text) the layers of the modern AI stack. For a company building a fraud detection system for credit card transactions, give a specific example of a tool or technology at each layer. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.7
*Role Distinctions* A startup has a team of five people building an AI-powered medical imaging product. Describe how the responsibilities of an AI engineer, a data scientist, and a software engineer on this team would differ and overlap. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.8
*AI Winters* What factors have historically caused "AI winters"? Do you think the current era of AI is susceptible to another AI winter? Provide at least three arguments for and three arguments against this possibility. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.9
*Speedup Estimation* A training job takes 48 hours on a single CPU. Using the rule-of-thumb speedup factor of 10--100x for GPUs, estimate the range of training times on a single GPU. If you need the model trained within 2 hours, what is the minimum speedup factor required, and what might you do to a → Chapter 1 Exercises: The Landscape of AI Engineering
F
Faker
Description: Python library for generating realistic fake data (names, addresses, text, dates, etc.). Useful for generating test fixtures and tabular data. - Usage: `pip install faker` - Chapters: 28, 31. → Appendix D: Data Sources and Datasets
few-shot multimodal learning
learning new tasks from just a few interleaved image-text examples in the context window. → Chapter 28: Key Takeaways
Findings:
The diagonal has the highest values (each position is most similar to itself). - Off-diagonal values decay smoothly with distance. - The pattern is approximately shift-invariant: `similarity[i, j]` depends mainly on `|i - j|`. → Case Study 2: Analyzing Transformer Components
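A minimal NumPy sketch that reproduces these findings, assuming the standard sinusoidal positional encoding (Vaswani et al., 2017); the sequence length and model dimension below are arbitrary choices:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # Standard sinusoidal positional encodings.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(64, 128)
# Cosine similarity between all pairs of positions.
norms = np.linalg.norm(pe, axis=1, keepdims=True)
similarity = (pe @ pe.T) / (norms @ norms.T)
# Diagonal is 1.0; off-diagonal values decay roughly with |i - j|.
print(similarity[0, :5])
```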
FineWeb / FineWeb-Edu
Description: High-quality filtered web text from HuggingFace, with an educational content subset particularly useful for training smaller models. - License: ODC-By 1.0. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
flat minima
regions where the loss is low across a wide neighborhood of parameter values. - **Large batches** provide more accurate gradient estimates, which tend to converge to **sharp minima**---narrow valleys in the loss landscape where small perturbations in parameters cause large increases in loss. → Chapter 13: Regularization and Generalization
For efficiency:
If we could identify winning tickets before training, we could train much smaller networks from the start, saving computation. - This has motivated extensive research in **neural network pruning** and **sparse training**. → Chapter 13: Regularization and Generalization
For regularization:
The lottery ticket hypothesis suggests that much of the network's capacity is redundant. Regularization works partly by suppressing these redundant pathways. - Dropout can be seen as a stochastic way to search for winning tickets---it forces the network to find robust subnetworks. → Chapter 13: Regularization and Generalization
For understanding generalization:
The hypothesis suggests that what matters is not the total number of parameters but the structure of the connections. This partly explains why overparameterized networks generalize well---they provide a rich search space for finding good subnetworks. → Chapter 13: Regularization and Generalization
foundation models
large models pre-trained on broad data that can be adapted to a wide range of downstream tasks. This paradigm shift fundamentally altered the economics and practice of AI engineering: → Chapter 1: The Landscape of AI Engineering
motivates much of the research discussed later in this chapter, including the double descent phenomenon (Section 13.9) and implicit regularization (Section 13.12). → Chapter 13: Regularization and Generalization
GigaSpeech
Description: 10,000 hours of English audio from audiobooks, podcasts, and YouTube. Designed as a large-scale ASR training corpus. - License: Apache 2.0. - Chapters: 24. → Appendix D: Data Sources and Datasets
GLUE (General Language Understanding Evaluation)
URL: https://gluebenchmark.com - Description: A collection of nine sentence- and sentence-pair-level NLU tasks including sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (MRPC, QQP), and linguistic acceptability (CoLA). - Size: Varies by task; SST-2 has approximately → Appendix D: Data Sources and Datasets
Google BigQuery Public Datasets
Description: Petabytes of publicly available datasets queryable via SQL. Includes GitHub activity data, Wikipedia, US Census data, and more. - Pricing: 1TB free queries per month. - Chapters: 28, 38. → Appendix D: Data Sources and Datasets
GPU training
the network was split across two GPUs due to memory constraints - **Local Response Normalization** (LRN) -- later superseded by batch normalization → Chapter 14: Convolutional Neural Networks
hard attention
the model is selecting a single input position. Soft attention (where weights are distributed) computes an interpolation between multiple encoder states, enabling smoother gradient flow. → Quiz: The Attention Mechanism
hard labels
one-hot encoded targets where the correct class has probability 1 and all others have probability 0. This forces the model to predict increasingly extreme logits to minimize cross-entropy loss, which has two problems: → Chapter 13: Regularization and Generalization
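The usual remedy in this setting is **label smoothing**, which mixes the one-hot target with a uniform distribution so the model is never pushed toward infinite logits. A minimal PyTorch sketch (the smoothing value 0.1 is a common but illustrative choice):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)             # batch of 8 examples, 10 classes
targets = torch.randint(0, 10, (8,))    # hard integer labels

hard = nn.CrossEntropyLoss()(logits, targets)
# label_smoothing=0.1 mixes the one-hot target with a uniform
# distribution: (1 - 0.1) * one_hot + 0.1 / num_classes.
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
print(hard.item(), smooth.item())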
HuggingFace Datasets Hub
URL: https://huggingface.co/datasets - Description: Centralized repository hosting 100,000+ datasets with a unified Python API. Supports streaming for large datasets. - Access: `datasets.load_dataset("dataset_name")` - Chapters: Used throughout the book. → Appendix D: Data Sources and Datasets
HuggingFace ecosystem
a collection of open-source libraries that has become the standard toolkit for working with pre-trained Transformer models. Throughout the remainder of this book, we will use these libraries alongside PyTorch. → Chapter 20: Pre-training and Transfer Learning for NLP
Identify a domain problem (healthcare, climate, education, materials science) and build an end-to-end solution using the techniques from this book. - Talk to domain experts. The most impactful AI applications come from deep understanding of the problem, not just deep understanding of the models. → Chapter 40: The Future of AI Engineering
If you are drawn to engineering:
Deploy a model to production---even a small personal project---and handle the full lifecycle: data, training, serving, monitoring. - Contribute to a major open-source ML project (PyTorch, Hugging Face, vLLM). - Build an evaluation suite for a domain you care about. → Chapter 40: The Future of AI Engineering
If you are drawn to research:
Read the top 10 most-cited papers from the last NeurIPS or ICML. Reimplement at least one. - Pick an open problem from a recent survey paper and attempt a small contribution. - Join a research reading group or start one at your organization. → Chapter 40: The Future of AI Engineering
If you are drawn to safety and governance:
Study the mechanistic interpretability research from Anthropic and DeepMind in depth. - Participate in an AI safety research program (MATS, SERI, Redwood Research). - Read the EU AI Act and build a compliance checklist for a hypothetical high-risk system. → Chapter 40: The Future of AI Engineering
ImageNet (ILSVRC)
URL: https://www.image-net.org - Description: 1.28 million training images across 1,000 classes. The foundational benchmark for image classification. ImageNet-21k has approximately 14 million images in 21,841 classes. - License: Research use only (requires registration). - Chapters: 8, 22. → Appendix D: Data Sources and Datasets
Always use the same Inception checkpoint and preprocessing for all comparisons. - FID is sensitive to the number of samples: fewer samples lead to higher variance. Report the number of samples used. - FID computed on different random seeds will vary. Compute FID multiple times and report the mean an → Chapter 17: Generative Adversarial Networks
In this chapter, you will learn to:
Manipulate probabilities using the sum rule, product rule, and Bayes' theorem - Work with discrete and continuous distributions in NumPy - Estimate parameters from data using MLE and MAP - Compute and interpret entropy, cross-entropy, and KL divergence - Apply mutual information to measure statistic → Chapter 4: Probability, Statistics, and Information Theory
Inference
the process of generating predictions from a trained model—happens millions or billions of times. For many organizations, inference costs dwarf training costs within months of deployment. → Chapter 33: Inference Optimization and Model Serving
K
Kaggle API
Description: Programmatic access to download competition data, public datasets, and submit predictions. - Usage: `pip install kaggle && kaggle datasets download -d <owner/dataset-name>` → Appendix D: Data Sources and Datasets
Kaggle Datasets
URL: https://www.kaggle.com/datasets - Description: Community-contributed datasets spanning virtually every domain. Kaggle also hosts competitions with curated datasets and leaderboards. - License: Varies per dataset. - Chapters: 3, 4, 5, 6, 38. - Access: `kaggle datasets download -d <owner/dataset-name>` → Appendix D: Data Sources and Datasets
Key augmentation choices for medical imaging:
**Conservative color jitter:** Color information (redness, pigmentation) is diagnostically important, so hue shifts are kept small. - **Rotation up to 90 degrees:** Dermoscopic images can be captured at any orientation. - **Random erasing:** Simulates partial occlusion by hair, bubbles, or artifacts → Case Study 1: Preventing Overfitting in a Medical Imaging Model
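A sketch of this policy using torchvision transforms; the exact magnitudes below are illustrative assumptions, not the case study's actual values:

```python
import torchvision.transforms as T

train_tfms = T.Compose([
    T.RandomRotation(degrees=90),             # any capture orientation
    T.ColorJitter(brightness=0.1, hue=0.02),  # kept small: color is diagnostic
    T.ToTensor(),
    T.RandomErasing(p=0.25),                  # simulate hair/bubble occlusion
])
```

Note that `RandomErasing` operates on tensors, so it must come after `ToTensor`.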
Key challenges encountered:
**Class imbalance**: Fraudulent transactions represented only 0.12% of all transactions. The team used SMOTE oversampling and cost-sensitive learning to address this. - **Feature engineering at scale**: Computing real-time features (e.g., "number of transactions in the last hour") required a streami → Case Study 2: Building an AI Engineering Team from Scratch
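A minimal sketch of SMOTE oversampling with the `imbalanced-learn` library (the synthetic data and 2% imbalance here are illustrative, not the case study's 0.12%):

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1                                  # 2% minority class

# SMOTE synthesizes new minority examples by interpolating neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_res.mean())               # 0.02 before, 0.5 after
```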
Key engineering decisions:
The team used scikit-learn for feature engineering and classical ML, and PyTorch for the deep learning models. - All experiments were tracked in MLflow, enabling reproducibility and systematic comparison. - Feature engineering was a critical step: temporal features (rate of change in vital signs), i → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
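A minimal MLflow tracking sketch in the spirit of this workflow; the run, parameter, and metric names are hypothetical:

```python
import mlflow

with mlflow.start_run(run_name="gbm-baseline"):
    mlflow.log_param("model", "xgboost")
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_auc", 0.91)   # logged once training finishes
```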
Key observations:
RoBERTa achieves the highest accuracy, consistent with its improved training recipe. - DistilBERT is only 0.7 percentage points behind BERT with 40% fewer parameters. - ALBERT has the lowest accuracy, likely because its shared parameters limit capacity despite having 12 layers. - T5-Small performs c → Case Study 2: Comparing Pre-trained Models on a Custom NLP Task
Always call `model.train()` before training and `model.eval()` before evaluation. This is the most common dropout-related bug (as we noted in Chapter 12 regarding batch normalization). - Dropout interacts with batch normalization. The conventional wisdom is to **not** use dropout in the same block a → Chapter 13: Regularization and Generalization
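A quick PyTorch demonstration of the mode switch: in `train()` mode dropout makes repeated forward passes stochastic, while `eval()` mode is deterministic.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(4, 16)

model.train()                            # dropout active
print(torch.equal(model(x), model(x)))   # False (almost surely)
model.eval()                             # dropout disabled
print(torch.equal(model(x), model(x)))   # True
```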
L
LAION-5B
Description: 5.85 billion image-text pairs scraped from the web. Used to train open-source vision-language models like Stable Diffusion. - License: CC BY 4.0 (metadata); images are linked, not redistributed. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
latent representation
that captures the essential factors of variation in the data. This simple idea, when extended with probabilistic reasoning (Variational Autoencoders), regularization (sparse and denoising variants), and modern training paradigms (contrastive and self-supervised learning), has become one of the pilla → Chapter 16: Autoencoders and Representation Learning
learning agility
the ability to quickly acquire new knowledge and skills as the landscape shifts. This is not an innate talent but a trainable capability: → Chapter 40: The Future of AI Engineering
LibriSpeech
URL: https://www.openslr.org/12 - Description: 1,000 hours of read English speech from audiobooks. The standard benchmark for automatic speech recognition (ASR). - License: CC BY 4.0. - Chapters: 24. - Access: `datasets.load_dataset("librispeech_asr", "clean")` → Appendix D: Data Sources and Datasets
Limitations of PCA:
Captures only linear relationships; nonlinear structure requires t-SNE, UMAP, or autoencoders - Assumes variance equals importance (not always true---a low-variance feature could be the most predictive) - Components can be difficult to interpret, especially when many features contribute to each comp → Chapter 7: Unsupervised Learning and Dimensionality Reduction
Limitations:
Behavior differs between training and evaluation modes (source of subtle bugs) - Performance degrades with very small batch sizes (batch statistics become noisy) - Not ideal for sequence models where batch statistics mix different sequence lengths → Chapter 12: Training Deep Networks
LLM-Based Generation
Description: Using large language models to generate training data, synthetic conversations, and evaluation sets. This is now the dominant approach for creating instruction-following datasets. - Key technique: Provide few-shot examples in a prompt, then sample diverse completions with temperature > → Appendix D: Data Sources and Datasets
LOF vs. Isolation Forest:
LOF is density-based and excels at detecting **local anomalies**---points that are anomalous relative to their neighborhood, even if they are in a globally dense region. - Isolation Forest is better for **global anomalies** and scales better to high dimensions. - LOF requires computing k-nearest nei → Chapter 7: Unsupervised Learning and Dimensionality Reduction
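A small scikit-learn sketch contrasting the two detectors on synthetic data with one locally anomalous point; the exact flags depend on the random data and hyperparameters:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # broad, dense cluster
               rng.normal(6, 0.2, (50, 2)),   # tight cluster
               [[6.0, 3.0]]])                 # anomaly near the tight cluster

lof = LocalOutlierFactor(n_neighbors=20).fit_predict(X)   # -1 = outlier
iso = IsolationForest(random_state=0).fit_predict(X)      # -1 = outlier
print("LOF flags last point:", lof[-1] == -1)
print("IsolationForest flags last point:", iso[-1] == -1)
```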
Classical algorithms (regression, trees, SVMs, clustering) - Feature engineering - Model evaluation and selection - Experiment design → Chapter 1: The Landscape of AI Engineering
Machine learning frameworks:
**scikit-learn**: The standard library for classical ML algorithms --- decision trees, SVMs, k-means clustering, and more. It provides a consistent API for training, evaluating, and deploying models. - **XGBoost / LightGBM**: High-performance gradient boosting libraries that dominate tabular data co → Chapter 1: The Landscape of AI Engineering
margin
the distance between the decision boundary and the nearest data points from each class. This geometric perspective leads to a model with strong theoretical guarantees and excellent practical performance. → Chapter 6: Supervised Learning: Regression and Classification
marginalization
the process of summing (or integrating) over variables we do not observe. In AI, marginalization appears whenever we compute predictions that account for all possible values of latent variables, as in mixture models and variational autoencoders (Chapter 16). → Chapter 4: Probability, Statistics, and Information Theory
Match the tokenizer to the model
never mix tokenizers from different pre-trained models. 3. **Domain-specific pre-training** (further pre-training on domain text before fine-tuning) can significantly improve results for specialized domains. 4. **Gradient accumulation** enables effective large batch sizes on limited GPU memory. 5. * → Chapter 20 Key Takeaways
Mathematical foundations (Chapters 2--6):
Linear algebra (vectors, matrices, decompositions) - Calculus (gradients, optimization) - Probability and statistics (distributions, estimation, hypothesis testing) - Optimization theory (gradient descent, convex optimization) → Chapter 1: The Landscape of AI Engineering
MMLU (Massive Multitask Language Understanding)
URL: https://github.com/hendrycks/test - Description: 57 multiple-choice tasks spanning STEM, humanities, social sciences, and more. A standard benchmark for evaluating LLM knowledge breadth. - Size: Approximately 15,000 test questions. - License: MIT. - Chapters: 15, 16, 31. → Appendix D: Data Sources and Datasets
Model Performance Metrics:
Prediction accuracy (when ground truth is available) - Prediction confidence distributions - Prediction latency (p50, p95, p99) - Throughput (requests per second) - Error rates → Chapter 34: MLOps and LLMOps
Monitoring infrastructure:
Real-time model performance tracking (sensitivity, specificity, alert rates). - Data drift detection comparing incoming patient data distributions to training data. - Alert fatigue monitoring (tracking how often clinicians acknowledged vs. dismissed alerts). - Automated retraining pipeline triggered → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Monitoring:
Save generated samples every few hundred iterations for visual inspection. - Track the discriminator's accuracy on real and fake data separately. If it reaches 100% on both, the discriminator is overpowering the generator---a common failure mode, not a good state. If it hovers near 50% on both, the generator is winning (the discriminator is at chance). If it oscillates wildly, training is → Chapter 17: Generative Adversarial Networks
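A sketch of the separate-accuracy tracking suggested above, assuming a discriminator that returns one raw logit per example (the stand-in modules and data are illustrative):

```python
import torch
import torch.nn as nn

def discriminator_accuracy(D, real, fake):
    # Fraction of real samples scored as real, and of fake samples
    # scored as fake; assumes D outputs one raw logit per example.
    with torch.no_grad():
        acc_real = (torch.sigmoid(D(real)) > 0.5).float().mean().item()
        acc_fake = (torch.sigmoid(D(fake)) < 0.5).float().mean().item()
    return acc_real, acc_fake

D = nn.Linear(784, 1)   # stand-in discriminator
print(discriminator_accuracy(D, torch.randn(64, 784), torch.randn(64, 784)))
```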
MusicCaps
Description: 5,521 music clips with text descriptions, created by Google for music understanding and generation tasks. - Chapters: 25. → Appendix D: Data Sources and Datasets
N
Natural Questions (NQ)
URL: https://ai.google.com/research/NaturalQuestions - Description: Real questions from Google Search paired with Wikipedia articles containing the answers. Includes both short and long answer annotations. - Size: 307,000+ training examples. - License: Apache 2.0. - Chapters: 20, 26. → Appendix D: Data Sources and Datasets
No probabilistic interpretation
attention weights would not form a distribution over positions. 4. **Gradient flow** would be altered --- softmax provides a natural gradient that encourages competition between positions. → Quiz: The Attention Mechanism
Numeric Features:
[ ] Check for outliers and decide on a strategy (removal, capping, robust scaling). - [ ] Apply appropriate scaling (standard, min-max, or robust) based on the model type. - [ ] Consider log or power transforms for skewed distributions. - [ ] Create domain-relevant interaction and ratio features. → Chapter 9: Feature Engineering and Data Pipelines
O
Observations from PCA:
Two components capture only about 22% of total variance, meaning the 2D PCA plot is a very lossy representation. - Some digit classes partially overlap. For example, digits 3, 5, and 8 tend to occupy similar regions. - The global layout is informative: digit 0 is far from digit 1, which makes intuit → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Observations from the perplexity study:
**Perplexity 5**: Very tight, fragmented clusters. Some digits split into sub-clusters (e.g., different writing styles of "7"). Noise dominates the layout. - **Perplexity 15**: Clusters begin to consolidate. Most digits form coherent groups, but some remain fragmented. - **Perplexity 30** (default): → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Observations from the UMAP parameter study:
**n_neighbors=5**: Very local focus. Clusters are tight and separated by large gaps. Sub-structure within clusters is visible. - **n_neighbors=15** (default): Good balance. Clusters are well-separated, and the global layout is meaningful---visually similar digits (e.g., 3, 5, 8) are placed near each → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Open Assistant (OASST)
Description: 160,000+ human-annotated conversations for training assistant-style models. Includes ranking information for RLHF. - License: Apache 2.0. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
OpenAI API
Description: Access to GPT-4 and related models. Commonly used for generating synthetic training data, labels, and evaluations. - Chapters: 15, 17, 30. → Appendix D: Data Sources and Datasets
OpenML
URL: https://www.openml.org - Description: A platform for sharing machine learning datasets, tasks, and experiments. Provides standardized interfaces for benchmarking. - License: Varies per dataset. - Chapters: 3, 6. → Appendix D: Data Sources and Datasets
Optimization tips:
Use Adam with $\beta_1 = 0.0$ and $\beta_2 = 0.9$ for WGAN-GP (note: $\beta_1 = 0$, not the default 0.9). - Learning rates between $10^{-4}$ and $2 \times 10^{-4}$ work well for most architectures. - If training diverges, reduce the learning rate rather than adding regularization. - Save checkpoints → Chapter 17: Generative Adversarial Networks
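In PyTorch this is a one-line change when constructing the optimizer; the modules below are stand-ins for a real generator and critic:

```python
import torch
import torch.nn as nn

generator = nn.Linear(100, 784)   # stand-in modules
critic = nn.Linear(784, 1)

# WGAN-GP settings from the tips above: beta1 = 0.0, beta2 = 0.9.
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_C = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))
```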
overconfident
they assign high probabilities to their predictions more often than is warranted. **Temperature scaling** (dividing logits by a learned temperature $T > 1$ before softmax) is the simplest and most effective post-hoc calibration method. → Chapter 4: Probability, Statistics, and Information Theory
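A minimal sketch of temperature scaling; in practice $T$ is fit on a held-out validation set by minimizing negative log-likelihood, which is omitted here:

```python
import torch

def calibrated_probs(logits, T):
    # Divide logits by T > 1 to soften overconfident predictions,
    # then apply softmax as usual.
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([[4.0, 1.0, 0.5]])
print(calibrated_probs(logits, T=1.0))   # original (overconfident) probabilities
print(calibrated_probs(logits, T=2.0))   # softer, better-calibrated distribution
```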
[ ] Wrap all preprocessing in a scikit-learn Pipeline. - [ ] Verify no data leakage by checking that all `fit` calls use only training data. - [ ] Use cross-validation to evaluate the impact of each feature engineering decision. - [ ] Apply feature selection to remove noise and redundancy. → Chapter 9: Feature Engineering and Data Pipelines
Point Classification:
**Core point**: Has at least `min_samples` neighbors within $\varepsilon$ - **Border point**: Within $\varepsilon$ of a core point but does not have `min_samples` neighbors itself - **Noise point**: Neither core nor border---these are the outliers → Chapter 7: Unsupervised Learning and Dimensionality Reduction
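In scikit-learn these categories fall out of `DBSCAN.fit_predict`: core and border points receive cluster labels 0, 1, ..., and noise points are labeled -1. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (10, 2))])   # scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())
```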
posterior
our updated belief about hypothesis $H$ after observing data $D$ - $P(D \mid H)$ is the **likelihood** -- the probability of observing the data under the hypothesis - $P(H)$ is the **prior** -- our belief about $H$ before seeing data - $P(D)$ is the **evidence** (or marginal likelihood) -- a normali → Chapter 4: Probability, Statistics, and Information Theory
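Assembled into the formula these four terms come from:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$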
including deduplication, quality filtering, and domain balance --- is as important as model architecture for achieving strong performance. - **Parameter-efficient fine-tuning** methods (LoRA, adapters, prefix tuning) enable adapting large models to new tasks while updating less than 1% of parameters → Chapter 20: Pre-training and Transfer Learning for NLP
Python (the lingua franca of AI) - Software design patterns and best practices - Version control (Git) - Testing and debugging - API design → Chapter 1: The Landscape of AI Engineering
Properties:
Output range: (0, 1), useful for probabilities - Smooth and differentiable everywhere - Saturates for large |*z*|, causing the *vanishing gradient problem* - Output is not zero-centered, which can slow convergence → Chapter 11: Neural Networks from Scratch
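For reference, the function and its derivative; the bound $\sigma'(z) \le 1/4$, with $\sigma'(z) \to 0$ as $|z|$ grows, is exactly the saturation behind the vanishing gradient problem:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}$$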
PyTorch
The deep learning framework whose eager execution model and Pythonic design philosophy make it the ideal teaching tool - **HuggingFace** — For democratizing access to transformer models through the Transformers, Tokenizers, Datasets, and PEFT libraries - **NumPy and SciPy** — The bedrock of scientif → Acknowledgments
Q
Quarter 1: Deep Learning and Transformers
Moved from scikit-learn to PyTorch for daily work - Studied transformer architecture, attention mechanisms, scaling laws - Fine-tuned a model for an internal NLP task (replacing an older sklearn pipeline) - Milestone: Delivered 8% accuracy improvement using transformer model → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 1: Deep Learning Foundations
Completed fast.ai course and PyTorch tutorials - Built 3 projects: image classifier, text classifier, fine-tuned BERT - Studied transformer architecture in depth (attention, positional encoding) - Milestone: Reproduced a simplified GPT-2 training run → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 1: ML Fundamentals Depth
Completed Stanford CS229 (online) for mathematical foundations - Studied PyTorch deeply: custom datasets, training loops, distributed training - Leveraged existing distributed systems knowledge to understand DDP and FSDP - Milestone: Trained a model on 4 GPUs using PyTorch DDP → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: LLMs and Applications
Studied LLM capabilities, prompt engineering, RAG architecture - Led a cross-functional project to evaluate LLM integration opportunities - Built proof-of-concept for three internal use cases - Milestone: Secured executive buy-in for LLM pilot project → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: ML Infrastructure
Studied MLOps: experiment tracking (W&B), model serving (TorchServe, TGI) - Learned about GPU profiling, memory optimization, quantization - Built an internal tool for experiment comparison at current company - Milestone: Reduced model serving latency by 40% at work → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: Production ML
Learned Docker, Kubernetes basics, CI/CD - Built an end-to-end ML pipeline: data processing → training → API serving - Contributed to an open-source ML project (documentation + bug fix) - Milestone: Deployed a model behind a FastAPI endpoint → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: AI Engineering Leadership
Studied AI safety, responsible AI, evaluation frameworks - Developed team evaluation standards for LLM-powered features - Mentored two junior engineers in deep learning - Milestone: Hired and onboarded two AI engineers for the pilot team → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: LLM Engineering
Studied RAG, prompt engineering, fine-tuning with LoRA - Built a RAG chatbot for a personal project (cooking recipes) - Learned evaluation methodologies (RAGAS, human evaluation) - Milestone: Published a blog post about RAG evaluation → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: LLM Infrastructure
Studied inference optimization: KV caching, speculative decoding, vLLM - Learned about distributed inference for large models - Contributed to an open-source inference framework - Milestone: Deployed a production LLM serving pipeline → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Job Search + Advanced Topics
Studied system design for ML (Chapter 33) - Practiced ML system design interviews - Explored reinforcement learning basics - Milestone: Received and accepted a junior ML engineer offer → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Specialization + Transition
Deep dive into distributed training (FSDP, DeepSpeed) - Built a training framework prototype for their team - Published an internal tech talk on efficient LLM serving - Milestone: Transitioned to ML infrastructure engineer role (internal transfer) → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Strategy and Scale
Developed AI engineering roadmap for the product org - Established MLOps practices: experiment tracking, monitoring, A/B testing - Presented AI strategy to C-suite - Milestone: Promoted to AI Engineering Lead → Case Study 2: Building a Career Roadmap in AI Engineering
R
Recipe 2: Training a Transformer for NLP
Optimizer: AdamW with betas (0.9, 0.98), weight decay 0.01 - Learning rate: Peak 5e-4, linear warmup for 4,000 steps, then cosine decay - Batch size: Effective 256--2048 (with gradient accumulation) - Gradient clipping: max_norm=1.0 - Dropout: 0.1 on attention and feed-forward layers - Mixed precisi → Chapter 12: Training Deep Networks
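A PyTorch sketch of this recipe's optimizer and schedule; the model is a stand-in and the total step count is an assumed value:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)   # stand-in for a Transformer
opt = torch.optim.AdamW(model.parameters(), lr=5e-4,
                        betas=(0.9, 0.98), weight_decay=0.01)

warmup_steps, total_steps = 4_000, 100_000
def lr_lambda(step):
    if step < warmup_steps:                      # linear warmup to peak LR
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = LambdaLR(opt, lr_lambda)
# In the training loop: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0);
# opt.step(); scheduler.step(); opt.zero_grad()
```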
Recipe 3: Fine-Tuning a Pretrained Model
Optimizer: AdamW with weight decay 0.01 - Learning rate: 1e-5 to 5e-5 for pretrained layers, 10x higher for new head - Warmup: 100--500 steps - Epochs: 3--10 (much less than training from scratch) - Gradient clipping: max_norm=1.0 - Use parameter groups to set different learning rates for backbone v → Chapter 12: Training Deep Networks
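A sketch of the parameter-group idiom from the last point, with illustrative stand-in modules and learning rates:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(768, 768)   # stand-in for the pretrained layers
head = nn.Linear(768, 2)         # newly initialized task head

# Pretrained weights get a small LR; the new head gets ~10x more.
opt = torch.optim.AdamW(
    [{"params": backbone.parameters(), "lr": 2e-5},
     {"params": head.parameters(), "lr": 2e-4}],
    weight_decay=0.01,
)
```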
RedPajama
Description: An open reproduction of the LLaMA training data, consisting of approximately 1.2 trillion tokens from Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and StackExchange. - License: Apache 2.0. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
relative position bias
a learnable bias term added to the attention logits based on the relative spatial distance between tokens, parameterized by a table of size $(2M-1) \times (2M-1)$. This differs from ViT's **absolute position embeddings**, which are learnable vectors added to each patch embedding based on its absolut → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
Reporting accuracy on imbalanced data
Use F1, AUC-PR, or recall at fixed precision instead. 2. **Fitting the scaler on all data before splitting** -- Fit on training data only. 3. **Using standard k-fold for time series** -- Use TimeSeriesSplit instead. 4. **Reporting a single number without uncertainty** -- Always include standard devi → Chapter 8: Key Takeaways
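For pitfall 3, `TimeSeriesSplit` keeps every training fold strictly earlier in time than its test fold; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training indices always precede test indices,
    # so no future information leaks into training.
    print("train:", train_idx, "test:", test_idx)
```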
rule-based architecture
the direct descendant of the expert systems described in Section 1.1.1 of this chapter. Over 4,500 hand-crafted if-then rules, developed in collaboration with emergency medicine physicians, encode clinical knowledge about symptom patterns, risk factors, vital sign thresholds, and diagnostic protocol → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Rules of thumb:
Start with K-means for a quick baseline - Use DBSCAN when you expect non-convex clusters or significant noise - Use GMMs when you need probabilistic assignments or ellipsoidal clusters - Use hierarchical clustering when you want to explore multiple granularity levels - Use spectral clustering when c → Chapter 7: Unsupervised Learning and Dimensionality Reduction
S
scale
more parameters, more data, and more compute. The architecture itself changed remarkably little. As we will explore in Chapter 22, the scaling laws that govern this relationship are among the most important empirical findings in modern AI. → Chapter 21: Decoder-Only Models and Autoregressive Language Models
Scoring Guide:
★ Foundational (5-10 min each) - ★★ Intermediate (10-20 min each) - ★★★ Challenging (20-40 min each) - ★★★★ Advanced/Research (40+ min each) → Exercises: Python for AI Engineering
Scrapy
Description: Python framework for building web scrapers. Useful for collecting domain-specific text data. - Chapters: 28. → Appendix D: Data Sources and Datasets
SDV (Synthetic Data Vault)
Description: Comprehensive library for generating synthetic relational, tabular, and time-series data using statistical and deep learning models. - Usage: `pip install sdv` - Chapters: 6, 28. → Appendix D: Data Sources and Datasets
self-attention
$\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: all derived from the *same* input sequence, typically through learned linear projections: → Chapter 18: The Attention Mechanism
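In matrix form, with $\mathbf{X}$ the matrix of input token representations (the projection names follow the usual convention):

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$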
Self-contained examples
each code file can be run independently - **Progressive framework usage:** NumPy (Ch 1–10) → PyTorch (Ch 11+) → HuggingFace (Ch 20+) → How to Use This Book
Self-Instruct / Evol-Instruct
Description: Methods for using a strong LLM to generate instruction-response pairs, optionally evolving them to increase complexity. Used to create datasets like Alpaca and WizardLM. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
sequence
it has no notion of order. Positional encodings inject position information into the token representations, breaking this permutation symmetry and allowing the model to distinguish between different orderings of the same tokens. → Quiz: The Attention Mechanism
Setup:
Input: **x** (shape: n_0 x 1) - Hidden layer: **W**^[1] (shape: n_1 x n_0), **b**^[1] (shape: n_1 x 1) - Output layer: **W**^[2] (shape: 1 x n_1), *b*^[2] (scalar) → Chapter 11: Neural Networks from Scratch
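A NumPy forward pass matching these shapes; the tanh hidden activation and the initialization scale are illustrative choices:

```python
import numpy as np

n_0, n_1 = 4, 8
x = np.random.randn(n_0, 1)             # input, shape (n_0, 1)
W1 = np.random.randn(n_1, n_0) * 0.1    # (n_1, n_0)
b1 = np.zeros((n_1, 1))                 # (n_1, 1)
W2 = np.random.randn(1, n_1) * 0.1      # (1, n_1)
b2 = 0.0                                # scalar

a1 = np.tanh(W1 @ x + b1)               # hidden activation, (n_1, 1)
y_hat = W2 @ a1 + b2                    # output, (1, 1)
print(a1.shape, y_hat.shape)
```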
Skills Applied:
NumPy vectorized computation for feature engineering - pandas DataFrame operations for data cleaning and aggregation - matplotlib/seaborn visualization for exploratory data analysis - Method chaining for readable transformation pipelines - Code organization with type hints and docstrings → Case Study: Building an End-to-End Data Analysis Pipeline
Built an ETL pipeline using Apache Spark to normalize data from multiple hospital systems into a common FHIR-based format. - Implemented a data quality monitoring system that flagged anomalies and missing data patterns. - Established a de-identification pipeline that removed protected health informa → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Spatial reduction attention
it reduces the spatial resolution of keys and values by a factor $R$ before computing attention, changing the complexity from $O(N^2)$ to $O(N^2/R)$. (2) **Hierarchical architecture** — it produces multi-scale features at 1/4, 1/8, 1/16, and 1/32 resolution, so attention at higher levels operates on → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
Speaker Verification:
**Equal Error Rate (EER)**: The point where false acceptance rate equals false rejection rate - **minDCF**: Minimum Detection Cost Function → Chapter 29: Speech, Audio, and Music AI
Specialized libraries:
**Hugging Face Transformers**: The de facto standard library for working with pre-trained transformer models. - **LangChain / LlamaIndex**: Frameworks for building applications powered by large language models. - **OpenCV**: The standard library for computer vision tasks. → Chapter 1: The Landscape of AI Engineering
Speech Recognition:
**Word Error Rate (WER)**: $\frac{S + D + I}{N}$ where $S$ = substitutions, $D$ = deletions, $I$ = insertions, $N$ = total reference words - **Character Error Rate (CER)**: Same as WER but at character level → Chapter 29: Speech, Audio, and Music AI
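A worked example using the `jiwer` library (an assumed tooling choice): one deletion out of six reference words gives WER = 1/6.

```python
# pip install jiwer
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"        # one deletion: S=0, D=1, I=0, N=6
print(jiwer.wer(reference, hypothesis))  # (S + D + I) / N = 1/6 ≈ 0.167
```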
SQuAD (Stanford Question Answering Dataset)
URL: https://rajpurkar.github.io/SQuAD-explorer/ - Description: SQuAD 1.1 contains 100,000+ question-answer pairs based on Wikipedia passages (extractive QA). SQuAD 2.0 adds 50,000+ unanswerable questions. - License: CC BY-SA 4.0. - Chapters: 10, 12, 20. - Access: `datasets.load_dataset("squad")` → Appendix D: Data Sources and Datasets
Stable Diffusion / SDXL
Description: Open-source text-to-image models that can generate synthetic training images. Useful for data augmentation in computer vision. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
Stage 1: Feature Alignment Pre-training
Data: 595K image-text pairs from CC3M (filtered) - Only the projection layer $\mathbf{W}$ is trained; both the vision encoder and LLM are frozen - Objective: Image captioning (generate the caption given the image) - This stage teaches the projection layer to translate visual features into the LLM's → Chapter 28: Multimodal Models and Vision-Language AI
Stage 2: Visual Instruction Tuning
Data: 158K multimodal instruction-following examples generated using GPT-4 - The projection layer and the LLM are trained; the vision encoder remains frozen - Data includes conversations, detailed descriptions, and complex reasoning questions - This stage teaches the model to follow multimodal instr → Chapter 28: Multimodal Models and Vision-Language AI
Step 1: Calibration Data Preparation
Selected 256 representative conversations from production logs. - Ensured coverage of all financial advisory topics: portfolio allocation, tax planning, retirement, risk assessment. - Tokenized to match the model's expected input format. → Case Study 1: Deploying a Quantized LLM with vLLM
Structured reading habits:
**Daily arXiv scan**: Use tools like arXiv Sanity, Semantic Scholar alerts, or Papers With Code to surface relevant new papers. Aim to skim 5-10 abstracts daily and deep-read 1-2 papers weekly. - **Conference proceedings**: The top venues (NeurIPS, ICML, ICLR, ACL, CVPR) publish proceedings freely. → Chapter 40: The Future of AI Engineering
SuperGLUE
URL: https://super.gluebenchmark.com - Description: A more challenging successor to GLUE, including tasks like BoolQ (yes/no QA), CB (textual entailment with three classes), MultiRC (multi-sentence reading comprehension), and WiC (word-in-context). - Size: Varies; generally smaller than GLUE tasks. → Appendix D: Data Sources and Datasets
superposed
they represent multiple features simultaneously in overlapping directions. A sparse autoencoder with a large overcomplete basis can disentangle these superposed features into individual, interpretable units. → Chapter 16: Autoencoders and Representation Learning
system prompt
a special instruction that defines the model's persona, capabilities, constraints, and behavioral guidelines. The system prompt is typically prepended to every conversation and is treated as higher-priority than user instructions. → Chapter 22: Scaling Laws and Large Language Models
a layered architecture of hardware, software, and services that work together to enable training, serving, and monitoring of AI models. Understanding this stack is essential for AI engineers, who must make informed decisions at every layer. → Chapter 1: The Landscape of AI Engineering
Temporal Features:
[ ] Extract all relevant time components (hour, day, month, day of week). - [ ] Apply cyclical encoding for periodic features. - [ ] Create lag, rolling, and expanding window features for time series. - [ ] Verify that no future information leaks into historical features. → Chapter 9: Feature Engineering and Data Pipelines
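The cyclical-encoding item maps each periodic value onto a circle so that, for example, hour 23 sits next to hour 0; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": np.arange(24)})
# sin/cos pair places hour 23 adjacent to hour 0 in feature space.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```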
temporally overlapping positive pairs
since video narrations are loosely aligned with visual content, VideoCLIP samples overlapping (but not identical) time windows for positive pairs, creating a softer contrastive signal that handles the inherent temporal misalignment in narrated video. → Chapter 30: Video Understanding and Generation
Text Features:
[ ] Preprocess text (lowercasing, removing noise, tokenizing). - [ ] Choose between BoW, TF-IDF, and embeddings based on the task and data volume. - [ ] Consider n-grams for capturing multi-word patterns. → Chapter 9: Feature Engineering and Data Pipelines
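A minimal sketch of the TF-IDF-with-n-grams option using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the model overfits", "the model generalizes well"]
vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(docs)                 # sparse TF-IDF matrix
print(X.shape, vec.get_feature_names_out()[:5])
```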
The convergence trajectory:
**2020-2022**: Separate models for each modality (GPT-3 for text, DALL-E for images, Whisper for audio). - **2023-2024**: Multimodal models that handle two or three modalities (GPT-4V for text+images, Gemini for text+images+video). - **2025+**: Omni-modal models that natively process and generate al → Chapter 40: The Future of AI Engineering
The current landscape (2025):
*Closed frontier*: Models from Anthropic (Claude), OpenAI (GPT-4, o3), and Google (Gemini) remain the most capable, particularly for complex reasoning tasks. - *Open-weight frontier*: Meta's Llama 3.1 (405B), DeepSeek-V3, Mistral Large, and others provide openly available weights that match or appro → Chapter 40: The Future of AI Engineering
The Pile
Description: An 800GB curated dataset combining 22 high-quality sub-datasets (academic papers, books, code, etc.) designed for LLM pre-training. - License: MIT (compilation); individual components vary. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Token pruning/merging
reduce sequence length by removing or combining less important tokens. (4) **Linear attention** — approximate softmax attention with kernel methods for $O(Nd^2)$ cost. (5) **FlashAttention** — IO-aware implementation that doesn't reduce FLOPs but dramatically improves wall-clock time and memory usag → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
tokenizer
the algorithm that converts raw text into the subword units the model processes. Unlike traditional word-level tokenization, modern approaches operate at the **subword** level, balancing vocabulary size with the ability to represent any text. → Chapter 20: Pre-training and Transfer Learning for NLP
translation invariance
small shifts in the input lead to the same (or very similar) max values. This complements the translation *equivariance* of the convolution operation itself. → Chapter 14: Convolutional Neural Networks
TriviaQA
Description: 650,000 question-answer-evidence triples gathered from trivia and quiz-league websites. - Chapters: 20, 26. → Appendix D: Data Sources and Datasets
TTS:
**Mean Opinion Score (MOS)**: Subjective human rating on a 1-5 scale - **Mel Cepstral Distortion (MCD)**: Objective measure of spectral distance - **PESQ/POLQA**: Perceptual evaluation of speech quality → Chapter 29: Speech, Audio, and Music AI
Typical findings for BERT-like models:
**Layer 0 (embeddings)**: Encodes surface features (word identity, position). POS tagging probes already achieve moderate accuracy. - **Layers 1-4**: Syntactic information (POS tags, dependency relations, constituency) is maximally represented. Probing accuracy for syntactic tasks peaks in this rang → Chapter 38: Interpretability, Explainability, and Mechanistic Understanding
U
UCI Machine Learning Repository
URL: https://archive.ics.uci.edu - Description: A long-standing collection of 600+ datasets for machine learning research. Includes classics like Iris, Wine, Adult Census, and Boston Housing. - License: Varies; most are freely available for research. - Chapters: 3, 4, 5, 6. → Appendix D: Data Sources and Datasets
UltraFeedback
Description: 64,000 instructions with responses from multiple models, scored on helpfulness, honesty, instruction following, and truthfulness. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
Understand the domain
What do the raw features mean? What relationships might exist? 2. **Transform features** --- Apply mathematical transformations, encode categories, extract temporal patterns. 3. **Create new features** --- Combine existing features, compute aggregates, derive domain-specific indicators. 4. **Select → Chapter 9: Feature Engineering and Data Pipelines
Use BERT when:
Your task has a fixed output format (classification, token labeling, span extraction). - You need fast inference (encoder-only models are faster than encoder-decoder for classification). - Your fine-tuning dataset is small and you want maximum parameter efficiency. → Chapter 20: Pre-training and Transfer Learning for NLP
Use PyTorch when:
Building production systems - Experimenting with architectures - Training on GPUs - Using pre-built layers, losses, and optimizers - Collaborating with others (PyTorch is the de facto standard in research) → Case Study 2: Comparing NumPy and PyTorch Implementations
Use T5 when:
Your task requires generating variable-length text output (summarization, translation, open-ended QA). - You want to use a single model architecture for multiple different tasks. - You want to frame new tasks flexibly --- just choose a new text prefix. - You are comfortable with the slightly higher → Chapter 20: Pre-training and Transfer Learning for NLP
Use the NumPy approach when:
Learning the fundamentals (this chapter) - Debugging mysterious gradient behavior - Teaching or explaining neural networks - Implementing custom operations not available in PyTorch → Case Study 2: Comparing NumPy and PyTorch Implementations
V
Variants of activation patching:
**Resample ablation**: Replace the activation with a sample from its empirical distribution (rather than from a specific corrupted input), measuring the component's overall importance. - **Path patching**: Patch activations along specific computational paths (e.g., from one attention head to a speci → Chapter 38: Interpretability, Explainability, and Mechanistic Understanding
Visual Question Answering (VQA)
URL: https://visualqa.org - Description: 265,000 images with 760,000+ questions and 10 ground-truth answers each. - Chapters: 23. → Appendix D: Data Sources and Datasets
VoxCeleb / VoxCeleb2
Description: Speaker recognition datasets containing speech from thousands of celebrities. VoxCeleb2 has over 1 million utterances from 6,000+ speakers. - Chapters: 24. → Appendix D: Data Sources and Datasets
W
Warmup strategies:
**Linear warmup** (most common): The learning rate increases linearly from 0 to the target - **Exponential warmup**: The learning rate increases exponentially, spending more time at low rates - **Gradual warmup** (Goyal et al., 2017): For very large batch training, warmup over 5--10 epochs → Chapter 12: Training Deep Networks
directly into the architecture. These biases dramatically reduce parameter count while making networks equivariant to spatial translations. CNNs have dominated computer vision for over a decade, and their principles extend to audio, text, and scientific computing. → Chapter 14 Key Takeaways
What to look for:
**Activation means** should be near zero (for zero-centered activations like tanh) or near 0.5 times the standard deviation (for ReLU) - **Activation standard deviations** should be roughly constant across layers (not growing or shrinking) - **Dead fraction** (fraction of neurons outputting exactly → Chapter 12: Training Deep Networks
When NOT to use it:
During actual training (far too slow---requires 2 forward passes per parameter) - In production code - On very large networks (impractical) → Quiz: Neural Networks from Scratch
When to use it:
When implementing backpropagation from scratch, to verify your analytical gradients are correct - When debugging a custom layer or loss function - As a one-time verification step during development → Quiz: Neural Networks from Scratch
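A minimal central-difference checker, which also shows why this is verification-only: two function evaluations per parameter. The quadratic test function is illustrative.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    # Central differences: two evaluations of f per parameter.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Verify the analytical gradient of f(w) = ||w||^2, which is 2w.
w = np.random.randn(3, 2)
analytic = 2 * w
numeric = numerical_grad(lambda v: np.sum(v ** 2), w)
print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-9)
```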
When to Use PCA:
**Preprocessing for supervised learning**: Reduce feature dimensionality before training a classifier or regressor (Chapters 5-6). This can reduce overfitting and speed up training. - **Visualization**: Project high-dimensional data to 2D or 3D for exploration. - **Noise reduction**: Discarding low- → Chapter 7: Unsupervised Learning and Dimensionality Reduction
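A sketch of the preprocessing use case with scikit-learn, keeping enough components to explain 95% of the variance before classification:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_components in (0, 1) selects enough components for that
# fraction of explained variance.
clf = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                    LogisticRegression(max_iter=1000))
print(clf.fit(X_tr, y_tr).score(X_te, y_te))
```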
When to use which:
**Mean:** When data is approximately normally distributed and MCAR. - **Median:** When data is skewed or contains outliers (preferred default). - **Most frequent:** For categorical features. - **Constant:** When you want the model to learn that "missing" is a distinct category. → Chapter 9: Feature Engineering and Data Pipelines
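All four strategies map onto scikit-learn's `SimpleImputer`; a minimal sketch of two of them on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])   # skewed, with an outlier

median_imp = SimpleImputer(strategy="median")     # robust default
constant_imp = SimpleImputer(strategy="constant", fill_value=-1)
print(median_imp.fit_transform(X).ravel())        # NaN -> 2.0 (median)
print(constant_imp.fit_transform(X).ravel())      # NaN -> -1.0
```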
X
XSum (Extreme Summarization)
Description: 227,000 BBC news articles with single-sentence summaries. Tests models' ability to generate highly abstractive summaries. - Access: `datasets.load_dataset("xsum")` - Chapters: 12, 15. → Appendix D: Data Sources and Datasets
Z
zero breaks symmetry
otherwise all neurons compute identical gradients and never differentiate. - Biases are almost always initialized to zero. - Proper initialization keeps activation variance and gradient variance stable across layers. → Chapter 12 Key Takeaways