(n_out, n_in) - **a**^[l-1] has shape **(64, 16)** --- (n_in, batch_size) - **b**^[l] has shape **(32, 1)** --- (n_out, 1), broadcast across the 16 columns - **z**^[l] has shape **(32, 16)** --- (n_out, batch_size) → Quiz: Neural Networks from Scratch
1. ML/AI Research Scientist
Focus: Advance the state of the art through novel algorithms and architectures. - Skills: Deep mathematical foundations, experimental design, scientific writing. - Trajectory: PhD → research lab → principal researcher → research director. → Chapter 40: The Future of AI Engineering
1. Talent
The local job market (a mid-sized city in the southeastern United States) had limited AI/ML talent. - Competing for experienced ML engineers against major tech companies and well-funded startups was difficult. - Existing engineering staff had strong software skills but limited ML experience. → Case Study 2: Building an AI Engineering Team from Scratch
1.1 Data Collection
Collect raw domain text from at least two different sources. - Document each source: URL, license/terms of use, approximate size, collection method. - Collect at least 50,000 raw text passages (before filtering). → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.1 Document Parsing
Support at least three document formats: PDF, HTML, and Markdown. - Use appropriate parsing libraries (e.g., `pymupdf` or `pdfplumber` for PDF, `beautifulsoup4` for HTML). - Extract and preserve document structure: titles, headings, paragraphs, tables, and lists. - Handle encoding issues, malformed → Capstone Project 1: Build a Production RAG System with Guardrails
1.1 Image Understanding Pipeline
Implement image captioning using a vision-language model. Recommended models: - LLaVA 1.6 (open source, strong performance). - InternVL2 (open source, strong on benchmarks). - GPT-4o / Claude via API (highest quality, closed source). - Implement visual question answering (VQA): given an image and a → Capstone Project 3: End-to-End Multimodal AI Application
1.2 Chunking Strategy
Implement at least two chunking strategies: - Fixed-size chunking with configurable overlap (e.g., 512 tokens with 64-token overlap). - Semantic chunking that respects document structure (split at heading boundaries, paragraph breaks). - Each chunk must carry metadata: source document ID, chunk inde → Capstone Project 1: Build a Production RAG System with Guardrails
1.2 Data Cleaning and Filtering
Remove duplicates (exact and near-duplicate using MinHash or similar). - Filter for quality using at least two heuristics: - Minimum length (e.g., at least 50 words). - Language detection (ensure all text is in the target language). - Perplexity filtering: use a reference language model to remove te → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.2 Document Processing Pipeline
Process complex documents (PDFs, Word documents) that contain: - Running text paragraphs. - Tables (extract to structured format). - Figures and charts (extract images, generate captions). - Headers and structural elements. - For each extracted element, maintain: - The content (text or image). - The → Capstone Project 3: End-to-End Multimodal AI Application
1.3 Embedding Generation
Generate dense embeddings using a sentence transformer model (e.g., `all-MiniLM-L6-v2` for prototyping, `bge-large-en-v1.5` or `gte-large` for production). - Implement batched embedding generation with progress tracking. - Store embeddings in a vector database (Qdrant, ChromaDB, or Weaviate). - Buil → Capstone Project 1: Build a Production RAG System with Guardrails
1.3 Instruction Dataset Creation
Create an instruction-tuning dataset in the conversational format expected by your chosen base model. Each example should include: - A system message (optional, defining the domain expert role). - A user instruction/question. - An assistant response. - Create examples using a combination of: - **Exi → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.3 Multimodal Embedding
Generate embeddings for both text and images in a shared embedding space. - Use a model such as CLIP (`openai/clip-vit-large-patch14`) or SigLIP for joint text-image embeddings. - For text-only content, also generate text embeddings using a sentence transformer (for higher-quality text retrieval). - → Capstone Project 3: End-to-End Multimodal AI Application
1.4 Data Format
Store data in JSONL format with the following schema: ```json { "messages": [ {"role": "system", "content": "You are an expert in [domain]..."}, {"role": "user", "content": "What is..."}, {"role": "assistant", "content": "Based on..."} ], "source": "pubmed_qa", "quality_score": 0.92 } ``` - Create a → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
1.4 Input Validation
Validate all inputs: - Images: Check file format, size limits (e.g., max 20MB), resolution limits, corrupt file detection. - Text: Length limits, encoding validation. - Documents: Format validation, page count limits, malware scanning (optional). - Return informative error messages for invalid input → Capstone Project 3: End-to-End Multimodal AI Application
1.4 Metadata Storage
Store document-level metadata in a relational database (SQLite for development, PostgreSQL for production). - Track: document ID, filename, source URL, ingestion time, number of chunks, format, file hash (for deduplication). - Implement idempotent ingestion: re-ingesting the same document should upd → Capstone Project 1: Build a Production RAG System with Guardrails
2. Infrastructure
No GPU compute infrastructure existed. - The data warehouse was a traditional SQL Server-based system optimized for reporting, not for ML training workloads. - No experiment tracking, model registry, or model serving infrastructure was in place. → Case Study 2: Building an AI Engineering Team from Scratch
2. ML/AI Engineer
Focus: Build and deploy production ML systems. - Skills: Software engineering, MLOps, system design, debugging at scale. - Trajectory: SDE → ML engineer → senior/staff ML engineer → engineering manager. → Chapter 40: The Future of AI Engineering
2.1 Base Model Selection
Choose an open-source base model. Recommended options: - **Llama 3.1 8B Instruct**: Strong baseline, well-documented. - **Mistral 7B Instruct v0.3**: Efficient architecture with sliding window attention. - **Gemma 2 9B Instruct**: Strong performance per parameter. - Document your selection rationale → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.1 Knowledge Base Ingestion
Build an ingestion pipeline that processes a collection of documents and images into the multimodal knowledge base. - For each ingested item, store: - The original content (or a reference to it). - Text embedding(s). - Image embedding(s), if the content contains images. - CLIP embedding (for cross-m → Capstone Project 3: End-to-End Multimodal AI Application
2.1 Query Processing
Implement query rewriting using an LLM: given a potentially ambiguous user query, generate an improved search query. - Implement query expansion: generate 2-3 variant queries to increase recall. - Support conversation context: for follow-up questions, resolve coreferences using chat history. → Capstone Project 1: Build a Production RAG System with Guardrails
2.2 Cross-Modal Retrieval
Implement the following retrieval modes: - **Text-to-text**: Standard text search (dense + BM25 hybrid). - **Text-to-image**: Find relevant images given a text query (using CLIP embeddings). - **Image-to-text**: Find relevant text given an image query (using CLIP embeddings). - **Image-to-image**: F → Capstone Project 3: End-to-End Multimodal AI Application
2.2 Hybrid Retrieval
Implement dense retrieval using the vector database (cosine similarity or dot product). - Implement sparse retrieval using BM25. - Combine results using Reciprocal Rank Fusion (RRF): `score(d) = sum_r 1 / (k + rank_r(d))` where k is typically 60 and the sum is over all retrieval methods. - Retrieve → Capstone Project 1: Build a Production RAG System with Guardrails
2.2 LoRA Configuration
Implement fine-tuning using the HuggingFace `peft` and `trl` libraries. - Configure LoRA with the following as starting points (then experiment): - Rank (r): 8, 16, 32, 64. - Alpha: 16, 32 (typically 2x rank). - Target modules: attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and optio → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.3 Multi-Modal Fusion
When a query involves both text and images (e.g., "Find documents similar to this one" with an uploaded document containing text and figures), combine retrieval results across modalities. - Implement a fusion strategy: - Reciprocal Rank Fusion across modalities. - Or a weighted combination where wei → Capstone Project 3: End-to-End Multimodal AI Application
2.3 Re-Ranking
Implement a cross-encoder re-ranker (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) that scores each (query, passage) pair. - The re-ranker should take the top-20 fused results and output the top-5. - Log retrieval latency breakdown: dense search time, sparse search time, fusion time, re-ranking time → Capstone Project 1: Build a Production RAG System with Guardrails
2.3 Training Configuration
Optimizer: AdamW with weight decay 0.01. - Learning rate: Sweep over {5e-6, 1e-5, 2e-5, 5e-5} with cosine schedule. - Warmup: 10% of total steps. - Batch size: Effective batch size of 16-64 (use gradient accumulation as needed). - Maximum sequence length: 2048 tokens (or 4096 if resources allow). - → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.4 Experiment Tracking
Track all experiments using Weights & Biases (wandb) or MLflow. - For each run, log: hyperparameters, training loss curve, validation loss curve, learning rate schedule, GPU utilization, peak memory, training time. - Run at least 6 experiments varying: - LoRA rank (at least 3 values). - Learning rat → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
2.4 Retrieval Evaluation
Implement retrieval evaluation using a manually curated set of at least 25 query-document relevance pairs. - Compute metrics: Recall@5, Recall@10, MRR@10, NDCG@10. - Compare: dense-only, sparse-only, hybrid without re-ranking, and hybrid with re-ranking. - Present results in a table showing the impa → Capstone Project 1: Build a Production RAG System with Guardrails
2.5 Training Best Practices
Implement early stopping based on validation loss. - Save checkpoints at regular intervals. - Use gradient clipping (max norm 1.0). - Monitor for catastrophic forgetting by periodically evaluating on a general-purpose benchmark (e.g., a subset of MMLU). → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Focus: Build the platforms, frameworks, and tooling that ML teams depend on. - Skills: Distributed systems, GPU programming, compiler optimization, cloud architecture. - Trajectory: Systems engineer → AI infra engineer → architect → VP of engineering. → Chapter 40: The Future of AI Engineering
3. Organizational Readiness
Business stakeholders had vague expectations ("We need AI") without specific, well-defined use cases. - Regulatory compliance (banking regulators require model risk management, explainability, and audit trails for models used in lending decisions) imposed constraints that most AI tutorials and blog → Case Study 2: Building an AI Engineering Team from Scratch
3.1 Automated Evaluation
Evaluate on the held-out test set using: - **Perplexity**: Compare base model vs. fine-tuned model on domain text. - **Generation quality**: ROUGE-1, ROUGE-L, and BERTScore against reference answers. - **Exact match / F1**: For extractive QA-style questions. - **Domain terminology accuracy**: Check → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.1 Intent Classification and Routing
Implement an intent classifier that categorizes user requests into types: - **Question answering**: User asks a question about uploaded or stored content. - **Search**: User wants to find specific content in the knowledge base. - **Analysis**: User wants detailed analysis of an uploaded image or doc → Capstone Project 3: End-to-End Multimodal AI Application
3.1 Prompt Engineering
Design a system prompt that instructs the LLM to: - Answer based only on the provided context. - Cite sources using bracketed references (e.g., [1], [2]). - State "I don't have enough information to answer this question" when the context is insufficient. - Maintain a professional, helpful tone. - Im → Capstone Project 1: Build a Production RAG System with Guardrails
3.2 Domain-Specific Benchmark
Create or adapt a domain-specific benchmark with at least 100 questions spanning: - Factual recall (e.g., "What is the standard treatment for X?"). - Reasoning (e.g., "Given these symptoms, what is the most likely diagnosis?"). - Summarization (e.g., "Summarize the key findings of this report."). - → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.2 LLM Integration
Support at least two LLM backends: - A cloud API (OpenAI GPT-4, Anthropic Claude, or equivalent). - A local model served via vLLM or Ollama (e.g., Llama 3.1 8B, Mistral 7B). - Implement a clean abstraction layer so backends can be swapped via configuration. - Support streaming responses (server-sent → Capstone Project 1: Build a Production RAG System with Guardrails
3.2 Tool Integration
Define and implement at least four tools that the agent can invoke: - `search_knowledge_base(query, modality_filter, top_k)` -- Retrieve from the knowledge base. - `analyze_image(image, question)` -- Get detailed analysis of an image. - `extract_text(image_or_document)` -- OCR and text extraction. - → Capstone Project 3: End-to-End Multimodal AI Application
3.3 Citation Generation
Each answer must include inline citations referencing specific retrieved chunks. - After the answer, include a "Sources" section listing the cited documents with their metadata (title, page number, etc.). - Implement citation verification: check that cited chunk IDs actually exist in the retrieved r → Capstone Project 1: Build a Production RAG System with Guardrails
3.3 Human Evaluation
Conduct human evaluation on at least 50 test examples. - For each example, a human rater (you, a teammate, or a recruited evaluator) scores responses from the base model and the fine-tuned model on: - **Correctness** (1-5): Is the information factually accurate? - **Helpfulness** (1-5): Does the res → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.3 Multi-Step Reasoning
Implement a ReAct-style agent loop: 1. **Think**: The LLM reasons about what to do next. 2. **Act**: The LLM selects a tool and provides arguments. 3. **Observe**: The tool result is returned to the LLM. 4. Repeat until the task is complete or a maximum number of steps is reached. - The agent should → Capstone Project 3: End-to-End Multimodal AI Application
3.4 Context Window Management
Implement intelligent context truncation: when retrieved chunks exceed the model's context window, prioritize higher-ranked chunks. - Track and log token usage for each request (prompt tokens, completion tokens, total tokens). - Estimate cost per request. → Capstone Project 1: Build a Production RAG System with Guardrails
3.4 Conversation Management
Maintain conversation state across multiple turns. - Support multimodal conversation history: the agent should remember and reference previously uploaded images and documents within the session. - Implement context window management: summarize or truncate older conversation history when approaching → Capstone Project 3: End-to-End Multimodal AI Application
3.4 Safety and Robustness Evaluation
Test for: - Hallucination rate: Fraction of responses containing fabricated information. - Refusal appropriateness: Does the model refuse to answer questions outside its expertise? - Adversarial robustness: Test with intentionally misleading or adversarial prompts. - General capability retention: Ev → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.5 Error Analysis
Manually categorize errors from the test set into types: - Factual errors (wrong information). - Incomplete answers (missing key details). - Hallucinations (fabricated information). - Format errors (wrong structure or style). - Refusal errors (refuses a legitimate question). - For each error type, p → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
3.5 Error Handling and Recovery
When a tool call fails, the agent should: - Log the error. - Attempt an alternative approach (different tool, different parameters). - Inform the user if recovery is not possible. - When the agent cannot answer a question, it should clearly state this rather than fabricating an answer. → Capstone Project 3: End-to-End Multimodal AI Application
3c. Client-Side Optimization
**Speculative prefetch**: The IDE plugin prefetched completions as the user typed, predicting likely pause points. - **Debouncing**: Requests were only sent after a 50ms typing pause to avoid unnecessary calls during fast typing. - **Caching**: Recent completions were cached client-side and reused i → Case Study 2: Optimizing Inference Latency for Production
4. AI Product Manager
Focus: Translate AI capabilities into user-facing products. - Skills: Product sense, technical literacy, user research, business strategy. - Trajectory: PM → AI PM → director of AI products → VP of product. → Chapter 40: The Future of AI Engineering
**Query validation**: Reject empty queries, queries exceeding maximum length, and queries that consist solely of special characters. - **PII detection**: Detect and optionally redact personally identifiable information (email addresses, phone numbers, SSNs) from user queries before they are sent to → Capstone Project 1: Build a Production RAG System with Guardrails
4.1 Merge and Export
Merge the LoRA adapter weights into the base model to create a standalone model. - Save the merged model in HuggingFace format. - Verify that the merged model produces identical outputs to the adapter-based model. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.1 Text Guardrails
Implement the following (reusing and extending techniques from Capstone 1 where applicable): - Input validation and sanitization. - Prompt injection detection. - PII detection and redaction. - Toxicity filtering on outputs. - Topic boundary enforcement (keep responses within the application's domain → Capstone Project 3: End-to-End Multimodal AI Application
4.2 Image Guardrails
Implement content moderation for uploaded images: - NSFW detection using a pre-trained classifier (e.g., `Falconsai/nsfw_image_detection` or the `safety_checker` from Stable Diffusion). - Violence/gore detection. - Configurable strictness levels. - For generated image descriptions: - Verify descript → Capstone Project 3: End-to-End Multimodal AI Application
4.2 Output Guardrails
**Faithfulness check**: Implement a check that verifies the generated answer is supported by the retrieved context. Approaches include: - NLI-based: Use a Natural Language Inference model to check entailment between context and answer. - LLM-based: Ask a separate LLM call to verify faithfulness. - * → Capstone Project 1: Build a Production RAG System with Guardrails
4.2 Quantization Methods
Apply at least two of the following quantization methods: - **GPTQ** (4-bit, 128 group size): Use the `auto-gptq` library with a calibration dataset of 128-256 examples from your training data. - **AWQ** (4-bit): Use the `autoawq` library. Compare with GPTQ on quality and speed. - **GGUF** (for llam → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.3 Fallback Behavior
When guardrails are triggered, return a graceful, informative error message rather than a generic error. - Log all guardrail activations with the triggering input/output for later review. - Implement a configurable guardrail strictness level (strict, moderate, permissive). → Capstone Project 1: Build a Production RAG System with Guardrails
4.3 Multimodal Safety
Address cross-modal attacks: - Images with hidden text designed to manipulate the vision-language model (visual prompt injection). - Documents with malicious content embedded in images. - Implement output consistency checks: if the system references an image, verify the textual description matches t → Capstone Project 3: End-to-End Multimodal AI Application
4.3 Quality-Speed Tradeoff Analysis
Create a plot showing: x-axis = inference speed (tokens/sec), y-axis = quality metric (e.g., average score), with each point labeled by quantization method and bit width. - Identify the best quantization method for your use case (the one offering the best quality-speed tradeoff). - Document any qual → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
4.4 Bias Monitoring
Implement basic bias monitoring for the vision-language pipeline: - Track whether image descriptions exhibit demographic biases (e.g., making assumptions about gender or ethnicity from appearance). - Log demographic-related terms in generated descriptions for periodic review. - Document known biases → Capstone Project 3: End-to-End Multimodal AI Application
4.5 Privacy
Implement data retention policies: configurable auto-deletion of uploaded content after a specified period. - Ensure uploaded images and documents are not sent to external APIs without user consent (provide a local-only mode). - Log access to stored content for audit purposes. → Capstone Project 3: End-to-End Multimodal AI Application
5. AI Safety and Alignment Researcher
Focus: Ensure AI systems behave as intended and do not cause harm. - Skills: Formal methods, game theory, interpretability, ethics, policy. - Trajectory: Research assistant → researcher → research lead → head of safety. → Chapter 40: The Future of AI Engineering
5.1 API Design
Build the API using FastAPI with the following endpoints: - `POST /query` -- Submit a question and receive an answer with sources. - `POST /query/stream` -- Submit a question and receive a streaming response (SSE). - `POST /ingest` -- Upload a document for ingestion. - `POST /feedback` -- Submit use → Capstone Project 1: Build a Production RAG System with Guardrails
5.1 Serving Engine Setup
Deploy the model using vLLM (recommended) or HuggingFace Text Generation Inference (TGI). - Configure: - Tensor parallelism (if multiple GPUs are available). - Maximum model length. - GPU memory utilization target (e.g., 90%). - Maximum number of concurrent requests. - Verify the model loads and gen → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.1 User Interface
Build a web-based user interface using Streamlit, Gradio, or React. - The UI must support: - Text input (chat-style). - Image upload (drag-and-drop or file picker). - Document upload (PDF, Word). - Display of multimodal responses: text with inline images, tables, and citations. - Conversation histor → Capstone Project 3: End-to-End Multimodal AI Application
5.2 API Design
Implement a complete REST API (FastAPI) with the following endpoints: - `POST /chat` -- Send a message (text and/or images) and receive a response. - `POST /upload` -- Upload a document or image to the knowledge base. - `POST /search` -- Search the knowledge base. - `GET /session/{session_id}` -- Re → Capstone Project 3: End-to-End Multimodal AI Application
5.2 API Layer
Build a FastAPI wrapper around the serving engine with endpoints: - `POST /generate` -- Single completion request. - `POST /generate/stream` -- Streaming completion (SSE). - `POST /chat` -- Multi-turn chat completion (handles message formatting). - `GET /health` -- Health check (model loaded, GPU av → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.2 Observability
Log every request with: request ID, timestamp, query, retrieval latency, generation latency, total latency, token count, guardrail outcomes, and model used. - Implement structured logging (JSON format) using Python's `logging` module or `structlog`. - Track the following metrics: - Request rate (req → Capstone Project 1: Build a Production RAG System with Guardrails
5.3 Containerization
Provide a `Dockerfile` for the application. - Provide a `docker-compose.yml` that launches the API, vector database, and any other dependencies. - Include environment variable configuration for all secrets and model paths. → Capstone Project 1: Build a Production RAG System with Guardrails
5.3 Infrastructure
Provide Docker Compose configuration that launches all services: - Application server (FastAPI + agent logic). - Vector database (Qdrant or ChromaDB). - Metadata database (PostgreSQL or SQLite). - (Optional) Local model server (Ollama or vLLM). - Document all environment variables and configuration → Capstone Project 3: End-to-End Multimodal AI Application
5.3 Performance Optimization
Benchmark the serving setup: - Latency: Time to first token (TTFT) and inter-token latency (ITL) at various input lengths (128, 512, 1024, 2048 tokens). - Throughput: Maximum requests per second at different concurrency levels (1, 4, 8, 16, 32 concurrent requests). - Use a load-testing tool (e.g., ` → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.4 Monitoring and Logging
Log every request: timestamp, input length, output length, latency, tokens/second, model parameters (temperature, top_p). - Track GPU utilization and memory in real time. - Implement a simple dashboard or log summary script. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Create a `Dockerfile` for the complete serving stack. - Provide a `docker-compose.yml` if multiple services are involved. - Document GPU passthrough configuration for Docker. - Include a startup script that handles model download/loading. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
5.5 Monitoring
Implement structured logging for all operations. - Track and report: - Request volume by type. - Latency breakdown by pipeline stage (vision, retrieval, generation). - Error rates by type and severity. - Model usage and cost estimates. - Storage usage (knowledge base size). - User feedback statistic → Capstone Project 3: End-to-End Multimodal AI Application
6. AI Ethics and Policy Specialist
Focus: Navigate the regulatory, ethical, and societal dimensions of AI. - Skills: Law, policy analysis, stakeholder engagement, technical understanding. - Trajectory: Policy analyst → AI policy lead → chief AI ethics officer. → Chapter 40: The Future of AI Engineering
6.1 End-to-End Evaluation
Create a test set of at least 50 questions with ground-truth answers and source documents. - Evaluate with the following metrics: - **Answer quality**: Human evaluation on a 1-5 scale for correctness, completeness, and clarity. Also compute automated metrics (ROUGE-L, BERTScore against reference ans → Capstone Project 1: Build a Production RAG System with Guardrails
6.2 Ablation Study
Measure the impact of each component by disabling it and re-evaluating: - Without query rewriting. - Without re-ranking. - Without output guardrails. - Dense-only vs. hybrid retrieval. - Present results in a comparison table. → Capstone Project 1: Build a Production RAG System with Guardrails
6.2 Human Evaluation
Recruit 2-3 evaluators (can be teammates or volunteers). - Have each evaluator interact with the system on 20 tasks (covering all scenario types). - Collect ratings (1-5) for: - **Correctness**: Is the response factually accurate? - **Relevance**: Does the response address the user's actual question → Capstone Project 3: End-to-End Multimodal AI Application
6.2 Live Demo
Demonstrate the deployed model answering domain-specific questions. - Show side-by-side comparison with the base model. - Demonstrate the monitoring dashboard. - Be prepared to answer questions about design decisions, failure modes, and alternative approaches. → Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Compare your system's performance to: - A text-only baseline (same system without image understanding). - A no-agent baseline (direct LLM call without tool use). - (If budget allows) A commercial API baseline (GPT-4o or Claude with vision). - Present comparisons in tables and/or charts. → Capstone Project 3: End-to-End Multimodal AI Application
6.3 Documentation
System architecture document with diagrams. - API reference (auto-generated plus any additional notes). - Deployment guide (local development, Docker, and cloud deployment instructions). - Known limitations and future work. → Capstone Project 1: Build a Production RAG System with Guardrails
6.4 Model Card and System Documentation
Produce a model card for the overall system including: - Intended use cases and users. - Input types and limitations. - Known failure modes (with examples). - Bias and fairness considerations. - Environmental impact estimate. - Produce a system documentation package: - Architecture document with dia → Capstone Project 3: End-to-End Multimodal AI Application
6.6 Presentation and Demo
Prepare a 20-minute presentation covering the key aspects of the project. - Include a live demo showing: - A text-only interaction. - An image understanding interaction. - A multi-step task requiring agent capabilities. - A cross-modal search. - A guardrail activation. - Be prepared for Q&A. → Capstone Project 3: End-to-End Multimodal AI Application
**Paper reimplementation**: Implementing a paper from scratch forces understanding that reading alone cannot achieve. Start with older, well-documented papers and work toward recent ones. - **Kaggle competitions**: Provide structured problems with well-defined evaluation, exposing you to practical t → Chapter 40: The Future of AI Engineering
Safety benchmarks: TruthfulQA, HarmBench, BBQ - Capability benchmarks: MMLU, HumanEval, GSM8K (to detect capability regression) - User satisfaction: if in production, track thumbs up/down ratings → Chapter 25: Alignment: RLHF, DPO, and Beyond
AI engineering
is still taking shape. Unlike software engineering, which has had decades to formalize its practices, or data science, which crystallized into a recognized profession in the 2010s, AI engineering sits at a dynamic intersection of mathematics, computer science, systems design, and domain expertise. I → Chapter 1: The Landscape of AI Engineering
AI has evolved through distinct eras
symbolic AI, classical machine learning, deep learning, and the transformer/foundation model era --- each contributing ideas and techniques that remain relevant today. → Chapter 1: The Landscape of AI Engineering
AI is a constellation of subfields
machine learning, NLP, computer vision, robotics, speech processing, and generative AI --- that increasingly intersect and combine. → Chapter 1: The Landscape of AI Engineering
Albumentations
Description: Fast image augmentation library. Provides geometric and photometric transformations for creating augmented training samples. - Usage: `pip install albumentations` - Chapters: 8, 22. → Appendix D: Data Sources and Datasets
Anthropic API
Description: Access to Claude models. Used for generating high-quality synthetic data, especially for complex reasoning tasks. - Chapters: 15, 17, 30. → Appendix D: Data Sources and Datasets
Anthropic HH-RLHF
Description: Human preference data for helpfulness and harmlessness. Contains pairs of model responses with human preference labels. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
Architecture tips:
Use transposed convolutions with kernel size divisible by stride to avoid checkerboard artifacts. Better yet, use nearest-neighbor upsampling followed by a regular convolution. - Avoid max pooling in the discriminator; use strided convolutions instead. - Use spectral normalization in the discriminat → Chapter 17: Generative Adversarial Networks
area
the product of these two features. Creating `lot_area = lot_length * lot_width` gives the model direct access to this relationship without requiring it to learn a multiplicative interaction from additive terms. → Chapter 9: Feature Engineering and Data Pipelines
Atom features (per node):
Atom type: one-hot encoding of C, N, O, S, F, Cl, Br, I, P, Other (10 dims) - Degree: node degree / 4 (1 dim) - Formal charge: charge / 2 (1 dim) - Number of hydrogens: count / 4 (1 dim) - Is aromatic: binary (1 dim) - Is in ring: binary (1 dim) → Case Study 1: Molecular Property Prediction with GNNs
Atrous Spatial Pyramid Pooling (ASPP)
parallel dilated convolutions at multiple rates that capture multi-scale context. This addresses a fundamental challenge in segmentation: objects appear at many different scales, and the network must recognize both a small bird and a large building in the same image. → Chapter 14: Convolutional Neural Networks
Audio Classification:
**mean Average Precision (mAP)**: Standard for multi-label classification (AudioSet) - **Accuracy**: For single-label tasks → Chapter 29: Speech, Audio, and Music AI
audio tokenization
converting continuous audio into discrete tokens that can be modeled with standard language model architectures. This mirrors how BPE tokenization converts continuous text into discrete tokens for language models (as we saw in Chapter 3), creating a universal interface between raw signals and transf → Chapter 29: Speech, Audio, and Music AI
*Breadth without depth*: Following every new trend without mastering any. Depth creates career capital; breadth provides context. - *Tutorial paralysis*: Endlessly following tutorials without building original projects. Tutorials are training wheels; at some point, you must remove them. - *Hype-driv → Chapter 40: The Future of AI Engineering
AWS Open Data
URL: https://registry.opendata.aws - Description: Hundreds of datasets available directly on S3. Includes satellite imagery, genomics data, and Common Crawl. - Chapters: 28. → Appendix D: Data Sources and Datasets
B
Bahdanau/Luong cross-attention:
$\mathbf{Q}$: decoder hidden states - $\mathbf{K}$: encoder hidden states - $\mathbf{V}$: encoder hidden states (K = V in most formulations) → Chapter 18: The Attention Mechanism
Allows higher learning rates without divergence - Acts as a mild regularizer (due to batch noise) - Reduces sensitivity to weight initialization - Smooths the loss landscape → Chapter 12: Training Deep Networks
Benefits:
Improves model calibration (predicted probabilities better reflect true likelihoods). - Reduces overfitting, especially for datasets with noisy labels. - Encourages the model to learn more discriminative features for the penultimate layer. - Almost always helps for classification tasks. → Chapter 13: Regularization and Generalization
Maintain a personal "lab notebook" of experiments: mini-projects testing ideas, ablation studies on your own models, reproductions of published results. - When you learn a new technique, immediately apply it to a problem you care about. Abstract knowledge becomes concrete through application. - Keep → Chapter 40: The Future of AI Engineering
C
C4 (Colossal Clean Crawled Corpus)
Description: A cleaned version of Common Crawl used to train the T5 model. Approximately 750GB of text. - Access: `datasets.load_dataset("c4", "en")` - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Categorical Features:
[ ] Assess cardinality: use one-hot encoding for low-cardinality and target or frequency encoding for high-cardinality features. - [ ] Check for unseen categories in validation/test data and handle gracefully. - [ ] Consider ordinal encoding only when a natural ordering exists. → Chapter 9: Feature Engineering and Data Pipelines
Challenges encountered:
Patient data was stored in multiple incompatible formats across different hospital systems (HL7 v2, FHIR, proprietary CSV exports). - Data quality was inconsistent: missing values, coding errors, and temporal inconsistencies were common. - HIPAA compliance requirements imposed strict constraints on → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Challenges:
**Underflow:** Small gradient values become zero in fp16 (minimum positive value: ~6e-8). - **Overflow:** Large values exceed fp16 range (max: 65504). - **Loss of precision:** Accumulated rounding errors can affect training. → Chapter 35: Distributed Training and Scaling
Homogeneous numerical data -> NumPy array - Tabular data with mixed types -> pandas DataFrame - Sparse data (mostly zeros) -> scipy.sparse - Key-value pairs -> Python dict → Key Takeaways: Python for AI Engineering
CIFAR-10 / CIFAR-100
Description: 60,000 32x32 color images in 10 (or 100) classes. A standard dataset for prototyping and experimentation due to its small size. - License: MIT. - Chapters: 7, 8. - Access: `torchvision.datasets.CIFAR10(root="./data", download=True)` → Appendix D: Data Sources and Datasets
Clustering
partitioning data into meaningful groups 2. **Dimensionality reduction** --- projecting high-dimensional data into lower-dimensional spaces while preserving essential structure 3. **Anomaly detection** --- identifying data points that deviate significantly from the majority → Chapter 7: Unsupervised Learning and Dimensionality Reduction
CNN/DailyMail
Description: 300,000+ article-summary pairs from CNN and DailyMail news articles. The standard benchmark for abstractive summarization. - Access: `datasets.load_dataset("cnn_dailymail", "3.0.0")` - Chapters: 12, 15. → Appendix D: Data Sources and Datasets
COCO (Common Objects in Context)
URL: https://cocodataset.org - Description: 330,000+ images with 80 object categories. Includes annotations for object detection, segmentation, keypoints, and image captioning. - License: CC BY 4.0. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
URL: https://commoncrawl.org - Description: Petabytes of raw web data collected monthly since 2008. The basis for many pre-training datasets. - Size: Petabytes (raw); filtered subsets vary. - License: Open; content licensing varies per page. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Common Crawl Index
Description: Query interface to search Common Crawl archives by URL pattern or domain without downloading the full archive. - Chapters: 28. → Appendix D: Data Sources and Datasets
Common distributions
Bernoulli, Categorical, Gaussian, and others -- are the building blocks for modeling data in AI systems. 3. **Expectation and variance** summarize distributions and appear in loss functions, evaluation metrics, and optimization. 4. **Maximum likelihood estimation** finds parameters that maximize the → Chapter 4: Probability, Statistics, and Information Theory
Common Voice (Mozilla)
URL: https://commonvoice.mozilla.org - Description: The world's largest open multilingual voice dataset, with contributions in 100+ languages. Crowdsourced recordings with validated transcriptions. - Size: 20,000+ hours across all languages. - License: CC-0 (public domain). - Chapters: 24, 25. → Appendix D: Data Sources and Datasets
the idea that intelligence emerges from networks of simple processing units. The backpropagation algorithm, popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986, provided a practical method for training multi-layer neural networks. For the first time, researchers had a genera → Chapter 1: The Landscape of AI Engineering
Considerations:
Do not use label smoothing for knowledge distillation (the soft teacher labels already provide smoothing). - Common values: 0.1 for most tasks, 0.05 for tasks with very clean labels, 0.2 for noisy labels. - Label smoothing was a key component in the original Transformer paper (Chapter 14 will discus → Chapter 13: Regularization and Generalization
**NumPy**: The foundational library for numerical computing in Python. NumPy's n-dimensional arrays and vectorized operations form the bedrock upon which the entire Python AI ecosystem is built. You will use NumPy extensively throughout Chapters 1--10 of this book. - **SciPy**: Scientific computing → Chapter 1: The Landscape of AI Engineering
CTGAN (Conditional Tabular GAN)
Description: GAN-based approach for generating synthetic tabular data that preserves statistical properties of the original dataset. - Usage: `pip install ctgan` - Chapters: 6, 14. → Appendix D: Data Sources and Datasets
D
Data preprocessing:
Normalize images to $[-1, 1]$ (matching Tanh output in the generator) rather than $[0, 1]$. - Use data augmentation carefully: random horizontal flips are generally safe, but aggressive augmentations can confuse the discriminator. - For small datasets, consider differentiable augmentation (DiffAugme → Chapter 17: Generative Adversarial Networks
[ ] Examine data types, missing value patterns, and distributions for every feature. - [ ] Understand the domain context: what do the features represent physically? - [ ] Identify the target variable and check its distribution (balanced vs. imbalanced for classification, skewed vs. symmetric for reg → Chapter 9: Feature Engineering and Data Pipelines
Decoder:
Autoregressive text decoder with learned positional embeddings - Cross-attention to encoder outputs - Predicts tokens from a byte-level BPE vocabulary → Chapter 29: Speech, Audio, and Music AI
Neural network architectures (feedforward, CNN, RNN, transformer) - Training techniques (regularization, optimization, transfer learning) - Framework proficiency (PyTorch) → Chapter 1: The Landscape of AI Engineering
Deep learning frameworks:
**PyTorch**: The most widely used deep learning framework in both research and industry. Its dynamic computation graph and Pythonic design make it the preferred choice for most AI engineers. You will use PyTorch extensively in Chapters 11--22. - **TensorFlow / Keras**: Google's deep learning framewo → Chapter 1: The Landscape of AI Engineering
degradation problem
the surprising observation that simply adding more layers to a network can *decrease* accuracy, not because of overfitting, but because of optimization difficulties. → Chapter 14: Convolutional Neural Networks
Deployment strategy:
**Shadow mode** (Months 12--14): The ML models ran in parallel with the production rule-based system but did not surface predictions to clinicians. This allowed the team to monitor model behavior on live data without risk. - **Limited rollout** (Months 14--16): The hybrid system was deployed to 12 p → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
destroying the learned representations
a phenomenon sometimes called "catastrophic forgetting." The new classification head, initialized randomly, needs larger updates to learn meaningful weights quickly. Using **discriminative learning rates** (smaller for pretrained, larger for new layers) allows the pretrained features to adapt gently → Chapter 14: Quiz
Detection methods:
**Artifact detection**: Looking for subtle inconsistencies (lighting, reflections, lip sync, blinking patterns). First-generation deepfakes had obvious tells (blurring around face edges, inconsistent lighting), but modern generators have largely eliminated these. - **Forensic analysis**: Examining m → Chapter 39: AI Safety, Ethics, and Governance
Visual inspection of generated samples for lack of diversity. - Compute the number of distinct modes in generated data (e.g., using a pretrained classifier). For MNIST, classify 10,000 generated samples and count how many of the 10 digit classes are represented. - Monitor the discriminator's loss: i → Chapter 17: Generative Adversarial Networks
Disadvantages:
Computationally expensive for large datasets (must process all *N* examples before a single parameter update) - Cannot exploit redundancy in the data - More likely to get stuck in sharp local minima → Chapter 3: Calculus, Optimization, and Automatic Differentiation
Do the exercises
at minimum, all Part A (conceptual) and Part B (calculations) problems 5. **Study at least one case study** to see concepts applied to real scenarios 6. **Review key takeaways** before moving to the next chapter → How to Use This Book
the reasoning behind design choices matters as much as the code 5. **Evaluate rigorously** — every project includes evaluation criteria and benchmarks → Part IX: Capstone Projects
Chosen reward: $\hat{r}(y_w)$ should increase - Rejected reward: $\hat{r}(y_l)$ should decrease - Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase, but not explode - KL divergence from reference: should remain bounded (typically < 10 nats) - Accuracy: fraction of pairs correctly ordered → Chapter 25: Alignment: RLHF, DPO, and Beyond
During SFT:
Training loss (should decrease smoothly) - Validation loss (should decrease; divergence from training loss indicates overfitting) - Response quality samples (manual inspection of generated responses) → Chapter 25: Alignment: RLHF, DPO, and Beyond
E
embodied AI
AI systems that interact with the physical world through robotic bodies. The gap between simulated and physical environments (the "sim-to-real gap") is one of the central challenges in robotics. → Chapter 40: The Future of AI Engineering
emergent abilities
capabilities that appear only in models above a certain scale and are not predictable by extrapolating from smaller models. Wei et al. (2022) defined an emergent ability as one that is "not present in smaller models but is present in larger models," with performance transitioning sharply from near-c → Chapter 22: Scaling Laws and Large Language Models
Emerging compute paradigms:
**Optical computing**: Using light for matrix multiplication, potentially achieving orders-of-magnitude energy savings. Companies like Lightmatter and Luminous are developing optical interconnects and compute elements. - **Neuromorphic computing**: Chips inspired by biological neural networks (Intel → Chapter 40: The Future of AI Engineering
Encoder:
Input: 80-channel log-mel spectrogram computed from 30-second audio segments (padded or trimmed) - Two 1D convolution layers with GELU activation for initial feature processing - Sinusoidal positional encoding - Standard Transformer encoder blocks with pre-layer normalization → Chapter 29: Speech, Audio, and Music AI
Exercise 1.1
*Historical Epochs* Identify the four major epochs of AI described in this chapter. For each epoch, name one key achievement and one key limitation that motivated the transition to the next era. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.10
*Model Parameter Counting* A simple neural network has an input layer of size 784, one hidden layer of size 256, and an output layer of size 10. Calculate the total number of trainable parameters (weights and biases) in this network. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.11
*Foundation Model Scaling* GPT-2 (2019) had 1.5 billion parameters. GPT-3 (2020) had 175 billion parameters. Calculate the growth factor. If this growth factor continued for two more generations (released one year apart), estimate the parameter count for each. Discuss whether this growth rate is sus → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.12
*Data Requirements* A supervised classification model needs approximately 1,000 labeled examples per class to achieve acceptable performance. If you are building a document classifier with 50 categories and labeling each document takes an average of 2 minutes, estimate: a) The total number of labele → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.13
*Inference Latency Budget* A movie recommendation system must return results within 200ms of a user request. The system has four stages: feature retrieval (50ms), candidate generation (Xms), ranking model inference (80ms), and post-processing (20ms). What is the maximum time allowed for candidate ge → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.14
*Cost Comparison* A company runs inference for a text classification model. Option A uses a GPU server costing $3.00/hour that can handle 500 requests/second. Option B uses a CPU server costing $0.50/hour that can handle 50 requests/second. If the company processes 1 million requests per day, calcul → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.15
*AI Timeline Extension* Extend the `example-01-ai-timeline.py` code to include at least five additional milestones from the history of AI that were not included in the original timeline. Customize the visualization with different colors for different categories of milestone (theoretical, engineering → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.16
*Parameter Counter* Write a Python function using NumPy that takes a list of layer sizes (e.g., `[784, 256, 128, 10]`) and returns the total number of trainable parameters in a fully connected neural network with those layer dimensions. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.17
*Simple Linear Regression from Scratch* Using only NumPy, implement a simple linear regression model that fits a line $y = wx + b$ to a dataset. Your implementation should: a) Generate synthetic data: $y = 3x + 7 + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$ b) Implement gradient descent to le → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.18
*Ecosystem Audit* Write a Python script that audits your local AI/ML development environment. The script should: a) Check which of the following packages are installed: numpy, scipy, pandas, scikit-learn, matplotlib, torch, tensorflow, transformers b) Report the version of each installed package c) → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.19
*Decision Boundary Visualization* Using NumPy and matplotlib, create a visualization that shows the decision boundaries of three different classifiers on a 2D dataset: a) A simple threshold rule (symbolic/rule-based approach) b) A linear decision boundary (logistic regression-style) c) A non-linear → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.2
*Symbolic vs. Statistical* Explain the fundamental difference between symbolic AI and machine learning in your own words. Give a concrete example of a task where each approach might be preferred, and justify your choices. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.20
*Forward Pass Implementation* Implement a complete forward pass for a three-layer neural network using only NumPy. Your implementation should include: a) Weight initialization (random normal, scaled by layer size) b) ReLU activation for hidden layers c) Softmax activation for the output layer d) A f → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.21
*Stack Trade-offs* Consider a startup building an AI-powered customer service chatbot. They have a team of three engineers and a budget of $5,000/month for cloud infrastructure. Recommend a specific technology at each layer of the AI stack, justifying your choices based on the team size and budget c → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.22
*Build vs. Buy* For each of the following scenarios, argue whether the company should build a custom AI solution or use an off-the-shelf service (e.g., a cloud AI API). Consider cost, performance, data privacy, competitive advantage, and maintenance burden. a) A bank needs a fraud detection system f → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.23
*Failure Mode Analysis* Choose one of the following AI systems and conduct a failure mode analysis. For each failure mode, describe the potential impact and propose a mitigation strategy. a) An AI system that screens job applications b) An AI system that recommends medical treatments c) An AI system → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.24
*Paradigm Comparison* The chapter describes the evolution from symbolic AI to machine learning to deep learning to foundation models. For the task of **language translation**, describe how each paradigm would approach the problem. What are the strengths and weaknesses of each approach? Which approac → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.25
*Ethical Analysis* A healthcare company wants to build an AI system that predicts which patients are at high risk of developing a chronic disease, so that preventive interventions can be offered early. Analyze this scenario from the perspective of: a) **Fairness**: What biases might be present in th → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.26
*Technology Radar* Create a "technology radar" for AI engineering. Categorize the following technologies into one of four rings: Adopt (proven, use now), Trial (worth pursuing, not yet proven at scale), Assess (worth exploring, not yet proven), Hold (proceed with caution): - PyTorch - TensorFlow - J → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.27
*Historical Deep Dive* Choose one of the following historical AI systems and write a 500-word analysis of its approach, achievements, and limitations. Discuss how the techniques it used relate to modern AI engineering practices. a) MYCIN (medical diagnosis expert system) b) Deep Blue (chess-playing → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.28
*Industry Case Study* Select an industry (healthcare, finance, automotive, retail, or another of your choice) and research how AI engineering practices are applied in that sector. Your analysis should cover: a) Three specific AI applications currently deployed in the industry b) The key technical ch → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.29
*Open Source Investigation* Explore the GitHub repositories of two major AI/ML frameworks (e.g., PyTorch, scikit-learn, Hugging Face Transformers, LangChain). For each repository: a) How many contributors does it have? b) How frequently is it updated? c) What is its license? d) What does the project → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.3
*The Three Ingredients* The deep learning revolution required three ingredients: data, compute, and algorithms. Rank these three in order of importance for the *initial* breakthrough (circa 2012), and separately for the *current* era of foundation models. Justify your ranking in each case. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.30
*Future Forecast* Based on current trends in AI research and engineering, write a thoughtful forecast of where AI engineering will be in five years. Address: a) Which current techniques will remain dominant? b) What new capabilities might emerge? c) How will the role of the AI engineer change? d) Wh → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.4
*Foundation Model Paradigm Shift* Describe the paradigm shift from traditional ML to the foundation model approach. What are two advantages and two disadvantages of the foundation model paradigm from an AI engineer's perspective? → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.5
*Subfield Identification* For each of the following systems, identify which subfields of AI are involved and explain the role of each: a) A self-driving car navigating a city b) A chatbot that answers customer support questions c) A system that generates music in the style of a given artist d) A rob → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.6
*AI Stack Layers* Draw (or describe in text) the layers of the modern AI stack. For a company building a fraud detection system for credit card transactions, give a specific example of a tool or technology at each layer. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.7
*Role Distinctions* A startup has a team of five people building an AI-powered medical imaging product. Describe how the responsibilities of an AI engineer, a data scientist, and a software engineer on this team would differ and overlap. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.8
*AI Winters* What factors have historically caused "AI winters"? Do you think the current era of AI is susceptible to another AI winter? Provide at least three arguments for and three arguments against this possibility. → Chapter 1 Exercises: The Landscape of AI Engineering
Exercise 1.9
*Speedup Estimation* A training job takes 48 hours on a single CPU. Using the rule-of-thumb speedup factor of 10--100x for GPUs, estimate the range of training times on a single GPU. If you need the model trained within 2 hours, what is the minimum speedup factor required, and what might you do to a → Chapter 1 Exercises: The Landscape of AI Engineering
F
Faker
Description: Python library for generating realistic fake data (names, addresses, text, dates, etc.). Useful for generating test fixtures and tabular data. - Usage: `pip install faker` - Chapters: 28, 31. → Appendix D: Data Sources and Datasets
few-shot multimodal learning
learning new tasks from just a few interleaved image-text examples in the context window. → Chapter 28: Key Takeaways
Findings:
The diagonal has the highest values (each position is most similar to itself). - Off-diagonal values decay smoothly with distance. - The pattern is approximately shift-invariant: `similarity[i, j]` depends mainly on `|i - j|`. → Case Study 2: Analyzing Transformer Components
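A minimal NumPy sketch that reproduces these findings, assuming the standard sinusoidal positional encoding (Vaswani et al., 2017); the sequence length and model dimension below are arbitrary choices:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    # Standard sinusoidal positional encodings.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(64, 128)
# Cosine similarity between all pairs of positions.
norms = np.linalg.norm(pe, axis=1, keepdims=True)
similarity = (pe @ pe.T) / (norms @ norms.T)
# Diagonal is 1.0; off-diagonal values decay roughly with |i - j|.
print(similarity[0, :5])
```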
FineWeb / FineWeb-Edu
Description: High-quality filtered web text from HuggingFace, with an educational content subset particularly useful for training smaller models. - License: ODC-By 1.0. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
flat minima
regions where the loss is low across a wide neighborhood of parameter values. - **Large batches** provide more accurate gradient estimates, which tend to converge to **sharp minima**---narrow valleys in the loss landscape where small perturbations in parameters cause large increases in loss. → Chapter 13: Regularization and Generalization
For efficiency:
If we could identify winning tickets before training, we could train much smaller networks from the start, saving computation. - This has motivated extensive research in **neural network pruning** and **sparse training**. → Chapter 13: Regularization and Generalization
For regularization:
The lottery ticket hypothesis suggests that much of the network's capacity is redundant. Regularization works partly by suppressing these redundant pathways. - Dropout can be seen as a stochastic way to search for winning tickets---it forces the network to find robust subnetworks. → Chapter 13: Regularization and Generalization
For understanding generalization:
The hypothesis suggests that what matters is not the total number of parameters but the structure of the connections. This partly explains why overparameterized networks generalize well---they provide a rich search space for finding good subnetworks. → Chapter 13: Regularization and Generalization
foundation models
large models pre-trained on broad data that can be adapted to a wide range of downstream tasks. This paradigm shift fundamentally altered the economics and practice of AI engineering: → Chapter 1: The Landscape of AI Engineering
motivates much of the research discussed later in this chapter, including the double descent phenomenon (Section 13.9) and implicit regularization (Section 13.12). → Chapter 13: Regularization and Generalization
GigaSpeech
Description: 10,000 hours of English audio from audiobooks, podcasts, and YouTube. Designed as a large-scale ASR training corpus. - License: Apache 2.0. - Chapters: 24. → Appendix D: Data Sources and Datasets
GLUE (General Language Understanding Evaluation)
URL: https://gluebenchmark.com - Description: A collection of nine sentence- and sentence-pair-level NLU tasks including sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (MRPC, QQP), and linguistic acceptability (CoLA). - Size: Varies by task; SST-2 has approximately → Appendix D: Data Sources and Datasets
Google BigQuery Public Datasets
Description: Petabytes of publicly available datasets queryable via SQL. Includes GitHub activity data, Wikipedia, US Census data, and more. - Pricing: 1TB free queries per month. - Chapters: 28, 38. → Appendix D: Data Sources and Datasets
GPU training
the network was split across two GPUs due to memory constraints - **Local Response Normalization** (LRN) -- later superseded by batch normalization → Chapter 14: Convolutional Neural Networks
hard attention
the model is selecting a single input position. Soft attention (where weights are distributed) computes an interpolation between multiple encoder states, enabling smoother gradient flow. → Quiz: The Attention Mechanism
hard labels
one-hot encoded targets where the correct class has probability 1 and all others have probability 0. This forces the model to predict increasingly extreme logits to minimize cross-entropy loss, which has two problems: → Chapter 13: Regularization and Generalization
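The usual remedy in this setting is **label smoothing**, which mixes the one-hot target with a uniform distribution so the model is never pushed toward infinite logits. A minimal PyTorch sketch (the smoothing value 0.1 is a common but illustrative choice):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)             # batch of 8 examples, 10 classes
targets = torch.randint(0, 10, (8,))    # hard integer labels

hard = nn.CrossEntropyLoss()(logits, targets)
# label_smoothing=0.1 mixes the one-hot target with a uniform
# distribution: (1 - 0.1) * one_hot + 0.1 / num_classes.
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
print(hard.item(), smooth.item())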
HuggingFace Datasets Hub
URL: https://huggingface.co/datasets - Description: Centralized repository hosting 100,000+ datasets with a unified Python API. Supports streaming for large datasets. - Access: `datasets.load_dataset("dataset_name")` - Chapters: Used throughout the book. → Appendix D: Data Sources and Datasets
HuggingFace ecosystem
a collection of open-source libraries that has become the standard toolkit for working with pre-trained Transformer models. Throughout the remainder of this book, we will use these libraries alongside PyTorch. → Chapter 20: Pre-training and Transfer Learning for NLP
Identify a domain problem (healthcare, climate, education, materials science) and build an end-to-end solution using the techniques from this book. - Talk to domain experts. The most impactful AI applications come from deep understanding of the problem, not just deep understanding of the models. → Chapter 40: The Future of AI Engineering
If you are drawn to engineering:
Deploy a model to production---even a small personal project---and handle the full lifecycle: data, training, serving, monitoring. - Contribute to a major open-source ML project (PyTorch, Hugging Face, vLLM). - Build an evaluation suite for a domain you care about. → Chapter 40: The Future of AI Engineering
If you are drawn to research:
Read the top 10 most-cited papers from the last NeurIPS or ICML. Reimplement at least one. - Pick an open problem from a recent survey paper and attempt a small contribution. - Join a research reading group or start one at your organization. → Chapter 40: The Future of AI Engineering
If you are drawn to safety and governance:
Study the mechanistic interpretability research from Anthropic and DeepMind in depth. - Participate in an AI safety research program (MATS, SERI, Redwood Research). - Read the EU AI Act and build a compliance checklist for a hypothetical high-risk system. → Chapter 40: The Future of AI Engineering
ImageNet (ILSVRC)
URL: https://www.image-net.org - Description: 1.28 million training images across 1,000 classes. The foundational benchmark for image classification. ImageNet-21k has approximately 14 million images in 21,841 classes. - License: Research use only (requires registration). - Chapters: 8, 22. → Appendix D: Data Sources and Datasets
Always use the same Inception checkpoint and preprocessing for all comparisons. - FID is sensitive to the number of samples: fewer samples lead to higher variance. Report the number of samples used. - FID computed on different random seeds will vary. Compute FID multiple times and report the mean an → Chapter 17: Generative Adversarial Networks
In this chapter, you will learn to:
Manipulate probabilities using the sum rule, product rule, and Bayes' theorem - Work with discrete and continuous distributions in NumPy - Estimate parameters from data using MLE and MAP - Compute and interpret entropy, cross-entropy, and KL divergence - Apply mutual information to measure statistic → Chapter 4: Probability, Statistics, and Information Theory
Inference
the process of generating predictions from a trained model—happens millions or billions of times. For many organizations, inference costs dwarf training costs within months of deployment. → Chapter 33: Inference Optimization and Model Serving
K
Kaggle API
Description: Programmatic access to download competition data, public datasets, and submit predictions. - Usage: `pip install kaggle && kaggle datasets download -d <owner/dataset-name>` → Appendix D: Data Sources and Datasets
Kaggle Datasets
URL: https://www.kaggle.com/datasets - Description: Community-contributed datasets spanning virtually every domain. Kaggle also hosts competitions with curated datasets and leaderboards. - License: Varies per dataset. - Chapters: 3, 4, 5, 6, 38. - Access: `kaggle datasets download -d <owner/dataset-name>` → Appendix D: Data Sources and Datasets
Key augmentation choices for medical imaging:
**Conservative color jitter:** Color information (redness, pigmentation) is diagnostically important, so hue shifts are kept small. - **Rotation up to 90 degrees:** Dermoscopic images can be captured at any orientation. - **Random erasing:** Simulates partial occlusion by hair, bubbles, or artifacts → Case Study 1: Preventing Overfitting in a Medical Imaging Model
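A sketch of this policy using torchvision transforms; the exact magnitudes below are illustrative assumptions, not the case study's actual values:

```python
import torchvision.transforms as T

train_tfms = T.Compose([
    T.RandomRotation(degrees=90),             # any capture orientation
    T.ColorJitter(brightness=0.1, hue=0.02),  # kept small: color is diagnostic
    T.ToTensor(),
    T.RandomErasing(p=0.25),                  # simulate hair/bubble occlusion
])
```

Note that `RandomErasing` operates on tensors, so it must come after `ToTensor`.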
Key challenges encountered:
**Class imbalance**: Fraudulent transactions represented only 0.12% of all transactions. The team used SMOTE oversampling and cost-sensitive learning to address this. - **Feature engineering at scale**: Computing real-time features (e.g., "number of transactions in the last hour") required a streami → Case Study 2: Building an AI Engineering Team from Scratch
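A minimal sketch of SMOTE oversampling with the `imbalanced-learn` library (the synthetic data and 2% imbalance here are illustrative, not the case study's 0.12%):

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1                                  # 2% minority class

# SMOTE synthesizes new minority examples by interpolating neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_res.mean())               # 0.02 before, 0.5 after
```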
Key engineering decisions:
The team used scikit-learn for feature engineering and classical ML, and PyTorch for the deep learning models. - All experiments were tracked in MLflow, enabling reproducibility and systematic comparison. - Feature engineering was a critical step: temporal features (rate of change in vital signs), i → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
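A minimal MLflow tracking sketch in the spirit of this workflow; the run, parameter, and metric names are hypothetical:

```python
import mlflow

with mlflow.start_run(run_name="gbm-baseline"):
    mlflow.log_param("model", "xgboost")
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_auc", 0.91)   # logged once training finishes
```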
Key observations:
RoBERTa achieves the highest accuracy, consistent with its improved training recipe. - DistilBERT is only 0.7 percentage points behind BERT with 40% fewer parameters. - ALBERT has the lowest accuracy, likely because its shared parameters limit capacity despite having 12 layers. - T5-Small performs c → Case Study 2: Comparing Pre-trained Models on a Custom NLP Task
Always call `model.train()` before training and `model.eval()` before evaluation. This is the most common dropout-related bug (as we noted in Chapter 12 regarding batch normalization). - Dropout interacts with batch normalization. The conventional wisdom is to **not** use dropout in the same block a → Chapter 13: Regularization and Generalization
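A quick PyTorch demonstration of the mode switch: in `train()` mode dropout makes repeated forward passes stochastic, while `eval()` mode is deterministic.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(4, 16)

model.train()                            # dropout active
print(torch.equal(model(x), model(x)))   # False (almost surely)
model.eval()                             # dropout disabled
print(torch.equal(model(x), model(x)))   # True
```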
L
LAION-5B
Description: 5.85 billion image-text pairs scraped from the web. Used to train open-source vision-language models like Stable Diffusion. - License: CC BY 4.0 (metadata); images are linked, not redistributed. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
latent representation
that captures the essential factors of variation in the data. This simple idea, when extended with probabilistic reasoning (Variational Autoencoders), regularization (sparse and denoising variants), and modern training paradigms (contrastive and self-supervised learning), has become one of the pilla → Chapter 16: Autoencoders and Representation Learning
learning agility
the ability to quickly acquire new knowledge and skills as the landscape shifts. This is not an innate talent but a trainable capability: → Chapter 40: The Future of AI Engineering
LibriSpeech
URL: https://www.openslr.org/12 - Description: 1,000 hours of read English speech from audiobooks. The standard benchmark for automatic speech recognition (ASR). - License: CC BY 4.0. - Chapters: 24. - Access: `datasets.load_dataset("librispeech_asr", "clean")` → Appendix D: Data Sources and Datasets
Limitations of PCA:
Captures only linear relationships; nonlinear structure requires t-SNE, UMAP, or autoencoders - Assumes variance equals importance (not always true---a low-variance feature could be the most predictive) - Components can be difficult to interpret, especially when many features contribute to each comp → Chapter 7: Unsupervised Learning and Dimensionality Reduction
Limitations:
Behavior differs between training and evaluation modes (source of subtle bugs) - Performance degrades with very small batch sizes (batch statistics become noisy) - Not ideal for sequence models where batch statistics mix different sequence lengths → Chapter 12: Training Deep Networks
LLM-Based Generation
Description: Using large language models to generate training data, synthetic conversations, and evaluation sets. This is now the dominant approach for creating instruction-following datasets. - Key technique: Provide few-shot examples in a prompt, then sample diverse completions with temperature > → Appendix D: Data Sources and Datasets
LOF vs. Isolation Forest:
LOF is density-based and excels at detecting **local anomalies**---points that are anomalous relative to their neighborhood, even if they are in a globally dense region. - Isolation Forest is better for **global anomalies** and scales better to high dimensions. - LOF requires computing k-nearest nei → Chapter 7: Unsupervised Learning and Dimensionality Reduction
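A small scikit-learn sketch contrasting the two detectors on synthetic data with one locally anomalous point; the exact flags depend on the random data and hyperparameters:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # broad, dense cluster
               rng.normal(6, 0.2, (50, 2)),   # tight cluster
               [[6.0, 3.0]]])                 # anomaly near the tight cluster

lof = LocalOutlierFactor(n_neighbors=20).fit_predict(X)   # -1 = outlier
iso = IsolationForest(random_state=0).fit_predict(X)      # -1 = outlier
print("LOF flags last point:", lof[-1] == -1)
print("IsolationForest flags last point:", iso[-1] == -1)
```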
Classical algorithms (regression, trees, SVMs, clustering) - Feature engineering - Model evaluation and selection - Experiment design → Chapter 1: The Landscape of AI Engineering
Machine learning frameworks:
**scikit-learn**: The standard library for classical ML algorithms --- decision trees, SVMs, k-means clustering, and more. It provides a consistent API for training, evaluating, and deploying models. - **XGBoost / LightGBM**: High-performance gradient boosting libraries that dominate tabular data co → Chapter 1: The Landscape of AI Engineering
margin
the distance between the decision boundary and the nearest data points from each class. This geometric perspective leads to a model with strong theoretical guarantees and excellent practical performance. → Chapter 6: Supervised Learning: Regression and Classification
marginalization
the process of summing (or integrating) over variables we do not observe. In AI, marginalization appears whenever we compute predictions that account for all possible values of latent variables, as in mixture models and variational autoencoders (Chapter 16). → Chapter 4: Probability, Statistics, and Information Theory
Match the tokenizer to the model
never mix tokenizers from different pre-trained models. 3. **Domain-specific pre-training** (further pre-training on domain text before fine-tuning) can significantly improve results for specialized domains. 4. **Gradient accumulation** enables effective large batch sizes on limited GPU memory. 5. * → Chapter 20 Key Takeaways
Mathematical foundations (Chapters 2--6):
Linear algebra (vectors, matrices, decompositions) - Calculus (gradients, optimization) - Probability and statistics (distributions, estimation, hypothesis testing) - Optimization theory (gradient descent, convex optimization) → Chapter 1: The Landscape of AI Engineering
MMLU (Massive Multitask Language Understanding)
URL: https://github.com/hendrycks/test - Description: 57 multiple-choice tasks spanning STEM, humanities, social sciences, and more. A standard benchmark for evaluating LLM knowledge breadth. - Size: Approximately 15,000 test questions. - License: MIT. - Chapters: 15, 16, 31. → Appendix D: Data Sources and Datasets
Model Performance Metrics:
Prediction accuracy (when ground truth is available) - Prediction confidence distributions - Prediction latency (p50, p95, p99) - Throughput (requests per second) - Error rates → Chapter 34: MLOps and LLMOps
Monitoring infrastructure:
Real-time model performance tracking (sensitivity, specificity, alert rates). - Data drift detection comparing incoming patient data distributions to training data. - Alert fatigue monitoring (tracking how often clinicians acknowledged vs. dismissed alerts). - Automated retraining pipeline triggered → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Monitoring:
Save generated samples every few hundred iterations for visual inspection. - Track the discriminator's accuracy on real and fake data separately. If it reaches 100% on both, the discriminator is overpowering the generator---a common failure mode, not a good state. If it hovers near 50% on both, the generator is winning (the discriminator is at chance). If it oscillates wildly, training is → Chapter 17: Generative Adversarial Networks
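A sketch of the separate-accuracy tracking suggested above, assuming a discriminator that returns one raw logit per example (the stand-in modules and data are illustrative):

```python
import torch
import torch.nn as nn

def discriminator_accuracy(D, real, fake):
    # Fraction of real samples scored as real, and of fake samples
    # scored as fake; assumes D outputs one raw logit per example.
    with torch.no_grad():
        acc_real = (torch.sigmoid(D(real)) > 0.5).float().mean().item()
        acc_fake = (torch.sigmoid(D(fake)) < 0.5).float().mean().item()
    return acc_real, acc_fake

D = nn.Linear(784, 1)   # stand-in discriminator
print(discriminator_accuracy(D, torch.randn(64, 784), torch.randn(64, 784)))
```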
MusicCaps
Description: 5,521 music clips with text descriptions, created by Google for music understanding and generation tasks. - Chapters: 25. → Appendix D: Data Sources and Datasets
N
Natural Questions (NQ)
URL: https://ai.google.com/research/NaturalQuestions - Description: Real questions from Google Search paired with Wikipedia articles containing the answers. Includes both short and long answer annotations. - Size: 307,000+ training examples. - License: Apache 2.0. - Chapters: 20, 26. → Appendix D: Data Sources and Datasets
No probabilistic interpretation
attention weights would not form a distribution over positions. 4. **Gradient flow** would be altered --- softmax provides a natural gradient that encourages competition between positions. → Quiz: The Attention Mechanism
Numeric Features:
[ ] Check for outliers and decide on a strategy (removal, capping, robust scaling). - [ ] Apply appropriate scaling (standard, min-max, or robust) based on the model type. - [ ] Consider log or power transforms for skewed distributions. - [ ] Create domain-relevant interaction and ratio features. → Chapter 9: Feature Engineering and Data Pipelines
O
Observations from PCA:
Two components capture only about 22% of total variance, meaning the 2D PCA plot is a very lossy representation. - Some digit classes partially overlap. For example, digits 3, 5, and 8 tend to occupy similar regions. - The global layout is informative: digit 0 is far from digit 1, which makes intuit → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Observations from the perplexity study:
**Perplexity 5**: Very tight, fragmented clusters. Some digits split into sub-clusters (e.g., different writing styles of "7"). Noise dominates the layout. - **Perplexity 15**: Clusters begin to consolidate. Most digits form coherent groups, but some remain fragmented. - **Perplexity 30** (default): → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Observations from the UMAP parameter study:
**n_neighbors=5**: Very local focus. Clusters are tight and separated by large gaps. Sub-structure within clusters is visible. - **n_neighbors=15** (default): Good balance. Clusters are well-separated, and the global layout is meaningful---visually similar digits (e.g., 3, 5, 8) are placed near each → Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP
Open Assistant (OASST)
Description: 160,000+ human-annotated conversations for training assistant-style models. Includes ranking information for RLHF. - License: Apache 2.0. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
OpenAI API
Description: Access to GPT-4 and related models. Commonly used for generating synthetic training data, labels, and evaluations. - Chapters: 15, 17, 30. → Appendix D: Data Sources and Datasets
OpenML
URL: https://www.openml.org - Description: A platform for sharing machine learning datasets, tasks, and experiments. Provides standardized interfaces for benchmarking. - License: Varies per dataset. - Chapters: 3, 6. → Appendix D: Data Sources and Datasets
Optimization tips:
Use Adam with $\beta_1 = 0.0$ and $\beta_2 = 0.9$ for WGAN-GP (note: $\beta_1 = 0$, not the default 0.9). - Learning rates between $10^{-4}$ and $2 \times 10^{-4}$ work well for most architectures. - If training diverges, reduce the learning rate rather than adding regularization. - Save checkpoints → Chapter 17: Generative Adversarial Networks
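In PyTorch this is a one-line change when constructing the optimizer; the modules below are stand-ins for a real generator and critic:

```python
import torch
import torch.nn as nn

generator = nn.Linear(100, 784)   # stand-in modules
critic = nn.Linear(784, 1)

# WGAN-GP settings from the tips above: beta1 = 0.0, beta2 = 0.9.
opt_G = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_C = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))
```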
overconfident
they assign high probabilities to their predictions more often than is warranted. **Temperature scaling** (dividing logits by a learned temperature $T > 1$ before softmax) is the simplest and most effective post-hoc calibration method. → Chapter 4: Probability, Statistics, and Information Theory
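A minimal sketch of temperature scaling; in practice $T$ is fit on a held-out validation set by minimizing negative log-likelihood, which is omitted here:

```python
import torch

def calibrated_probs(logits, T):
    # Divide logits by T > 1 to soften overconfident predictions,
    # then apply softmax as usual.
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([[4.0, 1.0, 0.5]])
print(calibrated_probs(logits, T=1.0))   # original (overconfident) probabilities
print(calibrated_probs(logits, T=2.0))   # softer, better-calibrated distribution
```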
[ ] Wrap all preprocessing in a scikit-learn Pipeline. - [ ] Verify no data leakage by checking that all `fit` calls use only training data. - [ ] Use cross-validation to evaluate the impact of each feature engineering decision. - [ ] Apply feature selection to remove noise and redundancy. → Chapter 9: Feature Engineering and Data Pipelines
Point Classification:
**Core point**: Has at least `min_samples` neighbors within $\varepsilon$ - **Border point**: Within $\varepsilon$ of a core point but does not have `min_samples` neighbors itself - **Noise point**: Neither core nor border---these are the outliers → Chapter 7: Unsupervised Learning and Dimensionality Reduction
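In scikit-learn these categories fall out of `DBSCAN.fit_predict`: core and border points receive cluster labels 0, 1, ..., and noise points are labeled -1. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (10, 2))])   # scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())
```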
posterior
our updated belief about hypothesis $H$ after observing data $D$ - $P(D \mid H)$ is the **likelihood** -- the probability of observing the data under the hypothesis - $P(H)$ is the **prior** -- our belief about $H$ before seeing data - $P(D)$ is the **evidence** (or marginal likelihood) -- a normali → Chapter 4: Probability, Statistics, and Information Theory
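Assembled into the formula these four terms come from:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$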
including deduplication, quality filtering, and domain balance --- is as important as model architecture for achieving strong performance. - **Parameter-efficient fine-tuning** methods (LoRA, adapters, prefix tuning) enable adapting large models to new tasks while updating less than 1% of parameters → Chapter 20: Pre-training and Transfer Learning for NLP
Python (the lingua franca of AI) - Software design patterns and best practices - Version control (Git) - Testing and debugging - API design → Chapter 1: The Landscape of AI Engineering
Properties:
Output range: (0, 1), useful for probabilities - Smooth and differentiable everywhere - Saturates for large |*z*|, causing the *vanishing gradient problem* - Output is not zero-centered, which can slow convergence → Chapter 11: Neural Networks from Scratch
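For reference, the function and its derivative; the bound $\sigma'(z) \le 1/4$, with $\sigma'(z) \to 0$ as $|z|$ grows, is exactly the saturation behind the vanishing gradient problem:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}$$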
PyTorch
The deep learning framework whose eager execution model and Pythonic design philosophy make it the ideal teaching tool - **HuggingFace** — For democratizing access to transformer models through the Transformers, Tokenizers, Datasets, and PEFT libraries - **NumPy and SciPy** — The bedrock of scientif → Acknowledgments
Q
Quarter 1: Deep Learning and Transformers
Moved from scikit-learn to PyTorch for daily work - Studied transformer architecture, attention mechanisms, scaling laws - Fine-tuned a model for an internal NLP task (replacing an older sklearn pipeline) - Milestone: Delivered 8% accuracy improvement using transformer model → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 1: Deep Learning Foundations
Completed fast.ai course and PyTorch tutorials - Built 3 projects: image classifier, text classifier, fine-tuned BERT - Studied transformer architecture in depth (attention, positional encoding) - Milestone: Reproduced a simplified GPT-2 training run → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 1: ML Fundamentals Depth
Completed Stanford CS229 (online) for mathematical foundations - Studied PyTorch deeply: custom datasets, training loops, distributed training - Leveraged existing distributed systems knowledge to understand DDP and FSDP - Milestone: Trained a model on 4 GPUs using PyTorch DDP → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: LLMs and Applications
Studied LLM capabilities, prompt engineering, RAG architecture - Led a cross-functional project to evaluate LLM integration opportunities - Built proof-of-concept for three internal use cases - Milestone: Secured executive buy-in for LLM pilot project → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: ML Infrastructure
Studied MLOps: experiment tracking (W&B), model serving (TorchServe, TGI) - Learned about GPU profiling, memory optimization, quantization - Built an internal tool for experiment comparison at current company - Milestone: Reduced model serving latency by 40% at work → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 2: Production ML
Learned Docker, Kubernetes basics, CI/CD - Built an end-to-end ML pipeline: data processing → training → API serving - Contributed to an open-source ML project (documentation + bug fix) - Milestone: Deployed a model behind a FastAPI endpoint → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: AI Engineering Leadership
Studied AI safety, responsible AI, evaluation frameworks - Developed team evaluation standards for LLM-powered features - Mentored two junior engineers in deep learning - Milestone: Hired and onboarded two AI engineers for the pilot team → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: LLM Engineering
Studied RAG, prompt engineering, fine-tuning with LoRA - Built a RAG chatbot for a personal project (cooking recipes) - Learned evaluation methodologies (RAGAS, human evaluation) - Milestone: Published a blog post about RAG evaluation → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 3: LLM Infrastructure
Studied inference optimization: KV caching, speculative decoding, vLLM - Learned about distributed inference for large models - Contributed to an open-source inference framework - Milestone: Deployed a production LLM serving pipeline → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Job Search + Advanced Topics
Studied system design for ML (Chapter 33) - Practiced ML system design interviews - Explored reinforcement learning basics - Milestone: Received and accepted a junior ML engineer offer → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Specialization + Transition
Deep dive into distributed training (FSDP, DeepSpeed) - Built a training framework prototype for their team - Published an internal tech talk on efficient LLM serving - Milestone: Transitioned to ML infrastructure engineer role (internal transfer) → Case Study 2: Building a Career Roadmap in AI Engineering
Quarter 4: Strategy and Scale
Developed AI engineering roadmap for the product org - Established MLOps practices: experiment tracking, monitoring, A/B testing - Presented AI strategy to C-suite - Milestone: Promoted to AI Engineering Lead → Case Study 2: Building a Career Roadmap in AI Engineering
R
Recipe 2: Training a Transformer for NLP
Optimizer: AdamW with betas (0.9, 0.98), weight decay 0.01 - Learning rate: Peak 5e-4, linear warmup for 4,000 steps, then cosine decay - Batch size: Effective 256--2048 (with gradient accumulation) - Gradient clipping: max_norm=1.0 - Dropout: 0.1 on attention and feed-forward layers - Mixed precisi → Chapter 12: Training Deep Networks
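A PyTorch sketch of this recipe's optimizer and schedule; the model is a stand-in and the total step count is an assumed value:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)   # stand-in for a Transformer
opt = torch.optim.AdamW(model.parameters(), lr=5e-4,
                        betas=(0.9, 0.98), weight_decay=0.01)

warmup_steps, total_steps = 4_000, 100_000
def lr_lambda(step):
    if step < warmup_steps:                      # linear warmup to peak LR
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = LambdaLR(opt, lr_lambda)
# In the training loop: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0);
# opt.step(); scheduler.step(); opt.zero_grad()
```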
Recipe 3: Fine-Tuning a Pretrained Model
Optimizer: AdamW with weight decay 0.01 - Learning rate: 1e-5 to 5e-5 for pretrained layers, 10x higher for new head - Warmup: 100--500 steps - Epochs: 3--10 (much less than training from scratch) - Gradient clipping: max_norm=1.0 - Use parameter groups to set different learning rates for backbone v → Chapter 12: Training Deep Networks
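A sketch of the parameter-group idiom from the last point, with illustrative stand-in modules and learning rates:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(768, 768)   # stand-in for the pretrained layers
head = nn.Linear(768, 2)         # newly initialized task head

# Pretrained weights get a small LR; the new head gets ~10x more.
opt = torch.optim.AdamW(
    [{"params": backbone.parameters(), "lr": 2e-5},
     {"params": head.parameters(), "lr": 2e-4}],
    weight_decay=0.01,
)
```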
RedPajama
Description: An open reproduction of the LLaMA training data, consisting of approximately 1.2 trillion tokens from Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and StackExchange. - License: Apache 2.0. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
relative position bias
a learnable bias term added to the attention logits based on the relative spatial distance between tokens, parameterized by a table of size $(2M-1) \times (2M-1)$. This differs from ViT's **absolute position embeddings**, which are learnable vectors added to each patch embedding based on its absolut → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
Reporting accuracy on imbalanced data
Use F1, AUC-PR, or recall at fixed precision instead. 2. **Fitting the scaler on all data before splitting** -- Fit on training data only. 3. **Using standard k-fold for time series** -- Use TimeSeriesSplit instead. 4. **Reporting a single number without uncertainty** -- Always include standard devi → Chapter 8: Key Takeaways
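For pitfall 3, `TimeSeriesSplit` keeps every training fold strictly earlier in time than its test fold; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training indices always precede test indices,
    # so no future information leaks into training.
    print("train:", train_idx, "test:", test_idx)
```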
rule-based architecture
the direct descendant of the expert systems described in Section 1.1.1 of this chapter. Over 4,500 hand-crafted if-then rules, developed in collaboration with emergency medicine physicians, encode clinical knowledge about symptom patterns, risk factors, vital sign thresholds, and diagnostic protocol → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Rules of thumb:
Start with K-means for a quick baseline - Use DBSCAN when you expect non-convex clusters or significant noise - Use GMMs when you need probabilistic assignments or ellipsoidal clusters - Use hierarchical clustering when you want to explore multiple granularity levels - Use spectral clustering when c → Chapter 7: Unsupervised Learning and Dimensionality Reduction
S
scale
more parameters, more data, and more compute. The architecture itself changed remarkably little. As we will explore in Chapter 22, the scaling laws that govern this relationship are among the most important empirical findings in modern AI. → Chapter 21: Decoder-Only Models and Autoregressive Language Models
Scoring Guide:
★ Foundational (5-10 min each) - ★★ Intermediate (10-20 min each) - ★★★ Challenging (20-40 min each) - ★★★★ Advanced/Research (40+ min each) → Exercises: Python for AI Engineering
Scrapy
Description: Python framework for building web scrapers. Useful for collecting domain-specific text data. - Chapters: 28. → Appendix D: Data Sources and Datasets
SDV (Synthetic Data Vault)
Description: Comprehensive library for generating synthetic relational, tabular, and time-series data using statistical and deep learning models. - Usage: `pip install sdv` - Chapters: 6, 28. → Appendix D: Data Sources and Datasets
self-attention
$\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: all derived from the *same* input sequence, typically through learned linear projections: → Chapter 18: The Attention Mechanism
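In matrix form, with $\mathbf{X}$ the matrix of input token representations (the projection names follow the usual convention):

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$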
Self-contained examples
each code file can be run independently - **Progressive framework usage:** NumPy (Ch 1–10) → PyTorch (Ch 11+) → HuggingFace (Ch 20+) → How to Use This Book
Self-Instruct / Evol-Instruct
Description: Methods for using a strong LLM to generate instruction-response pairs, optionally evolving them to increase complexity. Used to create datasets like Alpaca and WizardLM. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
sequence
it has no notion of order. Positional encodings inject position information into the token representations, breaking this permutation symmetry and allowing the model to distinguish between different orderings of the same tokens. → Quiz: The Attention Mechanism
Setup:
Input: **x** (shape: n_0 x 1) - Hidden layer: **W**^[1] (shape: n_1 x n_0), **b**^[1] (shape: n_1 x 1) - Output layer: **W**^[2] (shape: 1 x n_1), *b*^[2] (scalar) → Chapter 11: Neural Networks from Scratch
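A NumPy forward pass matching these shapes; the tanh hidden activation and the initialization scale are illustrative choices:

```python
import numpy as np

n_0, n_1 = 4, 8
x = np.random.randn(n_0, 1)             # input, shape (n_0, 1)
W1 = np.random.randn(n_1, n_0) * 0.1    # (n_1, n_0)
b1 = np.zeros((n_1, 1))                 # (n_1, 1)
W2 = np.random.randn(1, n_1) * 0.1      # (1, n_1)
b2 = 0.0                                # scalar

a1 = np.tanh(W1 @ x + b1)               # hidden activation, (n_1, 1)
y_hat = W2 @ a1 + b2                    # output, (1, 1)
print(a1.shape, y_hat.shape)
```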
Skills Applied:
NumPy vectorized computation for feature engineering - pandas DataFrame operations for data cleaning and aggregation - matplotlib/seaborn visualization for exploratory data analysis - Method chaining for readable transformation pipelines - Code organization with type hints and docstrings → Case Study: Building an End-to-End Data Analysis Pipeline
Built an ETL pipeline using Apache Spark to normalize data from multiple hospital systems into a common FHIR-based format. - Implemented a data quality monitoring system that flagged anomalies and missing data patterns. - Established a de-identification pipeline that removed protected health informa → Case Study 1: From Rule-Based Systems to Deep Learning at a Healthcare Company
Spatial reduction attention
it reduces the spatial resolution of keys and values by a factor $R$ before computing attention, changing the complexity from $O(N^2)$ to $O(N^2/R)$. (2) **Hierarchical architecture** — it produces multi-scale features at 1/4, 1/8, 1/16, and 1/32 resolution, so attention at higher levels operates on → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
Speaker Verification:
**Equal Error Rate (EER)**: The point where false acceptance rate equals false rejection rate - **minDCF**: Minimum Detection Cost Function → Chapter 29: Speech, Audio, and Music AI
Specialized libraries:
**Hugging Face Transformers**: The de facto standard library for working with pre-trained transformer models. - **LangChain / LlamaIndex**: Frameworks for building applications powered by large language models. - **OpenCV**: The standard library for computer vision tasks. → Chapter 1: The Landscape of AI Engineering
Speech Recognition:
**Word Error Rate (WER)**: $\frac{S + D + I}{N}$ where $S$ = substitutions, $D$ = deletions, $I$ = insertions, $N$ = total reference words - **Character Error Rate (CER)**: Same as WER but at character level → Chapter 29: Speech, Audio, and Music AI
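A worked example using the `jiwer` library (an assumed tooling choice): one deletion out of six reference words gives WER = 1/6.

```python
# pip install jiwer
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"        # one deletion: S=0, D=1, I=0, N=6
print(jiwer.wer(reference, hypothesis))  # (S + D + I) / N = 1/6 ≈ 0.167
```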
SQuAD (Stanford Question Answering Dataset)
URL: https://rajpurkar.github.io/SQuAD-explorer/ - Description: SQuAD 1.1 contains 100,000+ question-answer pairs based on Wikipedia passages (extractive QA). SQuAD 2.0 adds 50,000+ unanswerable questions. - License: CC BY-SA 4.0. - Chapters: 10, 12, 20. - Access: `datasets.load_dataset("squad")` → Appendix D: Data Sources and Datasets
Stable Diffusion / SDXL
Description: Open-source text-to-image models that can generate synthetic training images. Useful for data augmentation in computer vision. - Chapters: 22, 23. → Appendix D: Data Sources and Datasets
Stage 1: Feature Alignment Pre-training
Data: 595K image-text pairs from CC3M (filtered) - Only the projection layer $\mathbf{W}$ is trained; both the vision encoder and LLM are frozen - Objective: Image captioning (generate the caption given the image) - This stage teaches the projection layer to translate visual features into the LLM's → Chapter 28: Multimodal Models and Vision-Language AI
Stage 2: Visual Instruction Tuning
Data: 158K multimodal instruction-following examples generated using GPT-4 - The projection layer and the LLM are trained; the vision encoder remains frozen - Data includes conversations, detailed descriptions, and complex reasoning questions - This stage teaches the model to follow multimodal instr → Chapter 28: Multimodal Models and Vision-Language AI
Step 1: Calibration Data Preparation
Selected 256 representative conversations from production logs. - Ensured coverage of all financial advisory topics: portfolio allocation, tax planning, retirement, risk assessment. - Tokenized to match the model's expected input format. → Case Study 1: Deploying a Quantized LLM with vLLM
Structured reading habits:
**Daily arXiv scan**: Use tools like arXiv Sanity, Semantic Scholar alerts, or Papers With Code to surface relevant new papers. Aim to skim 5-10 abstracts daily and deep-read 1-2 papers weekly. - **Conference proceedings**: The top venues (NeurIPS, ICML, ICLR, ACL, CVPR) publish proceedings freely. → Chapter 40: The Future of AI Engineering
SuperGLUE
URL: https://super.gluebenchmark.com - Description: A more challenging successor to GLUE, including tasks like BoolQ (yes/no QA), CB (textual entailment with three classes), MultiRC (multi-sentence reading comprehension), and WiC (word-in-context). - Size: Varies; generally smaller than GLUE tasks. → Appendix D: Data Sources and Datasets
superposed
they represent multiple features simultaneously in overlapping directions. A sparse autoencoder with a large overcomplete basis can disentangle these superposed features into individual, interpretable units. → Chapter 16: Autoencoders and Representation Learning
system prompt
a special instruction that defines the model's persona, capabilities, constraints, and behavioral guidelines. The system prompt is typically prepended to every conversation and is treated as higher-priority than user instructions. → Chapter 22: Scaling Laws and Large Language Models
a layered architecture of hardware, software, and services that work together to enable training, serving, and monitoring of AI models. Understanding this stack is essential for AI engineers, who must make informed decisions at every layer. → Chapter 1: The Landscape of AI Engineering
Temporal Features:
[ ] Extract all relevant time components (hour, day, month, day of week). - [ ] Apply cyclical encoding for periodic features. - [ ] Create lag, rolling, and expanding window features for time series. - [ ] Verify that no future information leaks into historical features. → Chapter 9: Feature Engineering and Data Pipelines
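The cyclical-encoding item maps each periodic value onto a circle so that, for example, hour 23 sits next to hour 0; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": np.arange(24)})
# sin/cos pair places hour 23 adjacent to hour 0 in feature space.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```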
temporally overlapping positive pairs
since video narrations are loosely aligned with visual content, VideoCLIP samples overlapping (but not identical) time windows for positive pairs, creating a softer contrastive signal that handles the inherent temporal misalignment in narrated video. → Chapter 30: Video Understanding and Generation
Text Features:
[ ] Preprocess text (lowercasing, removing noise, tokenizing). - [ ] Choose between BoW, TF-IDF, and embeddings based on the task and data volume. - [ ] Consider n-grams for capturing multi-word patterns. → Chapter 9: Feature Engineering and Data Pipelines
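A minimal sketch of the TF-IDF-with-n-grams option using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the model overfits", "the model generalizes well"]
vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(docs)                 # sparse TF-IDF matrix
print(X.shape, vec.get_feature_names_out()[:5])
```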
The convergence trajectory:
**2020-2022**: Separate models for each modality (GPT-3 for text, DALL-E for images, Whisper for audio). - **2023-2024**: Multimodal models that handle two or three modalities (GPT-4V for text+images, Gemini for text+images+video). - **2025+**: Omni-modal models that natively process and generate al → Chapter 40: The Future of AI Engineering
The current landscape (2025):
*Closed frontier*: Models from Anthropic (Claude), OpenAI (GPT-4, o3), and Google (Gemini) remain the most capable, particularly for complex reasoning tasks. - *Open-weight frontier*: Meta's Llama 3.1 (405B), DeepSeek-V3, Mistral Large, and others provide openly available weights that match or appro → Chapter 40: The Future of AI Engineering
The Pile
Description: An 800GB curated dataset combining 22 high-quality sub-datasets (academic papers, books, code, etc.) designed for LLM pre-training. - License: MIT (compilation); individual components vary. - Chapters: 14, 15. → Appendix D: Data Sources and Datasets
Token pruning/merging
reduce sequence length by removing or combining less important tokens. (4) **Linear attention** — approximate softmax attention with kernel methods for $O(Nd^2)$ cost. (5) **FlashAttention** — IO-aware implementation that doesn't reduce FLOPs but dramatically improves wall-clock time and memory usag → Chapter 26: Quiz — Vision Transformers and Modern Computer Vision
tokenizer
the algorithm that converts raw text into the subword units the model processes. Unlike traditional word-level tokenization, modern approaches operate at the **subword** level, balancing vocabulary size with the ability to represent any text. → Chapter 20: Pre-training and Transfer Learning for NLP
translation invariance
small shifts in the input lead to the same (or very similar) max values. This complements the translation *equivariance* of the convolution operation itself. → Chapter 14: Convolutional Neural Networks
TriviaQA
Description: 650,000 question-answer-evidence triples gathered from trivia and quiz-league websites. - Chapters: 20, 26. → Appendix D: Data Sources and Datasets
TTS:
**Mean Opinion Score (MOS)**: Subjective human rating on a 1-5 scale - **Mel Cepstral Distortion (MCD)**: Objective measure of spectral distance - **PESQ/POLQA**: Perceptual evaluation of speech quality → Chapter 29: Speech, Audio, and Music AI
Typical findings for BERT-like models:
**Layer 0 (embeddings)**: Encodes surface features (word identity, position). POS tagging probes already achieve moderate accuracy. - **Layers 1-4**: Syntactic information (POS tags, dependency relations, constituency) is maximally represented. Probing accuracy for syntactic tasks peaks in this rang → Chapter 38: Interpretability, Explainability, and Mechanistic Understanding
U
UCI Machine Learning Repository
URL: https://archive.ics.uci.edu - Description: A long-standing collection of 600+ datasets for machine learning research. Includes classics like Iris, Wine, Adult Census, and Boston Housing. - License: Varies; most are freely available for research. - Chapters: 3, 4, 5, 6. → Appendix D: Data Sources and Datasets
UltraFeedback
Description: 64,000 instructions with responses from multiple models, scored on helpfulness, honesty, instruction following, and truthfulness. - Chapters: 16, 17. → Appendix D: Data Sources and Datasets
Understand the domain
What do the raw features mean? What relationships might exist? 2. **Transform features** --- Apply mathematical transformations, encode categories, extract temporal patterns. 3. **Create new features** --- Combine existing features, compute aggregates, derive domain-specific indicators. 4. **Select → Chapter 9: Feature Engineering and Data Pipelines
Use BERT when:
Your task has a fixed output format (classification, token labeling, span extraction). - You need fast inference (encoder-only models are faster than encoder-decoder for classification). - Your fine-tuning dataset is small and you want maximum parameter efficiency. → Chapter 20: Pre-training and Transfer Learning for NLP
Use PyTorch when:
Building production systems - Experimenting with architectures - Training on GPUs - Using pre-built layers, losses, and optimizers - Collaborating with others (PyTorch is the de facto standard in research) → Case Study 2: Comparing NumPy and PyTorch Implementations
Use T5 when:
Your task requires generating variable-length text output (summarization, translation, open-ended QA). - You want to use a single model architecture for multiple different tasks. - You want to frame new tasks flexibly --- just choose a new text prefix. - You are comfortable with the slightly higher → Chapter 20: Pre-training and Transfer Learning for NLP
Use the NumPy approach when:
Learning the fundamentals (this chapter) - Debugging mysterious gradient behavior - Teaching or explaining neural networks - Implementing custom operations not available in PyTorch → Case Study 2: Comparing NumPy and PyTorch Implementations
V
Variants of activation patching:
**Resample ablation**: Replace the activation with a sample from its empirical distribution (rather than from a specific corrupted input), measuring the component's overall importance. - **Path patching**: Patch activations along specific computational paths (e.g., from one attention head to a speci → Chapter 38: Interpretability, Explainability, and Mechanistic Understanding
Visual Question Answering (VQA)
URL: https://visualqa.org - Description: 265,000 images with 760,000+ questions and 10 ground-truth answers each. - Chapters: 23. → Appendix D: Data Sources and Datasets
VoxCeleb / VoxCeleb2
Description: Speaker recognition datasets containing speech from thousands of celebrities. VoxCeleb2 has over 1 million utterances from 6,000+ speakers. - Chapters: 24. → Appendix D: Data Sources and Datasets
W
Warmup strategies:
**Linear warmup** (most common): The learning rate increases linearly from 0 to the target - **Exponential warmup**: The learning rate increases exponentially, spending more time at low rates - **Gradual warmup** (Goyal et al., 2017): For very large batch training, warmup over 5--10 epochs → Chapter 12: Training Deep Networks
directly into the architecture. These biases dramatically reduce parameter count while making networks equivariant to spatial translations. CNNs have dominated computer vision for over a decade, and their principles extend to audio, text, and scientific computing. → Chapter 14 Key Takeaways
What to look for:
**Activation means** should be near zero (for zero-centered activations like tanh) or near 0.5 times the standard deviation (for ReLU) - **Activation standard deviations** should be roughly constant across layers (not growing or shrinking) - **Dead fraction** (fraction of neurons outputting exactly → Chapter 12: Training Deep Networks
When NOT to use it:
During actual training (far too slow---requires 2 forward passes per parameter) - In production code - On very large networks (impractical) → Quiz: Neural Networks from Scratch
When to use it:
When implementing backpropagation from scratch, to verify your analytical gradients are correct - When debugging a custom layer or loss function - As a one-time verification step during development → Quiz: Neural Networks from Scratch
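A minimal central-difference checker, which also shows why this is verification-only: two function evaluations per parameter. The quadratic test function is illustrative.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    # Central differences: two evaluations of f per parameter.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Verify the analytical gradient of f(w) = ||w||^2, which is 2w.
w = np.random.randn(3, 2)
analytic = 2 * w
numeric = numerical_grad(lambda v: np.sum(v ** 2), w)
print(np.max(np.abs(analytic - numeric)))   # should be tiny (~1e-9)
```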
When to Use PCA:
**Preprocessing for supervised learning**: Reduce feature dimensionality before training a classifier or regressor (Chapters 5-6). This can reduce overfitting and speed up training. - **Visualization**: Project high-dimensional data to 2D or 3D for exploration. - **Noise reduction**: Discarding low- → Chapter 7: Unsupervised Learning and Dimensionality Reduction
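A sketch of the preprocessing use case with scikit-learn, keeping enough components to explain 95% of the variance before classification:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_components in (0, 1) selects enough components for that
# fraction of explained variance.
clf = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                    LogisticRegression(max_iter=1000))
print(clf.fit(X_tr, y_tr).score(X_te, y_te))
```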
When to use which:
**Mean:** When data is approximately normally distributed and MCAR. - **Median:** When data is skewed or contains outliers (preferred default). - **Most frequent:** For categorical features. - **Constant:** When you want the model to learn that "missing" is a distinct category. → Chapter 9: Feature Engineering and Data Pipelines
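All four strategies map onto scikit-learn's `SimpleImputer`; a minimal sketch of two of them on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])   # skewed, with an outlier

median_imp = SimpleImputer(strategy="median")     # robust default
constant_imp = SimpleImputer(strategy="constant", fill_value=-1)
print(median_imp.fit_transform(X).ravel())        # NaN -> 2.0 (median)
print(constant_imp.fit_transform(X).ravel())      # NaN -> -1.0
```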
X
XSum (Extreme Summarization)
Description: 227,000 BBC news articles with single-sentence summaries. Tests models' ability to generate highly abstractive summaries. - Access: `datasets.load_dataset("xsum")` - Chapters: 12, 15. → Appendix D: Data Sources and Datasets
Z
zero breaks symmetry
otherwise all neurons compute identical gradients and never differentiate. - Biases are almost always initialized to zero. - Proper initialization keeps activation variance and gradient variance stable across layers. → Chapter 12 Key Takeaways