Appendix E: Glossary

This glossary defines key terms used throughout this textbook, organized alphabetically. Each entry includes the chapter(s) where the term is most prominently discussed and cross-references to related terms.

A/B Testing (Chapter 31) : A controlled experiment in which users are randomly assigned to one of two (or more) variants to measure the causal effect of a change on a metric of interest. See also: Statistical Significance, Confidence Interval.

Ablation Study (Chapter 31) : An experiment in which components of a model or system are systematically removed or disabled to measure their individual contributions to overall performance. See also: Hyperparameter Tuning.

Activation Function (Chapter 7) : A non-linear function applied element-wise to the output of a neural network layer, enabling the network to learn non-linear mappings. Common examples include ReLU, GELU, sigmoid, and tanh. See also: ReLU, GELU, Sigmoid.

Adam Optimizer (Chapter 7) : An adaptive learning rate optimizer that maintains per-parameter estimates of the first and second moments of the gradient. AdamW is a variant with decoupled weight decay. See also: Stochastic Gradient Descent, Learning Rate.

Adapter (Chapter 16) : A small trainable module inserted into a frozen pre-trained model, enabling parameter-efficient fine-tuning. See also: LoRA, Parameter-Efficient Fine-Tuning.

Alignment (Chapter 17) : The process of training an AI model to behave in accordance with human values, intentions, and instructions. Techniques include RLHF and DPO. See also: RLHF, DPO, Constitutional AI.

Annotator Agreement (Chapter 28) : The degree to which independent human annotators produce the same labels for the same examples. Measured by metrics like Cohen's kappa or Fleiss' kappa. See also: Inter-Rater Reliability.

Attention Mechanism (Chapter 11) : A neural network component that computes a weighted combination of value vectors, where the weights are determined by the compatibility between query and key vectors. See also: Self-Attention, Multi-Head Attention, Cross-Attention.
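
An illustrative sketch in NumPy (assumed shapes: queries Q of (n, d); keys and values K, V of (m, d)):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Compatibility scores between each query and each key, scaled by sqrt(d)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # Normalize scores into attention weights with a stable softmax
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # The output is a weighted combination of the value vectors
        return weights @ V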

Autoencoder (Chapter 14) : A neural network trained to reconstruct its input, consisting of an encoder that maps input to a latent representation and a decoder that reconstructs the input from that representation. See also: Variational Autoencoder, Latent Space.

Autograd (Chapter 7) : Automatic differentiation engine (as in PyTorch's autograd) that computes gradients by recording operations on tensors and replaying them in reverse. See also: Backpropagation, Computational Graph.
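
For instance, in PyTorch:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3 + 2 * x   # operations on x are recorded as they run
    y.backward()         # replayed in reverse to compute dy/dx
    print(x.grad)        # tensor(14.) = 3 * x**2 + 2 at x = 2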

Autoregressive Model (Chapter 14) : A model that generates output one element at a time, conditioning each element on all previously generated elements. GPT-style language models are autoregressive. See also: Causal Language Model, Teacher Forcing.

Backpropagation (Chapter 7) : An algorithm for computing the gradient of a loss function with respect to all parameters of a neural network by applying the chain rule of calculus from output to input. See also: Chain Rule, Computational Graph, Autograd.

Batch Normalization (Chapter 8) : A technique that normalizes the inputs to a layer across the mini-batch, often stabilizing and accelerating training (originally motivated as reducing internal covariate shift). See also: Layer Normalization, RMSNorm.

Batch Size (Chapter 7) : The number of training examples processed simultaneously in one forward-backward pass. Larger batch sizes provide more stable gradient estimates but require more memory. See also: Mini-Batch Gradient Descent, Gradient Accumulation.

Bayes' Theorem (Chapter 5) : A fundamental result relating conditional probabilities: P(A|B) = P(B|A) * P(A) / P(B). The foundation of Bayesian inference. See also: Prior, Posterior, Likelihood.
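
A small worked example with illustrative numbers (a rare condition and an imperfect test):

    # Prevalence 1%, test sensitivity 90%, false-positive rate 5% (illustrative)
    p_d = 0.01
    p_pos_given_d = 0.90
    p_pos_given_not_d = 0.05
    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)  # P(+) by total probability
    p_d_given_pos = p_pos_given_d * p_d / p_pos                  # posterior P(D|+) ~= 0.154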

Beam Search (Chapter 15) : A decoding strategy that maintains the top-k most likely partial sequences (the beam) at each generation step, exploring more of the search space than greedy decoding at the cost of extra computation. See also: Greedy Decoding, Nucleus Sampling.

BERT (Chapter 12) : Bidirectional Encoder Representations from Transformers. A pre-trained transformer encoder model that learns contextual word representations using masked language modeling and next sentence prediction. See also: Masked Language Model, Transformer Encoder.

BERTScore (Chapter 31) : An evaluation metric for text generation that computes similarity using contextual embeddings rather than exact n-gram matching. See also: BLEU Score, ROUGE.

Bias (Statistical) (Chapter 5) : The systematic error introduced by an estimator's tendency to over- or under-estimate a parameter. See also: Variance, Bias-Variance Tradeoff.

Bias (Neural Network) (Chapter 7) : A learnable parameter added to the weighted sum of inputs in a neuron, allowing the activation function's input to be shifted. See also: Weight, Affine Transformation.

Bias-Variance Tradeoff (Chapter 6) : The fundamental tension between error from overly restrictive model assumptions (bias) and error from sensitivity to fluctuations in the training data (variance); reducing one typically increases the other. See also: Overfitting, Underfitting, Regularization.

BLEU Score (Chapter 15) : Bilingual Evaluation Understudy. A precision-based metric for evaluating generated text against reference text, based on n-gram overlap. See also: ROUGE, BERTScore.

Causal Language Model (Chapter 14) : A language model trained to predict the next token given all preceding tokens. Also called a left-to-right or autoregressive language model. GPT is the canonical example. See also: Autoregressive Model, Masked Language Model.

Chain of Thought (CoT) (Chapter 19) : A prompting technique that encourages a language model to produce intermediate reasoning steps before arriving at a final answer, improving performance on complex tasks. See also: Prompt Engineering, Few-Shot Learning.

Chunking (Chapter 26) : The process of splitting documents into smaller segments (chunks) for processing by embedding models and retrieval systems. Chunk size and overlap are important hyperparameters. See also: RAG, Embedding.

Classification (Chapter 4) : A supervised learning task in which the model predicts a discrete label for each input. See also: Regression, Multi-Class Classification, Binary Classification.

Classifier-Free Guidance (Chapter 22) : A technique in diffusion models that combines conditional and unconditional predictions, extrapolating beyond the conditional prediction to control the strength of conditioning, improving sample quality at the cost of diversity. See also: Diffusion Model.

Compute Budget (Chapter 35) : The total computational resources (measured in FLOPs, GPU-hours, or dollars) allocated for a training run or deployment. See also: Scaling Laws, Chinchilla Scaling.

Conformal Prediction (Chapter 18) : A framework for constructing prediction sets with guaranteed coverage probability under minimal distributional assumptions. See also: Uncertainty Quantification, Calibration.

Constitutional AI (Chapter 17) : An alignment approach developed by Anthropic in which an AI system is guided by a set of explicit principles (a "constitution") during training. See also: Alignment, RLHF.

Contrastive Learning (Chapter 13) : A self-supervised learning approach that trains representations by pulling similar (positive) pairs closer and pushing dissimilar (negative) pairs apart in embedding space. See also: SimCLR, CLIP, Self-Supervised Learning.

Cosine Similarity (Chapter 10) : A similarity metric between two vectors computed as their dot product divided by the product of their norms: cos(theta) = (a . b) / (||a|| * ||b||). Values range from -1 (opposite) to 1 (identical). See also: Embedding, Dot Product.
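
Equivalently, in NumPy:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0]))  # ~0.707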

Cross-Attention (Chapter 11) : An attention mechanism where queries come from one sequence and keys/values come from another, enabling information flow between different input modalities or sequences. See also: Self-Attention, Encoder-Decoder.

Cross-Entropy Loss (Chapter 4) : A loss function that measures the difference between predicted probabilities and true labels. For classification, it equals the negative log-probability assigned to the correct class. See also: Softmax, Maximum Likelihood Estimation.
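
For a single classification example, a minimal NumPy sketch:

    import numpy as np

    def cross_entropy(predicted_probs, true_class):
        # Negative log-probability assigned to the correct class
        return -np.log(predicted_probs[true_class])

    cross_entropy(np.array([0.7, 0.2, 0.1]), true_class=0)  # ~0.357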

Cross-Validation (Chapter 6) : A resampling technique for estimating model performance by partitioning data into k folds, training on k-1 folds, and evaluating on the held-out fold, rotating through all folds. See also: Train/Test Split, Overfitting.

Data Augmentation (Chapter 8) : Techniques for artificially expanding training data by applying label-preserving transformations (e.g., image rotation, text paraphrasing). See also: Regularization, Overfitting.

Data Parallelism (Chapter 35) : A distributed training strategy where each device holds a complete model replica and processes a different mini-batch, synchronizing gradients across devices. See also: Model Parallelism, Pipeline Parallelism.

Decoder (Chapter 11) : In a transformer architecture, the component that generates output tokens autoregressively using causal (masked) self-attention and cross-attention to encoder outputs. See also: Encoder, Encoder-Decoder.

Diffusion Model (Chapter 22) : A generative model that learns to reverse a gradual noising process, generating data by iteratively denoising from pure noise. See also: DDPM, Stable Diffusion, Classifier-Free Guidance.

Dimensionality Reduction (Chapter 6) : Techniques for reducing the number of features while preserving important structure. Examples include PCA, t-SNE, and UMAP. See also: PCA, t-SNE, UMAP.

Distillation (Chapter 33) : See Knowledge Distillation.

Distributed Training (Chapter 35) : Training a model across multiple devices (GPUs/TPUs) or machines to reduce wall-clock time or handle models too large for a single device. See also: Data Parallelism, Model Parallelism, FSDP.

DPO (Direct Preference Optimization) (Chapter 17) : An alignment method that directly optimizes a language model from human preference data without training a separate reward model, serving as a simpler alternative to RLHF. See also: RLHF, Alignment.

Dropout (Chapter 7) : A regularization technique that randomly sets a fraction of neuron activations to zero during training, preventing co-adaptation and reducing overfitting. See also: Regularization, Overfitting.

Embedding (Chapter 10) : A dense, continuous vector representation of a discrete object (word, sentence, image, etc.) in a learned vector space. See also: Word2Vec, Sentence Embedding, Cosine Similarity.

Encoder (Chapter 11) : In a transformer architecture, the component that processes the full input sequence bidirectionally to produce contextual representations. BERT is an encoder-only model. See also: Decoder, Encoder-Decoder.

Encoder-Decoder (Chapter 11) : A model architecture with separate encoder and decoder components, connected by cross-attention. T5 and BART are encoder-decoder transformers. See also: Encoder, Decoder, Seq2Seq.

Epoch (Chapter 7) : One complete pass through the entire training dataset. See also: Batch Size, Iteration.

Evaluation Metric (Chapter 31) : A quantitative measure used to assess model performance. Task-specific examples include accuracy, F1 score, BLEU, ROUGE, and perplexity. See also: Loss Function, Benchmark.

Feature Engineering (Chapter 6) : The process of creating, selecting, or transforming input features to improve model performance. Less critical for deep learning than for traditional ML, but still important for tabular data. See also: Feature Selection, Dimensionality Reduction.

Few-Shot Learning (Chapter 19) : Learning to perform a task from only a few examples, typically provided in the prompt (in-context learning) or used for rapid adaptation. See also: Zero-Shot Learning, In-Context Learning, Meta-Learning.

Fine-Tuning (Chapter 12) : Adapting a pre-trained model to a specific downstream task by continuing training on task-specific data. See also: Transfer Learning, Pre-Training, LoRA.

Flash Attention (Chapter 36) : An I/O-aware attention implementation that computes exact attention with reduced memory footprint and improved speed by tiling the computation to avoid materializing the full attention matrix. See also: Attention Mechanism, Memory Efficiency.

FSDP (Fully Sharded Data Parallel) (Chapter 35) : A distributed training strategy that shards model parameters, gradients, and optimizer states across devices, gathering them only when needed for computation. See also: Data Parallelism, Model Parallelism.

GAN (Generative Adversarial Network) (Chapter 14) : A generative model consisting of a generator and discriminator trained adversarially: the generator tries to produce realistic data while the discriminator tries to distinguish real from generated data. See also: Diffusion Model, VAE.

GELU (Gaussian Error Linear Unit) (Chapter 11) : An activation function defined as x * Phi(x), where Phi is the standard normal CDF. A common choice in transformer models such as BERT and GPT-2. See also: Activation Function, ReLU, SiLU.

Generalization (Chapter 6) : A model's ability to perform well on unseen data, not just the training data. See also: Overfitting, Regularization, Bias-Variance Tradeoff.

GPT (Chapter 14) : Generative Pre-trained Transformer. A family of autoregressive language models by OpenAI that demonstrated the power of large-scale pre-training followed by fine-tuning or prompting. See also: Causal Language Model, Pre-Training.

Gradient Accumulation (Chapter 35) : A technique for simulating larger batch sizes by accumulating gradients over multiple forward-backward passes before performing a parameter update. See also: Batch Size, Memory Efficiency.
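
A minimal PyTorch-style sketch (model, loss_fn, optimizer, and loader assumed defined):

    accum_steps = 4   # effective batch size = accum_steps * per-pass batch size
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated gradients average
        loss.backward()                            # gradients add up in each parameter's .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()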

Gradient Clipping (Chapter 7) : Limiting the magnitude of gradients during backpropagation to prevent exploding gradients. Typically clips the global norm to a maximum value. See also: Vanishing Gradient, Exploding Gradient.
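
In PyTorch, for instance, clipping sits between the backward pass and the parameter update (model, loss, and optimizer assumed defined):

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip global norm to 1.0
    optimizer.step()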

Guardrails (Chapter 32) : Safety mechanisms that constrain model inputs and outputs to prevent harmful, off-topic, or policy-violating content. Can be implemented as input/output filters, system prompts, or specialized classifiers. See also: Safety, Alignment, Red Teaming.

Hallucination (Chapter 15) : When a generative model produces fluent but factually incorrect or fabricated information. A major challenge for deployed LLM systems. See also: Grounding, RAG, Faithfulness.

Hyperparameter (Chapter 6) : A parameter of the learning algorithm (not learned from data) that must be set before training, such as learning rate, batch size, or number of layers. See also: Hyperparameter Tuning, Cross-Validation.

Hyperparameter Tuning (Chapter 6) : The process of searching for optimal hyperparameter values. Methods include grid search, random search, and Bayesian optimization. See also: Hyperparameter, Cross-Validation.

In-Context Learning (ICL) (Chapter 19) : The ability of large language models to learn new tasks from examples provided in the prompt, without any weight updates. See also: Few-Shot Learning, Prompt Engineering.

Inference (Chapter 33) : The process of using a trained model to make predictions on new inputs. In production, inference latency, throughput, and cost are critical concerns. See also: Serving, Latency, Throughput.

Information Retrieval (Chapter 20) : The task of finding relevant documents from a large collection given a query. Key techniques include BM25, dense retrieval, and hybrid methods. See also: RAG, Embedding, Vector Database.

Inter-Rater Reliability (Chapter 28) : See Annotator Agreement.

KL Divergence (Chapter 14) : Kullback-Leibler divergence. A measure of how one probability distribution differs from a reference distribution. Always non-negative and zero only when the distributions are identical. Not symmetric. See also: Cross-Entropy, Entropy, ELBO.
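
A minimal NumPy sketch for discrete distributions p and q (with q > 0 wherever p > 0):

    import numpy as np

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_i p_i * log(p_i / q_i)
        return np.sum(p * np.log(p / q))

    p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
    kl_divergence(p, q)  # ~0.51 nats; kl_divergence(q, p) ~0.37, illustrating the asymmetry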

Knowledge Distillation (Chapter 33) : A model compression technique where a smaller "student" model is trained to mimic the outputs (soft targets) of a larger "teacher" model. See also: Model Compression, Quantization, Pruning.

Latent Space (Chapter 14) : The lower-dimensional space in which an encoder represents its inputs. In generative models, sampling from the latent space produces new outputs. See also: Autoencoder, VAE, Embedding.

Layer Normalization (Chapter 11) : A normalization technique that normalizes across the feature dimension for each individual example (rather than across the batch). Standard in transformer architectures. See also: Batch Normalization, RMSNorm.

Learning Rate (Chapter 7) : The step size used in gradient-based optimization. Too large causes divergence; too small causes slow convergence. See also: Learning Rate Schedule, Adam Optimizer.

Learning Rate Schedule (Chapter 7) : A policy for changing the learning rate during training. Common schedules include cosine decay, linear warmup, and step decay. See also: Learning Rate, Warmup.

Likelihood (Chapter 5) : The probability of the observed data given a particular parameter value: P(data | theta). See also: Maximum Likelihood Estimation, Bayes' Theorem.

LLM (Large Language Model) (Chapter 14) : A language model with a very large number of parameters (typically billions), trained on vast amounts of text data. Examples include GPT-4, Claude, LLaMA, and Gemini. See also: Transformer, Pre-Training, Scaling Laws.

LoRA (Low-Rank Adaptation) (Chapter 16) : A parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen model weights, dramatically reducing the number of trainable parameters. See also: Adapter, QLoRA, Parameter-Efficient Fine-Tuning.
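
The core idea in a PyTorch-style sketch (dimensions and rank are illustrative):

    import torch

    d, r = 768, 8                                    # hidden size and low rank, r << d
    W = torch.randn(d, d)                            # frozen pre-trained weight (not trained)
    A = (0.01 * torch.randn(r, d)).requires_grad_()  # trainable low-rank factor
    B = torch.zeros(d, r, requires_grad=True)        # zero-init so the update starts at zero

    def lora_forward(x):
        # Effective weight is W + B @ A; only A and B (2 * d * r parameters) are trained
        return x @ W.T + (x @ A.T) @ B.T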

Loss Function (Chapter 4) : A function that measures the discrepancy between model predictions and ground truth. The objective that training seeks to minimize. See also: Cross-Entropy Loss, MSE Loss, Objective Function.

Masked Language Model (Chapter 12) : A pre-training objective where some input tokens are randomly masked and the model must predict them from surrounding context. Used by BERT. See also: BERT, Pre-Training, Self-Supervised Learning.

Maximum Likelihood Estimation (MLE) (Chapter 5) : A method for estimating model parameters by finding the values that maximize the likelihood of the observed data. See also: Likelihood, Cross-Entropy Loss.

Mixed Precision Training (Chapter 35) : Training with lower-precision floating-point formats (e.g., float16, bfloat16) for most operations while maintaining float32 for critical accumulations, reducing memory and computation. See also: Quantization, Memory Efficiency.

Model Card (Chapter 37) : A documentation artifact that describes a model's intended use, performance characteristics, limitations, ethical considerations, and training details. See also: Responsible AI.

Model Parallelism (Chapter 35) : A distributed training strategy where different parts of the model are placed on different devices. Necessary when a model is too large to fit on a single device. See also: Data Parallelism, Pipeline Parallelism, Tensor Parallelism.

Multi-Head Attention (Chapter 11) : An attention mechanism that runs multiple attention operations in parallel with different learned projections, allowing the model to attend to information from different representation subspaces. See also: Self-Attention, Attention Mechanism.

Multimodal (Chapter 23) : Relating to models that process or generate multiple types of data (text, images, audio, video) within a unified architecture. See also: CLIP, Vision-Language Model.

Nucleus Sampling (Top-p) (Chapter 15) : A decoding strategy that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p. See also: Temperature, Top-k Sampling, Beam Search.
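
A minimal NumPy sketch over a next-token distribution probs:

    import numpy as np

    def nucleus_sample(probs, p=0.9, rng=None):
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]            # token indices by descending probability
        cumulative = np.cumsum(probs[order])
        keep = np.searchsorted(cumulative, p) + 1  # smallest prefix whose mass exceeds p
        kept = order[:keep]
        return rng.choice(kept, p=probs[kept] / probs[kept].sum())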

Overfitting (Chapter 6) : When a model learns patterns specific to the training data that do not generalize to new data, resulting in high training performance but poor test performance. See also: Underfitting, Regularization, Bias-Variance Tradeoff.

Parameter-Efficient Fine-Tuning (PEFT) (Chapter 16) : A family of methods that adapt pre-trained models by training only a small fraction of parameters. Includes LoRA, adapters, prompt tuning, and prefix tuning. See also: LoRA, Adapter, Fine-Tuning.

PCA (Principal Component Analysis) (Chapter 6) : A dimensionality reduction technique that projects data onto the directions of maximum variance, computed via eigendecomposition of the covariance matrix. See also: Dimensionality Reduction, SVD.

Perplexity (Chapter 14) : A measure of how well a language model predicts a sample, computed as the exponential of the cross-entropy (base 2 or base e, matching the logarithm used). Lower perplexity indicates better prediction. See also: Cross-Entropy, Language Model.
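
For example, from per-token log-probabilities in nats:

    import numpy as np

    token_log_probs = np.array([-2.1, -0.3, -1.7, -0.9])  # illustrative values
    cross_entropy = -token_log_probs.mean()                # average negative log-likelihood
    perplexity = np.exp(cross_entropy)                     # ~3.49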

Pipeline Parallelism (Chapter 35) : A distributed training strategy where different layers of the model are placed on different devices, with micro-batches flowing through the pipeline. See also: Model Parallelism, Data Parallelism.

Positional Encoding (Chapter 11) : Information injected into a transformer to encode the position of each token in the sequence, since self-attention alone is permutation-equivariant. Approaches include sinusoidal and learned encodings added to token embeddings, and rotary embeddings (RoPE) applied to queries and keys. See also: RoPE, Transformer.

Posterior (Chapter 5) : The probability distribution of a parameter after observing data: P(theta | data). Obtained by applying Bayes' theorem. See also: Prior, Likelihood, Bayes' Theorem.

Pre-Training (Chapter 12) : Training a model on a large, general-purpose dataset before adapting it to specific downstream tasks. See also: Fine-Tuning, Transfer Learning, Self-Supervised Learning.

Prior (Chapter 5) : The probability distribution representing beliefs about a parameter before observing data: P(theta). See also: Posterior, Bayes' Theorem.

Prompt Engineering (Chapter 19) : The craft of designing input prompts to elicit desired behaviors from large language models. Techniques include few-shot examples, chain-of-thought, and role assignment. See also: In-Context Learning, Chain of Thought.

Pruning (Chapter 33) : A model compression technique that removes unnecessary weights or neurons from a trained model. Can be unstructured (individual weights) or structured (entire channels/heads). See also: Knowledge Distillation, Quantization.

QLoRA (Chapter 16) : Quantized LoRA. Combines 4-bit quantization of the base model with LoRA fine-tuning, enabling fine-tuning of very large models on consumer hardware. See also: LoRA, Quantization.

Quantization (Chapter 33) : Reducing the numerical precision of model weights and/or activations (e.g., from float32 to int8 or int4) to reduce memory footprint and increase inference speed. See also: Mixed Precision, GPTQ, AWQ, Model Compression.

RAG (Retrieval-Augmented Generation) (Chapter 26) : An architecture that combines a retrieval system with a generative model: relevant documents are retrieved from a knowledge base and provided as context to the generator. See also: Information Retrieval, Vector Database, Grounding.

Recall (Chapter 4) : The fraction of actual positive examples that are correctly identified by the model: TP / (TP + FN). See also: Precision, F1 Score.

Red Teaming (Chapter 37) : The practice of adversarially testing an AI system to discover failure modes, safety vulnerabilities, and harmful outputs. See also: Guardrails, Safety, Alignment.

Regression (Chapter 4) : A supervised learning task in which the model predicts a continuous numerical value. See also: Classification, MSE Loss.

Regularization (Chapter 6) : Techniques for preventing overfitting by constraining the model. Examples include L1/L2 weight penalties, dropout, early stopping, and data augmentation. See also: Overfitting, Dropout, Weight Decay.

ReLU (Rectified Linear Unit) (Chapter 7) : An activation function defined as f(x) = max(0, x). Simple and effective, though it can cause "dying neuron" problems. See also: Activation Function, GELU, Leaky ReLU.

Reward Model (Chapter 17) : A model trained on human preference data to predict which of two outputs a human would prefer. Used in RLHF to provide the reward signal. See also: RLHF, Alignment, DPO.

RLHF (Reinforcement Learning from Human Feedback) (Chapter 17) : A training methodology where a language model is fine-tuned using reinforcement learning with a reward signal derived from human preferences. See also: Reward Model, PPO, DPO, Alignment.

RMSNorm (Chapter 11) : Root Mean Square Layer Normalization. A simplified variant of layer normalization that normalizes by the RMS of activations without centering, used in LLaMA and other recent models. See also: Layer Normalization.

RoPE (Rotary Position Embedding) (Chapter 11) : A positional encoding method that encodes position information by rotating query and key vectors in pairs of dimensions, providing relative position awareness. Used in LLaMA and many modern LLMs. See also: Positional Encoding.

ROUGE (Chapter 15) : Recall-Oriented Understudy for Gisting Evaluation. A set of metrics for evaluating text summarization based on n-gram recall between generated and reference summaries. See also: BLEU Score, BERTScore.

Scaling Laws (Chapter 15) : Empirical relationships between model size, dataset size, compute budget, and model performance (loss). Kaplan et al. and Hoffmann et al. (Chinchilla) provide key results. See also: Compute Budget, LLM.

Self-Attention (Chapter 11) : An attention mechanism where queries, keys, and values all come from the same sequence, allowing each position to attend to all other positions. See also: Multi-Head Attention, Cross-Attention, Transformer.

Self-Supervised Learning (Chapter 13) : Learning representations from unlabeled data by creating supervision signals from the data itself (e.g., predicting masked tokens, predicting future tokens, contrastive objectives). See also: Pre-Training, Masked Language Model, Contrastive Learning.

Serving (Chapter 34) : Deploying a trained model as a service that accepts requests and returns predictions. Concerns include latency, throughput, scaling, and cost. See also: Inference, vLLM, TensorRT.

Sigmoid Function (Chapter 7) : The function sigma(x) = 1 / (1 + exp(-x)), which maps real values to (0, 1). Used in binary classification and gating mechanisms. See also: Activation Function, Logistic Regression.

SiLU (Sigmoid Linear Unit) (Chapter 11) : An activation function defined as x * sigmoid(x), also known as Swish. Used in many recent transformer architectures. See also: GELU, Activation Function.

Softmax (Chapter 4) : A function that converts a vector of real numbers into a probability distribution: softmax(z_i) = exp(z_i) / sum_j exp(z_j). See also: Cross-Entropy Loss, Temperature.
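
A numerically stable implementation subtracts the maximum logit before exponentiating:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))  # shifting by the max leaves the result unchanged
        return e / e.sum()

    softmax(np.array([2.0, 1.0, 0.1]))  # ~[0.66, 0.24, 0.10]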

Speculative Decoding (Chapter 36) : An inference acceleration technique where a smaller "draft" model proposes multiple tokens that are verified in parallel by the larger target model. See also: Inference, KV Cache.

Stochastic Gradient Descent (SGD) (Chapter 7) : An optimization algorithm that updates parameters using the gradient computed on a random mini-batch rather than the full dataset. See also: Adam Optimizer, Learning Rate, Mini-Batch.

System Prompt (Chapter 19) : A special prompt provided to a language model that sets its role, behavior guidelines, and constraints for all subsequent interactions in a conversation. See also: Prompt Engineering, Guardrails.

Temperature (Chapter 15) : A parameter that controls the randomness of sampling from a language model's output distribution. Higher temperature increases diversity; lower temperature increases determinism. Specifically, logits are divided by temperature before applying softmax. See also: Nucleus Sampling, Top-k Sampling.
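
A minimal NumPy sketch of temperature-scaled sampling:

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        scaled = logits / temperature      # T > 1 flattens the distribution, T < 1 sharpens it
        e = np.exp(scaled - scaled.max())  # stable softmax
        probs = e / e.sum()
        return rng.choice(len(probs), p=probs)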

Tensor Parallelism (Chapter 35) : A distributed training strategy where individual layers are split across multiple devices, with each device computing a portion of the layer's operations. See also: Model Parallelism, FSDP.

Tokenizer (Chapter 10) : A component that converts raw text into a sequence of token IDs that a model can process. Common algorithms include BPE (Byte Pair Encoding), WordPiece, and SentencePiece. See also: Vocabulary, Subword Tokenization.

Top-k Sampling (Chapter 15) : A decoding strategy that restricts sampling to the k most probable next tokens. See also: Nucleus Sampling, Temperature, Beam Search.

Transfer Learning (Chapter 12) : The practice of applying knowledge gained from one task or domain to a different but related task or domain. In modern AI, this typically means fine-tuning a pre-trained model. See also: Pre-Training, Fine-Tuning, Domain Adaptation.

Transformer (Chapter 11) : A neural network architecture based entirely on attention mechanisms, introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017). The foundation of modern NLP and increasingly used across all modalities. See also: Self-Attention, Multi-Head Attention, Positional Encoding.

t-SNE (Chapter 6) : t-distributed Stochastic Neighbor Embedding. A dimensionality reduction technique for visualization that preserves local neighborhood structure. See also: UMAP, PCA, Dimensionality Reduction.

UMAP (Chapter 6) : Uniform Manifold Approximation and Projection. A dimensionality reduction technique for visualization that often preserves global structure better than t-SNE and typically runs faster. See also: t-SNE, PCA.

Uncertainty Quantification (Chapter 18) : Methods for estimating the confidence or uncertainty of model predictions. Approaches include Bayesian methods, ensembles, Monte Carlo dropout, and conformal prediction. See also: Calibration, Conformal Prediction.

Underfitting (Chapter 6) : When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data. See also: Overfitting, Bias-Variance Tradeoff.

Variational Autoencoder (VAE) (Chapter 14) : A generative model that learns a latent representation by jointly training an encoder (inference network) and decoder (generative network), optimizing the Evidence Lower Bound (ELBO). See also: Autoencoder, Latent Space, KL Divergence, Reparameterization Trick.

Vector Database (Chapter 26) : A database optimized for storing, indexing, and querying high-dimensional vectors. Used in RAG systems for efficient similarity search over embeddings. Examples include Pinecone, Weaviate, Qdrant, Milvus, and ChromaDB. See also: RAG, Embedding, ANN Search.

Vision Transformer (ViT) (Chapter 22) : A transformer architecture adapted for image classification that splits images into patches, linearly embeds them, and processes them with a standard transformer encoder. See also: Transformer, Patch Embedding.

vLLM (Chapter 34) : An open-source high-throughput LLM serving library that uses PagedAttention for efficient memory management of the KV cache. See also: Serving, KV Cache, Inference.

Vocabulary (Chapter 10) : The fixed set of tokens that a model can recognize and generate. Vocabulary size is a key design choice affecting model capacity and tokenization granularity. See also: Tokenizer, Subword Tokenization.

Warmup (Chapter 7) : A learning rate schedule phase where the learning rate is gradually increased from a small value to its target value over the first few training steps, improving training stability. See also: Learning Rate Schedule.

Weight Decay (Chapter 7) : A regularization technique that adds a penalty proportional to the squared magnitude of parameters to the loss function (L2 regularization). In AdamW, weight decay is decoupled from the gradient update. See also: Regularization, L2 Regularization.

Word2Vec (Chapter 10) : A family of models (CBOW, Skip-gram) that learn word embeddings by predicting a word from its context or vice versa. Introduced the concept of dense word vectors. See also: Embedding, GloVe.

Zero-Shot Learning (Chapter 19) : Performing a task without any task-specific training examples, relying solely on the model's pre-trained knowledge and the task description. See also: Few-Shot Learning, In-Context Learning, Transfer Learning.