Glossary

Advanced Data Science: Deep Learning, Causal Inference, and Production Systems at Scale

This glossary defines every key technical term introduced in the book, organized alphabetically. Each entry includes the chapter where the term is first formally defined or substantively introduced. Cross-references point to related concepts.


A

Acquisition Function (Chapter 22)
A function that determines the next point to evaluate in Bayesian optimization by balancing exploration (regions of high uncertainty) and exploitation (regions of high predicted value). Common acquisition functions include Expected Improvement (EI), Upper Confidence Bound (UCB), and Probability of Improvement (PI). See also: Bayesian optimization, surrogate model.
Activation Function (Chapter 6)
A nonlinear function applied element-wise to the output of a neural network layer. Enables the network to learn nonlinear mappings. Common choices include ReLU, sigmoid, tanh, GELU, and Swish. See also: ReLU, GELU, vanishing gradient.
Adapter (Chapter 13)
A small trainable module inserted into a frozen pretrained model to enable parameter-efficient fine-tuning. Adapters add a small number of parameters per layer while keeping the original model weights fixed. See also: LoRA, parameter-efficient fine-tuning.
AdamW (Chapter 2)
An optimizer that combines Adam's adaptive learning rates with decoupled weight decay regularization. Unlike the original Adam with L2 regularization, AdamW applies weight decay directly to the parameters rather than to the gradient, which is more correct for adaptive methods. See also: Adam, weight decay, SGD.
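The decoupled update can be sketched in a few lines of numpy (a minimal single-step illustration, not a full training loop; the hyperparameter names are illustrative):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update. The weight decay term acts on the weights directly,
    instead of being added to the gradient as in Adam + L2 regularization."""
    m = b1 * m + (1 - b1) * g            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction for step t >= 1
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive gradient step
    w = w - lr * wd * w                  # decoupled weight decay
    return w, m, v
```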
Adjacency Matrix (Chapter 14)
A square matrix A where entry A[i,j] indicates whether nodes i and j are connected by an edge in a graph. For weighted graphs, entries represent edge weights. See also: graph, Laplacian matrix.
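For a small undirected graph, the matrix is easy to build by hand (numpy sketch; the example graph is illustrative):

```python
import numpy as np

# Undirected path graph 0 - 1 - 2, with edges (0, 1) and (1, 2)
A = np.zeros((3, 3))
for i, j in [(0, 1), (1, 2)]:
    A[i, j] = A[j, i] = 1.0   # symmetric for an undirected graph

degrees = A.sum(axis=1)        # row sums give node degrees
```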
All-Reduce (Chapter 26)
A distributed communication primitive where each worker contributes a tensor and all workers receive the sum (or other reduction) of all tensors. In data-parallel training, all-reduce synchronizes gradients across GPUs. Ring all-reduce is the most common communication-efficient variant. See also: data parallelism, DDP, NCCL.
Amortized Analysis (Chapter 5)
A technique for analyzing the average time complexity of a sequence of operations, even when individual operations may occasionally be expensive. Used to analyze data structures like dynamic arrays and hash tables. See also: Big-O notation.
Approximate Nearest Neighbors (ANN) (Chapter 5)
Algorithms that find approximately (not exactly) the nearest neighbors of a query point, trading accuracy for speed. Common methods include locality-sensitive hashing (LSH), HNSW, and IVF. Critical for recommendation serving at scale. See also: LSH, FAISS.
Architecture Decision Record (ADR) (Chapter 24)
A structured document that records a significant design decision, its context, the alternatives considered, and the rationale for the chosen option. Used to maintain institutional knowledge about ML system design choices. See also: ML system architecture.
Arithmetic Intensity (Chapter 5)
The ratio of floating-point operations to memory accesses in a computation. Determines whether a computation is compute-bound (high arithmetic intensity) or memory-bound (low arithmetic intensity). See also: roofline model, FLOPs.
Attention Mechanism (Chapter 9)
A mechanism that allows a neural network to focus on relevant parts of its input when producing each element of its output. Originally introduced for sequence-to-sequence models (Bahdanau, Luong), later generalized into self-attention in transformers. See also: self-attention, transformer.
Augmented IPW (AIPW) (Chapter 18)
A doubly robust causal estimator that combines outcome modeling with inverse probability weighting. Consistent if at least one of the two models (the propensity score model or the outcome model) is correctly specified; both need not be. See also: IPW, doubly robust estimation.
Autoregressive Generation (Chapter 11)
A generation strategy where each output token is produced one at a time, conditioned on all previously generated tokens. The standard generation approach for decoder-only language models. See also: next-token prediction, language model.
Average Treatment Effect (ATE) (Chapter 16)
The expected difference in potential outcomes across the entire population: E[Y(1) - Y(0)]. The most commonly targeted causal estimand. See also: ATT, ATU, CATE, potential outcomes.
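Under randomized assignment, the ATE is consistently estimated by a simple difference in means. A synthetic numpy sketch (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.integers(0, 2, size=n)                 # randomized treatment
y = 1.0 + 2.0 * t + rng.normal(size=n)         # true ATE is 2.0 by construction
ate_hat = y[t == 1].mean() - y[t == 0].mean()  # difference-in-means estimator
```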
Average Treatment Effect on the Treated (ATT) (Chapter 16)
The expected treatment effect among those who actually received treatment: E[Y(1) - Y(0) | T=1]. Differs from the ATE when treatment effects vary and treatment assignment is non-random. See also: ATE, ATU, selection bias.
Average Treatment Effect on the Untreated (ATU) (Chapter 16)
The expected treatment effect among those who did not receive treatment: E[Y(1) - Y(0) | T=0]. Important for policy decisions about expanding treatment to new populations. See also: ATE, ATT.

B

Backdoor Criterion (Chapter 17)
A graphical condition for identifying valid adjustment sets in causal DAGs. A set of variables Z satisfies the backdoor criterion relative to (X, Y) if Z blocks all backdoor paths from X to Y and no variable in Z is a descendant of X. See also: backdoor path, adjustment set, d-separation.
Backdoor Path (Chapter 17)
A non-causal path between treatment and outcome that begins with an arrow into the treatment variable. Backdoor paths transmit confounding bias and must be blocked for causal identification. See also: backdoor criterion, confounding.
Backpropagation (Chapter 2)
An algorithm for computing gradients of a scalar loss function with respect to all parameters in a computational graph, using the chain rule applied in reverse (output to input). Equivalent to reverse-mode automatic differentiation. See also: chain rule, computational graph, automatic differentiation.
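The reverse-mode idea fits in a few lines for the toy graph y = (w·x + b)² (pure-Python sketch; the variable values are illustrative):

```python
# Forward pass through the graph: z = w*x + b, then y = z**2
w, b, x = 2.0, 1.0, 3.0
z = w * x + b              # intermediate node, z = 7
y = z ** 2                 # scalar output, y = 49

# Reverse pass: apply the chain rule from output back to inputs
dy_dz = 2 * z              # local derivative of y with respect to z
dy_dw = dy_dz * x          # chain through z = w*x + b
dy_db = dy_dz * 1.0
```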
Backpropagation Through Time (BPTT) (Chapter 9)
The application of backpropagation to recurrent neural networks by "unrolling" the network across time steps. Truncated BPTT limits the number of time steps through which gradients flow. See also: RNN, vanishing gradient problem.
Bandwidth Selection (Chapter 18)
In regression discontinuity designs, the choice of how wide a window around the cutoff to use for estimation. Smaller bandwidths reduce bias but increase variance. See also: regression discontinuity.
Batch Normalization (Chapter 7)
A technique that normalizes layer activations across the mini-batch dimension during training. Stabilizes training, allows higher learning rates, and acts as a regularizer. During inference, uses running statistics instead of batch statistics. See also: layer normalization, group normalization.
Bayes Factor (Chapter 20)
The ratio of marginal likelihoods under two competing models. Quantifies the evidence provided by data in favor of one model over another. See also: marginal likelihood, Bayesian model comparison.
Bayesian Credible Interval (Chapter 20)
An interval [a, b] such that the posterior probability of the parameter lying within [a, b] equals a specified level (e.g., 95%). Unlike frequentist confidence intervals, credible intervals have a direct probability interpretation conditional on the observed data. See also: HPDI, posterior distribution.
Bayesian Optimization (Chapter 22)
A sequential strategy for optimizing expensive-to-evaluate black-box functions. Builds a probabilistic surrogate model (typically a Gaussian process) of the objective function and uses an acquisition function to decide where to evaluate next. See also: Gaussian process, acquisition function.
Behavioral Testing (Chapter 28)
A testing paradigm for ML models based on expected model behavior rather than aggregate accuracy metrics. Includes invariance tests (perturbations that should not change output), directional tests (perturbations with a known effect direction), and minimum functionality tests (basic cases the model must handle). See also: CheckList, invariance test.
Beta-Binomial (Chapter 20)
A conjugate Bayesian model where the prior on a Bernoulli success probability is a Beta distribution. After observing successes and failures, the posterior is also Beta-distributed. See also: conjugate prior, Bayesian inference.
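The update itself is one line of arithmetic; a minimal sketch, assuming scipy is available for the posterior quantiles:

```python
from scipy.stats import beta

a0, b0 = 1.0, 1.0                                # uniform Beta(1, 1) prior
successes, failures = 30, 70                     # illustrative data
a_post, b_post = a0 + successes, b0 + failures   # conjugate update
post_mean = a_post / (a_post + b_post)           # posterior mean
lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)  # 95% credible interval
```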
Big-O Notation (Chapter 5)
A mathematical notation describing the asymptotic upper bound of an algorithm's time or space complexity as input size grows. O(n log n) means the algorithm's resource usage grows no faster than a constant multiple of n log n. See also: time complexity, space complexity.
Blue-Green Deployment (Chapter 29)
A deployment strategy that maintains two identical production environments. Traffic is switched from the old (blue) environment to the new (green) environment instantaneously, with the ability to roll back by switching traffic back. See also: canary deployment, progressive rollout.
BPR Loss (Chapter 14)
Bayesian Personalized Ranking loss, an objective function for learning-to-rank in recommendation systems. Optimizes the pairwise ranking between observed positive items and sampled negative items. See also: link prediction, collaborative filtering.

C

Calibration (Chapter 34)
The property that a model's predicted probabilities match empirical frequencies. A well-calibrated model that predicts 70% probability should be correct approximately 70% of the time. Measured by Expected Calibration Error (ECE). See also: ECE, temperature scaling.
Canary Deployment (Chapter 29)
A deployment strategy that gradually shifts a small percentage of traffic (e.g., 5-10%) to a new model version while monitoring metrics. If metrics degrade, traffic is rolled back. See also: progressive rollout, shadow mode.
Causal DAG (Chapter 17)
A directed acyclic graph where directed edges represent direct causal relationships between variables. Encodes assumptions about the data-generating process and enables identification of causal effects. See also: DAG, structural causal model, d-separation.
Causal Forest (Chapter 19)
A modification of random forests designed to estimate heterogeneous treatment effects. Splits are optimized to maximize treatment effect heterogeneity (not prediction accuracy) across the resulting subgroups. Part of the Generalized Random Forest (GRF) framework. See also: CATE, heterogeneous treatment effect.
Causal Masking (Chapter 10)
An attention mask that prevents each position from attending to future positions in a sequence. Used in decoder-only transformer models to maintain the autoregressive property during training. See also: self-attention, decoder.
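The mask is just a strictly upper-triangular matrix of negative infinity added to the attention scores before the softmax (numpy sketch with placeholder scores):

```python
import numpy as np

n = 4
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
scores = np.zeros((n, n))                            # stand-in attention scores
scores[future] = -np.inf                             # block attention to the future
```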
Chain Rule (Chapter 2)
The calculus rule for computing derivatives of composite functions: d/dx f(g(x)) = f'(g(x)) * g'(x). The mathematical foundation of backpropagation, extended to multivariate functions via Jacobians. See also: backpropagation, Jacobian.
Changepoint Detection (Chapter 23)
Methods for identifying points in a time series where the statistical properties (mean, variance, distribution) change abruptly. Important for detecting regime changes, anomalies, and concept drift. See also: regime switching, concept drift.
CheckList (Chapter 28)
A behavioral testing methodology for NLP models (Ribeiro et al., 2020) that defines test types: Minimum Functionality Tests (MFT), Invariance Tests (INV), and Directional Expectation Tests (DIR). Adapted broadly for ML model testing. See also: behavioral testing.
Classifier-Free Guidance (Chapter 12)
A technique for controllable generation in diffusion models that combines conditional and unconditional score estimates during the reverse process. Higher guidance scales increase fidelity to the conditioning signal at the cost of diversity. See also: diffusion model.
CLIP (Chapter 13)
Contrastive Language-Image Pretraining. A foundation model trained to align image and text representations in a shared embedding space using contrastive learning on internet-scale image-caption pairs. See also: contrastive learning, foundation model.
Collider (Chapter 15)
A variable that is a common effect of two or more other variables. Conditioning on a collider opens a spurious association between its causes — a common source of selection bias. See also: Simpson's paradox, d-separation.
Computational Graph (Chapter 2)
A directed graph representing a mathematical computation, where nodes are operations or variables and edges represent data flow. The structure that automatic differentiation traverses to compute gradients. See also: backpropagation, automatic differentiation.
Concept Drift (Chapter 28)
A change in the relationship between model inputs and the target variable over time, causing model performance to degrade. Distinct from data drift (change in input distribution alone). See also: data drift, PSI.
Conditional Average Treatment Effect (CATE) (Chapter 19)
The expected treatment effect conditional on covariates X: E[Y(1) - Y(0) | X=x]. Captures how treatment effects vary across subpopulations. See also: heterogeneous treatment effect, causal forest, meta-learner.
Conformal Prediction (Chapter 34)
A distribution-free framework for constructing prediction sets with guaranteed finite-sample coverage. If the exchangeability assumption holds, a conformal prediction set at level 1-α contains the true outcome with probability at least 1-α. See also: calibration, prediction interval.
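Split conformal prediction for regression reduces to a quantile computation on held-out residuals; a minimal numpy sketch, where the ceiling correction supplies the finite-sample guarantee:

```python
import numpy as np

def split_conformal(cal_residuals, y_pred, alpha=0.1):
    """Return a symmetric (1 - alpha) prediction interval around y_pred,
    using absolute residuals from a held-out calibration set."""
    n = len(cal_residuals)
    level = np.ceil((n + 1) * (1 - alpha)) / n       # finite-sample correction
    q = np.quantile(np.abs(cal_residuals), level)
    return y_pred - q, y_pred + q
```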
Confounding (Chapter 15)
A situation where an observed association between treatment and outcome is distorted by a common cause (confounder) of both. The central challenge of observational causal inference. See also: confounder, backdoor path, Simpson's paradox.
Conjugate Prior (Chapter 20)
A prior distribution that, combined with a particular likelihood function, yields a posterior distribution in the same family. Enables closed-form Bayesian updating. See also: Beta-Binomial, Bayesian inference.
Contrastive Learning (Chapter 13)
A self-supervised learning approach that trains representations by pulling similar (positive) pairs close together and pushing dissimilar (negative) pairs apart in embedding space. InfoNCE is the standard contrastive loss. See also: CLIP, two-tower model.
Convolution (Chapter 8)
A mathematical operation that applies a learned kernel (filter) across the spatial dimensions of an input, computing weighted sums at each position. In CNNs, technically cross-correlation rather than convolution, but the terms are used interchangeably. See also: kernel, feature map, receptive field.
Cosine Annealing (Chapter 7)
A learning rate schedule that decreases the learning rate following a cosine curve from its initial value to near zero over a training cycle. Often combined with warm restarts. See also: learning rate schedule, one-cycle policy.
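The schedule itself is a single formula (a sketch without warm restarts; argument names are illustrative):

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate following half a cosine: lr_max at step 0, lr_min at the end."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```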
Cramér-Rao Bound (Chapter 3)
A lower bound on the variance of any unbiased estimator. The variance of an unbiased estimator is at least 1/I(θ), where I(θ) is the Fisher information. See also: Fisher information, MLE.
Cross-Entropy (Chapter 4)
The expected number of bits needed to encode data from distribution p using the optimal code for distribution q: H(p, q) = -E_p[log q]. Minimizing H(p, q) over q is equivalent to minimizing KL(p || q), and minimizing it against the empirical data distribution is equivalent to maximum likelihood estimation. See also: entropy, KL divergence, MLE.
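The identity H(p, q) = H(p) + KL(p || q) is easy to verify numerically (numpy sketch; base-2 logs so the units are bits, distributions illustrative):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])    # "true" distribution
q = np.array([0.25, 0.5, 0.25])    # model distribution
H_pq = -(p * np.log2(q)).sum()     # cross-entropy H(p, q)
H_p = -(p * np.log2(p)).sum()      # entropy H(p)
kl = (p * np.log2(p / q)).sum()    # KL(p || q)
```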
Cross-Fitting (Chapter 19)
A sample-splitting technique used in double/debiased machine learning where nuisance parameters are estimated on one fold and causal effects estimated on the other, then roles are swapped. Prevents overfitting bias. See also: double machine learning.
CUPED (Chapter 33)
Controlled-experiment Using Pre-Experiment Data. A variance reduction technique for A/B testing that uses pre-experiment covariates to adjust the treatment effect estimator. The adjusted metric's variance shrinks by a factor of 1 - ρ², where ρ is the correlation between the pre- and post-experiment metrics. See also: A/B testing, variance reduction.
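The adjustment is two lines of numpy (synthetic sketch; the data-generating numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)                       # pre-experiment covariate
y = 2.0 + 0.8 * x + rng.normal(size=n)       # experiment metric, correlated with x
theta = np.cov(y, x)[0, 1] / np.var(x)       # optimal adjustment coefficient
y_cuped = y - theta * (x - x.mean())         # same mean, lower variance
```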

D

d-Separation (Chapter 17)
A graphical criterion for reading conditional independence relationships from a causal DAG. Two sets of variables X and Y are d-separated given Z if every path between X and Y is blocked by Z. See also: causal DAG, conditional independence, backdoor criterion.
Dagster (Chapter 27)
A modern data orchestration framework based on software-defined assets. Each pipeline step defines the asset it produces, with dependencies inferred from asset relationships. See also: pipeline orchestration, Airflow.
Data Contract (Chapter 25)
A formal agreement between data producers and consumers specifying schema, quality expectations, freshness guarantees, and SLAs. Prevents silent data quality issues that cause production ML failures. See also: schema evolution, data validation.
Data Drift (Chapter 28)
A change in the distribution of model input features over time, without necessarily implying a change in the target relationship. Measured by PSI, KS statistic, or JS divergence. See also: concept drift, PSI.
Data Lakehouse (Chapter 25)
A data architecture that combines the flexibility of data lakes (storing raw, unstructured data) with the management features of data warehouses (ACID transactions, schema enforcement, indexing). Implemented by frameworks like Delta Lake, Apache Iceberg, and Apache Hudi. See also: data warehouse, data lake.
Data Parallelism (Chapter 26)
A distributed training strategy where each GPU holds a complete copy of the model and processes a different shard of the data. Gradients are synchronized across GPUs via all-reduce. See also: DDP, model parallelism.
Data Processing Inequality (Chapter 4)
An information-theoretic result stating that no processing of data can create information that was not already present: if X → Y → Z forms a Markov chain, then I(X; Z) ≤ I(X; Y). See also: mutual information, sufficient statistic.
Dead Neuron (Chapter 6)
A neuron whose activation is zero for all inputs, typically arising when a large update pushes a ReLU unit's pre-activation permanently negative (for example, via a large negative bias). Dead neurons consume parameters but contribute nothing to the network output. See also: ReLU, vanishing gradient.
DeepAR (Chapter 23)
An autoregressive deep learning model for probabilistic time series forecasting that outputs the parameters of a probability distribution at each time step, enabling uncertainty quantification. See also: probabilistic forecast, temporal fusion transformer.
Delta Lake (Chapter 25)
An open-source storage layer that brings ACID transactions, schema enforcement, and time-travel to data lakes. Built on Parquet files with a transaction log. See also: data lakehouse, Apache Iceberg.
Denoising Diffusion Probabilistic Model (DDPM) (Chapter 12)
A generative model that learns to reverse a gradual noising process. The forward process adds Gaussian noise over many steps until the data becomes pure noise; the reverse process learns to denoise step by step to generate samples. See also: diffusion model, forward process.
Depthwise Separable Convolution (Chapter 8)
A factorization of a standard convolution into a depthwise convolution (one filter per input channel) and a pointwise 1x1 convolution (combining channels). Reduces parameters and computation by a factor of approximately k², where k is the kernel size. See also: convolution, EfficientNet.
Difference-in-Differences (DiD) (Chapter 18)
A causal estimation method that compares the change in outcomes over time between a treated group and a control group. Requires the parallel trends assumption — that both groups would have followed the same trajectory absent treatment. See also: parallel trends, two-way fixed effects.
Diffusion Model (Chapter 12)
A class of generative models that learn to reverse a gradual noising process. The training objective is to predict the noise added at each step, and generation proceeds by iteratively denoising random noise into data. See also: DDPM, score matching, flow matching.
Directed Acyclic Graph (DAG) (Chapter 17)
A graph with directed edges and no cycles. In causal inference, DAGs encode assumptions about causal structure. In pipeline orchestration, DAGs represent task dependencies. See also: causal DAG, pipeline orchestration.
Direct Preference Optimization (DPO) (Chapter 11)
An alignment method for LLMs that eliminates the need for an explicit reward model by directly optimizing the policy on preference pairs. Simpler than RLHF while achieving comparable results. See also: RLHF, instruction tuning.
Divergence (MCMC) (Chapter 21)
A diagnostic indicating that the Hamiltonian Monte Carlo sampler encountered a region of the posterior where the curvature was too high for the step size. Divergences indicate unreliable posterior estimates and typically require reparameterization. See also: HMC, NUTS, R-hat.
Do-Operator (Chapter 17)
Pearl's notation for interventions: P(Y | do(X=x)) denotes the distribution of Y when X is set to value x by external intervention, as opposed to P(Y | X=x) which conditions on observing X=x. The most important inequality in causal inference: P(Y|do(X)) ≠ P(Y|X) in general. See also: do-calculus, causal DAG.
Double/Debiased Machine Learning (DML) (Chapter 19)
A framework for causal estimation that uses ML models for nuisance parameter estimation (propensity scores, outcome models) while maintaining valid statistical inference for the causal parameter. Based on Neyman orthogonality and cross-fitting. See also: Neyman orthogonality, cross-fitting.
Doubly Robust Estimation (Chapter 18)
A causal estimation strategy that combines outcome modeling and propensity score weighting. Provides consistent estimates if either the outcome model or the propensity score model is correctly specified. See also: AIPW, IPW.
Dropout (Chapter 7)
A regularization technique that randomly sets a fraction of neuron activations to zero during training. Prevents co-adaptation of features and can be interpreted as training an implicit ensemble of sub-networks. Inverted dropout scales surviving activations to maintain expected values. See also: regularization, ensemble.
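Inverted dropout is a one-liner in numpy (a training-time sketch; at inference nothing is applied):

```python
import numpy as np

def inverted_dropout(a, p_drop, rng):
    """Zero a fraction p_drop of activations and rescale the survivors
    by 1/(1 - p_drop) so the expected activation is unchanged."""
    keep = rng.random(a.shape) >= p_drop
    return a * keep / (1.0 - p_drop)
```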

E

Early Stopping (Chapter 7)
A regularization technique that stops training when validation performance begins to degrade, preventing overfitting. The model from the best validation epoch is selected as the final model. See also: overfitting, regularization.
Effective Sample Size (ESS) (Chapter 21)
A diagnostic for MCMC that estimates how many independent samples the chain's correlated samples are worth. Low ESS relative to the number of iterations indicates poor mixing. See also: MCMC, R-hat.
Eigendecomposition (Chapter 1)
The factorization of a square matrix A into A = QΛQ⁻¹, where Q contains the eigenvectors and Λ is the diagonal matrix of eigenvalues. The "X-ray" of a matrix, revealing its fundamental axes and scaling factors. See also: eigenvalue, eigenvector, SVD.
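The factorization can be checked directly in numpy (sketch with an illustrative symmetric matrix, for which the eigenvectors are orthogonal):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # symmetric 2x2 matrix
eigvals, Q = np.linalg.eigh(A)               # eigenvalues and eigenvector columns
A_rebuilt = Q @ np.diag(eigvals) @ np.linalg.inv(Q)   # A = Q Lambda Q^-1
```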
ELBO (Evidence Lower Bound) (Chapter 4)
A lower bound on the log-marginal-likelihood (evidence) of the data: log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z)]. Maximizing the ELBO is equivalent to minimizing KL(q(z) || p(z|x)). The optimization objective for VAEs and variational inference. See also: VAE, variational inference, KL divergence.
Embedding (Chapter 13)
A learned dense vector representation of a discrete entity (word, user, item) in a continuous space where geometric relationships capture semantic relationships. See also: sentence embedding, contrastive learning.
Entropy (Chapter 4)
The expected information content (or average surprise) of a random variable: H(X) = -Σ p(x) log p(x). Measures the uncertainty or "randomness" of a distribution. Maximum for uniform distributions. See also: cross-entropy, mutual information.
Evidence (Chapter 20)
See marginal likelihood.
Exclusion Restriction (Chapter 18)
The assumption that an instrumental variable affects the outcome only through the treatment variable. This assumption is untestable and must be justified on domain knowledge grounds. See also: instrumental variable, 2SLS.
Expected Calibration Error (ECE) (Chapter 34)
A scalar metric that measures miscalibration by computing the weighted average of the absolute difference between predicted confidence and actual accuracy across binned predictions. See also: calibration, temperature scaling.
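A minimal numpy sketch of the binned computation (equal-width bins; function and argument names are illustrative):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: the bin-weight-averaged absolute gap
    between mean confidence and empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap   # weight each bin by its share of samples
    return total
```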
Expected Improvement (EI) (Chapter 22)
An acquisition function for Bayesian optimization that measures the expected amount by which a candidate point will improve upon the current best observed value. Balances exploration and exploitation. See also: acquisition function, Bayesian optimization.
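For a Gaussian posterior, EI has a closed form; a stdlib-only sketch for maximization (the xi exploration parameter is a common but optional extra):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI under a Gaussian posterior N(mu, sigma^2).
    best is the incumbent value; xi trades off extra exploration."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))     # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best - xi) * cdf + sigma * pdf
```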
Experimentation Platform (Chapter 33)
An end-to-end system for running, monitoring, and analyzing online experiments (A/B tests). Includes randomization infrastructure, metric computation, statistical analysis, and guardrail monitoring. See also: A/B testing, CUPED.
Exploding Gradient (Chapter 6)
A training pathology where gradients grow exponentially during backpropagation, causing unstable parameter updates and numerical overflow. Mitigated by gradient clipping, proper initialization, and normalization. See also: vanishing gradient, gradient clipping.

F

FAISS (Chapter 5)
Facebook AI Similarity Search — a library for efficient similarity search and clustering of dense vectors. Implements ANN algorithms (IVF, HNSW, PQ) for billion-scale nearest neighbor retrieval. See also: approximate nearest neighbors.
Fairlearn (Chapter 31)
An open-source Python toolkit for assessing and improving the fairness of ML models. Provides fairness metrics, constraint-based mitigation algorithms (exponentiated gradient, threshold optimizer), and interactive dashboards. See also: fairness, disparate impact.
Faithfulness (Chapter 17)
The assumption that d-separation in the causal DAG is the only source of conditional independence in the data distribution. Rules out exact cancellation of causal effects. See also: d-separation, Markov condition.
Feature Map (Chapter 8)
The output of applying a convolutional filter to an input. Each filter produces one feature map that highlights the presence of a particular learned pattern at each spatial location. See also: convolution, kernel.
Feature Store (Chapter 25)
An infrastructure component that manages the computation, storage, and serving of features for ML models. Ensures consistency between features used during training (offline) and serving (online). See also: online-offline consistency, point-in-time join.
Fisher Information (Chapter 3)
A measure of the amount of information that an observable random variable carries about an unknown parameter. The expected curvature of the log-likelihood at the true parameter value. See also: Cramér-Rao bound, MLE.
Flash Attention (Chapter 10)
An exact attention algorithm that reduces memory usage from O(n²) to O(n) by tiling the computation to exploit the GPU memory hierarchy (computing in SRAM rather than reading/writing HBM). Produces identical results to standard attention but 2-4x faster. See also: self-attention, arithmetic intensity.
FLOPs (Chapter 5)
Floating-Point Operations. A measure of computational cost. For example, training GPT-3 required approximately 3.14 × 10²³ FLOPs. See also: arithmetic intensity, compute-bound.
Flow Matching (Chapter 12)
A modern generative modeling approach that learns a continuous-time flow from noise to data by regressing a velocity field. Simpler and more flexible than the discrete-time forward-reverse process of diffusion models. See also: diffusion model, normalizing flows.
Forward Process (Chapter 12)
In diffusion models, the process that gradually adds Gaussian noise to data over many time steps until the data is indistinguishable from pure noise. Defined by a noise schedule. The reverse process (learned by the model) inverts this transformation. See also: DDPM, noise schedule.
Front-Door Criterion (Chapter 17)
A graphical criterion for identifying causal effects when the backdoor criterion cannot be satisfied (because confounders are unobserved). Requires a mediator variable that fully mediates the effect and has no unblocked backdoor paths from the treatment. See also: backdoor criterion, do-operator.
Frobenius Norm (Chapter 1)
The matrix analog of the Euclidean norm: ||A||_F = sqrt(Σ_ij A_ij²). Equivalently, the square root of the sum of squared singular values. See also: singular value, matrix norm.
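Both characterizations are easy to check in numpy (illustrative matrix):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.sqrt((A ** 2).sum())                         # direct definition
singular_values = np.linalg.svd(A, compute_uv=False)
fro_from_svd = np.sqrt((singular_values ** 2).sum())  # same value via the SVD
```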
Fully Sharded Data Parallel (FSDP) (Chapter 26)
A memory-efficient distributed training strategy that shards model parameters, gradients, and optimizer states across GPUs (similar to DeepSpeed ZeRO Stage 3). Each GPU holds only a fraction of the model, gathering parameters on-demand for computation. See also: DDP, DeepSpeed.
Fundamental Problem of Causal Inference (Chapter 15)
The impossibility of observing both potential outcomes Y(0) and Y(1) for the same unit at the same time. Every causal inference method is, in some sense, an approach to circumventing this problem. See also: potential outcomes, counterfactual.

G

Gaussian Process (GP) (Chapter 22)
A distribution over functions, specified by a mean function and a kernel (covariance) function. Any finite collection of function values follows a multivariate Gaussian distribution. Used as a surrogate model in Bayesian optimization. See also: kernel function, Bayesian optimization.
Generative Adversarial Network (GAN) (Chapter 12)
A generative model consisting of two networks — a generator and a discriminator — trained in a minimax game. The generator learns to produce realistic samples, while the discriminator learns to distinguish real from generated samples. See also: mode collapse, Wasserstein GAN.
GELU (Chapter 6)
Gaussian Error Linear Unit. An activation function defined as x * Φ(x), where Φ is the standard Gaussian CDF. The default activation in modern transformers. Smoother than ReLU with non-zero gradients for negative inputs. See also: activation function, ReLU.
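The exact form needs only the error function from the standard library (tanh and sigmoid approximations are also common in practice):

```python
import math

def gelu(x):
    """Exact GELU: x times the standard Gaussian CDF Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```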
Generalized Random Forest (GRF) (Chapter 19)
A framework for non-parametric statistical estimation using forest-based methods. Causal forests are the most prominent instance. Provides valid confidence intervals for heterogeneous treatment effects. See also: causal forest, CATE.
Gradient Accumulation (Chapter 7)
A technique that accumulates gradients over multiple forward-backward passes before performing a parameter update, effectively simulating a larger batch size without requiring more GPU memory. See also: gradient checkpointing, mixed precision.
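A numpy sketch for a least-squares model: because gradients are linear in the batch, the sum of suitably scaled micro-batch gradients equals the full-batch gradient, so one update after accumulation simulates the larger batch (data and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
w = np.zeros(3)
accum_steps, lr = 4, 0.1

grad = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    grad += Xb.T @ (Xb @ w - yb) / len(X)    # micro-batch grad, full-batch scale
w_accum = w - lr * grad                       # single update after accumulation

w_full = w - lr * (X.T @ (X @ w - y) / len(X))   # reference: one full-batch step
```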
Gradient Checkpointing (Chapter 26)
A memory-saving technique that trades compute for memory: only a subset of intermediate activations is stored during the forward pass, and the rest are recomputed during the backward pass. With checkpoints every sqrt(n) layers, activation memory drops from O(n) to O(sqrt(n)) for an n-layer network. See also: GPU memory, activation.
Gradient Clipping (Chapter 7)
A technique that limits the magnitude of gradients during training to prevent exploding gradients. Implemented either by clipping the global norm or by clipping individual gradient values. See also: exploding gradient, training stability.
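Global-norm clipping can be sketched in numpy (framework optimizers provide this built in; the function name here is illustrative):

```python
import numpy as np

def clip_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm
```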
Grad-CAM (Chapter 8)
Gradient-weighted Class Activation Mapping. A visualization technique that highlights the spatial regions of an input image most relevant to a CNN's prediction by computing gradient-weighted feature map activations. See also: interpretability, feature map.
Graph Attention Network (GAT) (Chapter 14)
A graph neural network that uses attention mechanisms to learn different importance weights for different neighbors during message aggregation. See also: GCN, GraphSAGE, attention.
Graph Convolutional Network (GCN) (Chapter 14)
A neural network layer that updates each node's representation by aggregating representations from its neighbors, weighted by the normalized adjacency matrix. The simplest message-passing GNN. See also: message passing, GraphSAGE.
Graph Isomorphism Network (GIN) (Chapter 14)
A GNN architecture proved to be as powerful as the 1-WL graph isomorphism test. Uses sum aggregation (not mean) to maximize expressiveness. See also: Weisfeiler-Leman test, message passing.
GraphSAGE (Chapter 14)
A scalable GNN that learns node representations by sampling and aggregating features from a fixed-size neighborhood. Unlike GCN, supports inductive learning on unseen nodes. See also: GCN, GAT.
Great Expectations (Chapter 28)
An open-source Python framework for data validation. Allows users to define "expectations" (assertions about data properties) that are tested automatically as part of data pipelines. See also: data validation, Pandera.

H

Hallucination (Chapter 11)
The tendency of language models to generate plausible-sounding but factually incorrect or unsupported text. A fundamental property of probabilistic generation, not a bug to be simply fixed. See also: RAG, LLM.
Hamiltonian Monte Carlo (HMC) (Chapter 21)
An MCMC sampling method that uses gradient information to guide proposals, resulting in much more efficient exploration of the posterior than random-walk methods. See also: NUTS, MCMC.
Hessian Matrix (Chapter 2)
The square matrix of second partial derivatives of a scalar function: H_ij = ∂²f / ∂x_i ∂x_j. Captures the curvature of the loss landscape and determines the nature of critical points (minima, maxima, saddle points). See also: gradient, Jacobian, saddle point.
Heterogeneous Treatment Effect (HTE) (Chapter 19)
Variation in treatment effects across individuals or subgroups. When HTEs are present, the ATE may be a poor summary because some subgroups benefit much more (or less) than others. See also: CATE, causal forest.
Hidden Markov Model (HMM) (Chapter 23)
A state-space model with discrete latent states, where the system transitions between states according to a Markov chain and observations are conditionally independent given the current state. See also: state-space model, Kalman filter.
Hierarchical Model (Chapter 21)
A Bayesian model with multiple levels of parameters, where parameters at one level are drawn from distributions governed by higher-level hyperparameters. Enables partial pooling between groups. See also: partial pooling, hyperprior, shrinkage.
Highest Posterior Density Interval (HPDI) (Chapter 20)
The narrowest interval containing a specified probability mass of the posterior distribution. Unlike equal-tailed credible intervals, HPDIs always include the most probable values. See also: Bayesian credible interval.
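A small NumPy sketch of the standard narrowest-window computation over posterior draws (ArviZ provides a production version; this assumes a unimodal posterior):

```python
import numpy as np

def hpdi(samples, prob=0.9):
    # Slide a window of ceil(prob * n) sorted draws and keep the narrowest one.
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(prob * n))
    widths = s[k - 1:] - s[: n - k + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

rng = np.random.default_rng(1)
draws = rng.gamma(2.0, 1.0, size=10_000)   # right-skewed "posterior"
lo, hi = hpdi(draws, 0.9)
eq_width = np.quantile(draws, 0.95) - np.quantile(draws, 0.05)  # equal-tailed width
# For skewed posteriors the HPDI is narrower than the equal-tailed interval.
```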
Hyperband (Chapter 22)
A hyperparameter optimization algorithm that extends successive halving by running it with multiple budget configurations, eliminating the need to choose the trade-off between number of configurations and budget per configuration. See also: Optuna, TPE.

I

Idempotency (Chapter 27)
The property that running an operation multiple times with the same input produces the same result and no additional side effects. The most important design principle for data pipelines. See also: pipeline orchestration.
Ignorability (Chapter 16)
The assumption that treatment assignment is independent of potential outcomes conditional on observed covariates: Y(0), Y(1) ⫫ T | X. Also called "unconfoundedness" or "selection on observables." This assumption is untestable. See also: potential outcomes, confounding.
Importance Sampling (Chapter 3)
A Monte Carlo technique that estimates expectations under one distribution by sampling from a different (proposal) distribution and reweighting. Used when the target distribution is difficult to sample from directly. See also: Monte Carlo estimation.
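A hedged NumPy sketch of the self-normalized estimator (the Gaussians here are chosen for illustration; in practice the target is something you cannot sample directly):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Pretend target p = N(3, 1) is hard to sample; draw from proposal q = N(0, 2) instead.
x = rng.normal(0.0, 2.0, size=200_000)
log_w = log_normal_pdf(x, 3.0, 1.0) - log_normal_pdf(x, 0.0, 2.0)
w = np.exp(log_w)                  # importance weights p(x) / q(x)
est = np.sum(w * x) / np.sum(w)    # self-normalized estimate of E_p[X], true value 3
```

Computing the weight in log space before exponentiating avoids underflow when the densities are small.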
Information Bottleneck (Chapter 4)
A theoretical framework for understanding neural network representations. A hidden layer should compress the input (minimize I(X; T)) while preserving information about the label (maximize I(T; Y)). See also: mutual information, data processing inequality.
InfoNCE (Chapter 13)
A contrastive loss function derived from noise-contrastive estimation. Maximizes agreement between positive pairs while minimizing agreement with negative samples. The standard loss for modern contrastive learning. See also: contrastive learning, two-tower model.
Instrumental Variable (IV) (Chapter 18)
A variable that affects the treatment (relevance), affects the outcome only through the treatment (exclusion restriction), and is independent of unmeasured confounders. Enables causal estimation when unobserved confounding exists. See also: 2SLS, exclusion restriction.
Inverse Probability Weighting (IPW) (Chapter 18)
A causal estimation technique that reweights observations by the inverse of their treatment propensity to create a pseudo-population where treatment is independent of confounders. See also: propensity score, AIPW.
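A simulated NumPy sketch: with a binary confounder, the naive difference in means is biased, while inverse-propensity weighting recovers the true effect. For clarity the sketch uses the true propensity e(x); in practice it must be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.random(n) < 0.5                      # binary confounder
e = np.where(x, 0.8, 0.2)                    # true propensity P(T=1 | X)
t = rng.random(n) < e                        # treatment assignment
y = 2.0 * t + 3.0 * x + rng.normal(size=n)   # true treatment effect is 2

naive = y[t].mean() - y[~t].mean()           # biased: treated units have higher X
ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))  # close to 2
```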

J

Jacobian Matrix (Chapter 2)
The matrix of first partial derivatives of a vector-valued function: J_ij = ∂f_i / ∂x_j. In ML, the Jacobian of a layer maps input perturbations to output perturbations. See also: Hessian, chain rule, backpropagation.

K

Kalman Filter (Chapter 23)
An optimal recursive estimator for linear Gaussian state-space models. Alternates between predict (propagate state forward) and update (incorporate new observation) steps. The Bayesian interpretation: predict gives the prior, update gives the posterior. See also: state-space model, HMM.
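The predict/update recursion is easiest to see in one dimension. A minimal sketch for a scalar random-walk model (the noise variances `q` and `r` are illustrative choices):

```python
import numpy as np

def kalman_1d(zs, q=1e-3, r=0.25, x0=0.0, p0=1.0):
    # Model: x_t = x_{t-1} + w_t (var q), observation z_t = x_t + v_t (var r).
    x, p, out = x0, p0, []
    for z in zs:
        p = p + q               # predict: propagate state uncertainty forward
        k = p / (p + r)         # Kalman gain: how much to trust the observation
        x = x + k * (z - x)     # update: blend prediction with observation
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
truth = 1.0
zs = truth + rng.normal(scale=0.5, size=200)  # noisy measurements of a constant
est = kalman_1d(zs)                           # filtered estimates, much less noisy
```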
Kernel Function (GP) (Chapter 22)
A function k(x, x') that defines the covariance between function values at points x and x' in a Gaussian process. Common kernels include RBF (squared exponential), Matérn, and rational quadratic. See also: Gaussian process, RBF kernel.
KL Divergence (Chapter 4)
Kullback-Leibler divergence. A measure of how one probability distribution p differs from a reference distribution q: D_KL(p || q) = E_p[log(p/q)]. Non-negative by Gibbs' inequality and zero only when p = q. Not symmetric. See also: cross-entropy, entropy, ELBO.
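For discrete distributions the definition is a one-liner; a NumPy sketch that also demonstrates the asymmetry:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats; terms with p_i = 0 contribute 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p, q = [0.5, 0.5], [0.9, 0.1]
forward = kl(p, q)   # positive
reverse = kl(q, p)   # a different positive number: KL is not symmetric
self_kl = kl(p, p)   # zero exactly when the distributions match
```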
Knowledge Graph (Chapter 14)
A graph where nodes represent entities and edges represent typed relationships (e.g., "Paris" → "capital_of" → "France"). Used for structured knowledge representation and reasoning. See also: graph, heterogeneous graph.
KV-Cache (Chapter 10)
A caching mechanism in transformer inference that stores the key and value tensors from previous time steps, avoiding recomputation during autoregressive generation. Reduces inference complexity from O(n²) per token to O(n) per token. See also: self-attention, autoregressive generation.

L

LAMB (Chapter 26)
Layer-wise Adaptive Moments optimizer for Batch training. Extends Adam with layer-wise learning rate adaptation, enabling stable training at very large batch sizes (up to 64K). See also: LARS, large-batch training.
Language Model (Chapter 11)
A probabilistic model that assigns probabilities to sequences of tokens. Modern language models are typically autoregressive transformer decoders trained on next-token prediction. See also: autoregressive generation, perplexity.
Laplacian Matrix (Chapter 14)
The matrix L = D - A, where D is the degree matrix and A is the adjacency matrix. Its eigenvalues encode graph connectivity properties and form the basis of spectral graph theory. See also: adjacency matrix, spectral graph theory.
LARS (Chapter 26)
Layer-wise Adaptive Rate Scaling. An optimizer that scales the learning rate for each layer based on the ratio of the parameter norm to the gradient norm. Enables stable training at large batch sizes. See also: LAMB, large-batch training.
Layer Normalization (Chapter 7)
A normalization technique that normalizes across the feature dimension for each individual sample, rather than across the batch. The default normalization in transformer architectures. See also: batch normalization, group normalization.
Learning Rate Schedule (Chapter 7)
A strategy for adjusting the learning rate during training. Common schedules include step decay, cosine annealing, warmup followed by decay, and the one-cycle policy. See also: cosine annealing, warmup.
LightGCN (Chapter 14)
A simplified GCN architecture for collaborative filtering that removes nonlinear activation functions and feature transformation, keeping only the neighborhood aggregation. Achieves strong recommendation performance with minimal computation. See also: GCN, collaborative filtering.
Linear Scaling Rule (Chapter 26)
The heuristic that when increasing batch size by a factor k, the learning rate should also be increased by a factor k. Requires learning rate warmup to work in practice. See also: large-batch training, warmup.
Locality-Sensitive Hashing (LSH) (Chapter 5)
A family of hashing techniques where similar items are more likely to be mapped to the same hash bucket. Enables approximate nearest neighbor search in sublinear time. See also: ANN, FAISS.
LoRA (Chapter 11)
Low-Rank Adaptation. A parameter-efficient fine-tuning method that freezes pretrained weights and injects trainable low-rank decomposition matrices into each layer. Reduces trainable parameters by 100-1000x while maintaining fine-tuning quality. See also: QLoRA, PEFT, adapter.
Loss Landscape (Chapter 2)
The high-dimensional surface defined by the loss function over the parameter space. Properties of the loss landscape (curvature, saddle points, local minima, flat regions) determine optimization difficulty. See also: saddle point, convexity.

M

Marginal Likelihood (Chapter 20)
The probability of the data under a model, integrating over all possible parameter values: p(D) = ∫ p(D|θ) p(θ) dθ. Used for Bayesian model comparison. Also called "model evidence." See also: Bayes factor.
Markov Chain Monte Carlo (MCMC) (Chapter 21)
A class of algorithms that draw samples from a target distribution (typically a posterior) by constructing a Markov chain whose stationary distribution is the target. See also: HMC, NUTS, Metropolis-Hastings.
Matrix Factorization (Chapter 1)
Decomposing a matrix R into the product of two or more lower-rank matrices: R ≈ UV^T. In recommendation systems, factorizes the user-item interaction matrix into user and item latent factor matrices. See also: SVD, collaborative filtering.
Maximum Likelihood Estimation (MLE) (Chapter 3)
Finding the parameter values that maximize the probability (likelihood) of the observed data: θ_MLE = argmax_θ p(D|θ). Equivalent to minimizing negative log-likelihood, which for classification is cross-entropy. See also: MAP, cross-entropy, Fisher information.
Message Passing (Chapter 14)
The computational framework underlying most GNNs: each node updates its representation by aggregating "messages" from its neighbors. Different GNN architectures (GCN, GAT, GraphSAGE) differ in how messages are computed and aggregated. See also: GCN, neighborhood aggregation.
Meta-Learner (Chapter 19)
A framework for estimating heterogeneous treatment effects using standard ML models as building blocks. S-learner (single model), T-learner (two separate models), X-learner (cross-learner), and R-learner (residual learner) differ in how they decompose the CATE estimation problem. See also: CATE, heterogeneous treatment effect.
Mixed Precision Training (Chapter 7)
Training neural networks using lower-precision floating-point formats (fp16 or bf16) for forward and backward passes while maintaining fp32 master weights. Reduces memory usage and increases throughput on GPUs with tensor cores. See also: AMP, loss scaling.
MLflow (Chapter 29)
An open-source platform for managing the ML lifecycle, including experiment tracking, model packaging, model registry, and deployment. See also: model registry, experiment tracking.
MLOps Maturity Levels (Chapter 29)
A framework for assessing the maturity of an organization's ML operations. Level 0: manual everything. Level 1: ML pipeline automation. Level 2: CI/CD pipeline automation. Level 3: automated retraining with monitoring. See also: CI/CD for ML, continuous training.
Mode Collapse (Chapter 12)
A GAN failure mode where the generator produces only a few distinct outputs rather than capturing the full diversity of the data distribution. The generator "collapses" to a mode of the distribution. See also: GAN, Wasserstein GAN.
Model Parallelism (Chapter 26)
A distributed training strategy where different parts of the model reside on different GPUs. Required when the model is too large to fit in a single GPU's memory. See also: pipeline parallelism, tensor parallelism, data parallelism.
Model Registry (Chapter 29)
A centralized store for managing trained model versions, their metadata, lineage, and lifecycle stage (staging, production, archived). See also: MLflow, model versioning.
Monte Carlo Estimation (Chapter 3)
Approximating an expectation by the sample average of function evaluations at random points drawn from the relevant distribution. Convergence rate is O(1/√n) regardless of dimensionality. See also: importance sampling, MCMC.
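A quick NumPy check of the O(1/√n) behavior: estimating E[X²] = 1 for X ~ N(0, 1), the error shrinks by roughly 10× for every 100× more samples.

```python
import numpy as np

# Estimate E[X^2] = 1 for X ~ N(0, 1); error shrinks like O(1/sqrt(n)),
# regardless of the dimension of the underlying integral.
rng = np.random.default_rng(0)
errs = [abs(np.mean(rng.normal(size=n) ** 2) - 1.0)
        for n in (100, 10_000, 1_000_000)]
```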
mSPRT (Chapter 33)
Mixture Sequential Probability Ratio Test. A sequential testing method that allows continuous monitoring of A/B tests without inflating Type I error rates. Based on a mixture of likelihood ratios. See also: sequential testing, experimentation platform.
Multi-Head Attention (Chapter 10)
Running multiple attention operations in parallel with different learned projections, then concatenating the results. Each "head" can learn to attend to different types of relationships (syntactic, semantic, positional). See also: self-attention, transformer.
Mutual Information (Chapter 4)
The amount of information that one random variable contains about another: I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). Unlike correlation, captures nonlinear dependencies. See also: entropy, data processing inequality.

N

N-BEATS (Chapter 23)
Neural Basis Expansion Analysis for interpretable Time Series. A pure deep learning architecture for time series forecasting that uses backward and forward residual connections with basis expansion. See also: TFT, DeepAR.
NCCL (Chapter 26)
NVIDIA Collective Communications Library. A library for multi-GPU and multi-node communication primitives (all-reduce, broadcast, etc.) optimized for NVIDIA GPUs and high-speed interconnects. See also: all-reduce, DDP.
Neyman Orthogonality (Chapter 19)
A mathematical property ensuring that the causal estimand is insensitive (to first order) to errors in nuisance parameter estimation. The theoretical foundation of double/debiased machine learning. See also: DML, cross-fitting.
Next-Token Prediction (Chapter 11)
The pretraining objective for autoregressive language models: given a sequence of tokens, predict the probability distribution over the next token. Equivalent to maximum likelihood estimation on the training corpus. See also: language model, perplexity.
No-U-Turn Sampler (NUTS) (Chapter 21)
An extension of Hamiltonian Monte Carlo that eliminates the need to hand-tune the trajectory length parameter. Automatically determines when the trajectory begins to "turn around" (U-turn). The default sampler in PyMC and Stan. See also: HMC, MCMC.
Noise Schedule (Chapter 12)
In diffusion models, the schedule that controls how much noise is added at each step of the forward process. Linear, cosine, and learned schedules produce different generation quality. See also: forward process, DDPM.
Normalizing Flows (Chapter 12)
A class of generative models that transform a simple base distribution (e.g., Gaussian) into a complex data distribution through a sequence of invertible transformations with tractable Jacobian determinants. See also: flow matching, generative model.
Null Space (Chapter 1)
The set of all vectors x such that Ax = 0. Its dimension equals the number of columns minus the rank. Captures the "directions that the matrix ignores." See also: rank, column space.

O

One-Cycle Policy (Chapter 7)
A learning rate schedule that increases the learning rate from a low value to a maximum over the first portion of training, then decreases it below the initial value. Empirically enables faster convergence and better generalization. See also: learning rate schedule, cosine annealing.
Online-Offline Consistency (Chapter 25)
The requirement that features computed during model training (offline, on historical data) are identical to features computed during model serving (online, in real-time). Violations cause training-serving skew. See also: feature store, training-serving skew.
Opacus (Chapter 32)
A PyTorch library for training models with differential privacy (DP-SGD). Clips per-sample gradients and adds calibrated Gaussian noise to ensure (ε, δ)-differential privacy. See also: differential privacy, DP-SGD.
Optuna (Chapter 22)
A hyperparameter optimization framework that uses Tree-structured Parzen Estimators (TPE) as the default Bayesian method. Supports pruning (early stopping of unpromising trials) and integration with common ML frameworks. See also: TPE, Hyperband.
Over-Smoothing (Chapter 14)
A pathology in deep GNNs where stacking too many layers causes all node representations to converge to similar values, losing discriminative power. Analogous to information loss in deep networks. See also: GCN, message passing.

P

Pandera (Chapter 28)
A Python library for DataFrame validation that integrates with pandas and supports statistical hypothesis testing on columns. More Python-native than Great Expectations. See also: data validation, Great Expectations.
Parallel Trends Assumption (Chapter 18)
The identifying assumption in difference-in-differences: absent treatment, the treated and control groups would have followed the same trend over time. Untestable but can be assessed with pre-treatment data. See also: DiD.
Partial Pooling (Chapter 21)
The Bayesian hierarchical modeling strategy where group-level estimates are "pulled" toward the global mean, with the degree of pooling determined by group sample size and variability. Groups with little data are pulled more. See also: hierarchical model, shrinkage.
Perplexity (Chapter 11)
A measure of how well a language model predicts a text, defined as 2^H(p,q), where H is the cross-entropy measured in bits (equivalently, e raised to the cross-entropy in nats). Lower perplexity means better prediction. Interpretable as the effective number of equally likely next tokens. See also: cross-entropy, language model.
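The "effective number of choices" interpretation in a two-line NumPy sketch (natural log and base-2 log give the same perplexity as long as the exponential uses the matching base):

```python
import numpy as np

# A model assigning probability 1/4 to each observed token has perplexity 4:
# it is exactly as uncertain as a uniform choice among four tokens.
token_probs = np.array([0.25, 0.25, 0.25, 0.25])
ppl = float(np.exp(-np.mean(np.log(token_probs))))
```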
Pipeline Parallelism (Chapter 26)
A distributed training strategy that splits the model into sequential stages, each on a different GPU, and pipelines micro-batches to maintain high utilization. See also: model parallelism, tensor parallelism.
Point-in-Time Join (Chapter 25)
A join operation that retrieves feature values as they existed at the time a training example was created, preventing future information from leaking into historical training data. Critical for temporal correctness in feature engineering. See also: feature store, data leakage.
Population Stability Index (PSI) (Chapter 28)
A metric that quantifies how much a variable's distribution has shifted between two datasets (typically training vs. production). PSI > 0.25 generally indicates significant shift. See also: data drift, concept drift.
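A minimal NumPy sketch of the usual quantile-bucket computation (bucket count and the 1e-6 floor are common conventions, not a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin on the reference data's quantiles, then compare per-bucket shares.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    a = np.clip(actual, edges[0], edges[-1])       # out-of-range values -> edge buckets
    e_share = np.histogram(expected, edges)[0] / len(expected)
    a_share = np.histogram(a, edges)[0] / len(actual)
    e_share = np.clip(e_share, 1e-6, None)         # avoid log(0) on empty buckets
    a_share = np.clip(a_share, 1e-6, None)
    return float(np.sum((a_share - e_share) * np.log(a_share / e_share)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=50_000)
same = rng.normal(0.0, 1.0, size=50_000)      # no shift: PSI near 0
shifted = rng.normal(1.0, 1.0, size=50_000)   # one-sigma mean shift: PSI well above 0.25
```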
Positional Encoding (Chapter 10)
Information added to token embeddings to inject sequence order, which self-attention cannot infer on its own. Types include sinusoidal (fixed), learned, and rotary (RoPE). See also: transformer, RoPE.
Posterior Distribution (Chapter 20)
The probability distribution of a parameter after observing data, computed via Bayes' theorem: p(θ|D) ∝ p(D|θ)p(θ). Represents updated beliefs about the parameter. See also: prior, likelihood, Bayesian inference.
Potential Outcomes (Chapter 16)
The pair of outcomes Y(0) and Y(1) that a unit would experience under control and treatment, respectively. Only one is ever observed (the fundamental problem). The Rubin causal model framework. See also: fundamental problem, ATE.
Prior Distribution (Chapter 20)
The probability distribution representing beliefs about a parameter before observing data. Choices range from uninformative (minimal prior knowledge) to informative (strong domain knowledge). See also: posterior, conjugate prior.
Probabilistic Forecast (Chapter 23)
A forecast that outputs a full probability distribution (or prediction intervals) rather than a single point prediction. Essential when downstream decisions depend on uncertainty. See also: conformal prediction, calibration.
Progressive Rollout (Chapter 29)
A deployment strategy that gradually increases the percentage of traffic served by a new model version (e.g., 10% → 25% → 50% → 100%) while monitoring metrics at each stage. See also: canary deployment, blue-green deployment.
Prompt Engineering (Chapter 11)
The practice of crafting input prompts to elicit desired behavior from large language models. Techniques include zero-shot, few-shot, chain-of-thought, and structured output formats. See also: few-shot learning, chain-of-thought.
Propensity Score (Chapter 18)
The probability of receiving treatment given observed covariates: e(x) = P(T=1 | X=x). Balancing on the propensity score is sufficient to remove confounding from observed covariates. See also: IPW, propensity score matching.
PyMC (Chapter 21)
A probabilistic programming framework for Bayesian modeling in Python. Supports model specification, MCMC sampling (NUTS), variational inference, and posterior analysis via ArviZ. See also: NUTS, ArviZ.
PyTorch Geometric (Chapter 14)
A library built on PyTorch for deep learning on graphs and other irregular structures. Provides standard GNN layers (GCN, GAT, GraphSAGE), data loaders for graphs, and benchmark datasets. See also: GNN, message passing.

Q

Qini Curve (Chapter 19)
A diagnostic plot for uplift models that shows the cumulative incremental effect as a function of the fraction of the population treated (ordered by estimated uplift). The causal analog of the lift curve. See also: uplift modeling, CATE.
QLoRA (Chapter 11)
Quantized LoRA. Combines 4-bit quantization of the base model with LoRA adaptation, enabling fine-tuning of large language models on a single consumer GPU. See also: LoRA, quantization.
Quantile Regression (Chapter 23)
A regression method that estimates conditional quantiles (e.g., 10th, 50th, 90th percentile) of the response variable rather than the conditional mean. Produces prediction intervals without distributional assumptions. See also: probabilistic forecast.
Quantization (Chapter 11)
Reducing the numerical precision of model weights and/or activations (e.g., from fp32 to INT8 or INT4) to reduce memory footprint and increase inference speed. Methods include post-training quantization (GPTQ, AWQ) and quantization-aware training. See also: QLoRA, mixed precision.

R

R-hat (Gelman-Rubin) (Chapter 21)
A convergence diagnostic for MCMC that compares within-chain and between-chain variance. Values close to 1.0 (typically < 1.01) indicate convergence; values substantially above 1.0 indicate the chains have not mixed. See also: MCMC, ESS.
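A NumPy sketch of the classic (non-split) Gelman-Rubin statistic; ArviZ's `rhat` implements the modern rank-normalized split version, so treat this as the core idea only:

```python
import numpy as np

def rhat(chains):
    # chains: (m, n) array of m chains with n draws each.
    m, n = chains.shape
    means = chains.mean(axis=1)
    w = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    b = n * means.var(ddof=1)               # between-chain variance
    var_plus = (n - 1) / n * w + b / n      # pooled variance estimate
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                   # 4 chains on the same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain stuck elsewhere
```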
R-Learner (Chapter 19)
A meta-learner for CATE estimation that directly targets the treatment effect residual using a loss function based on the Robinson decomposition. The most theoretically grounded meta-learner. See also: meta-learner, CATE.
RAG (Retrieval-Augmented Generation) (Chapter 11)
An architecture that grounds LLM responses in retrieved documents by embedding a query, retrieving relevant chunks from a vector database, and including them in the LLM prompt. Reduces hallucination and enables knowledge updates without retraining. See also: vector database, embedding, hallucination.
Receptive Field (Chapter 8)
The region of the input that influences a particular neuron's output in a CNN. Deeper layers have larger receptive fields due to successive convolutions and pooling operations. See also: convolution, pooling.
Regression Discontinuity (RD) (Chapter 18)
A quasi-experimental design that exploits a sharp cutoff in treatment assignment based on a continuous running variable. Units just above and below the cutoff are compared, approximating a local randomized experiment. See also: sharp RD, fuzzy RD.
Regularization (Chapter 7)
Techniques that constrain model complexity to prevent overfitting. Includes explicit methods (L1/L2 penalty, dropout, weight decay) and implicit methods (early stopping, data augmentation, batch normalization). See also: dropout, weight decay, early stopping.
ReLU (Chapter 6)
Rectified Linear Unit. An activation function defined as f(x) = max(0, x). Computationally efficient and avoids the vanishing gradient problem for positive inputs, but can cause dead neurons. See also: activation function, GELU, dead neuron.
Reparameterization Trick (Chapter 12)
A technique for backpropagating through stochastic sampling by expressing the random variable as a deterministic function of parameters and independent noise: z = μ + σ * ε, where ε ~ N(0,1). Enables gradient-based optimization of the VAE objective. See also: VAE, ELBO.
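The trick itself is one line; a NumPy sketch showing that z = μ + σε has the intended distribution while being a deterministic function of the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
eps = rng.normal(size=100_000)   # noise sampled independently of the parameters
z = mu + sigma * eps             # z ~ N(mu, sigma^2), yet deterministic given eps
# Because z is a deterministic function of (mu, sigma),
# dz/dmu = 1 and dz/dsigma = eps, so gradients of E[f(z)] flow through the sample.
```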
Residual Connection (Chapter 8)
A shortcut connection that adds a layer's input directly to its output: y = F(x) + x. Enables training of very deep networks by providing a gradient highway. Introduced by ResNet. See also: skip connection, gradient flow.
Roofline Model (Chapter 5)
A performance model that characterizes the peak achievable performance of a computation based on its arithmetic intensity and the hardware's peak compute and memory bandwidth. Operations whose arithmetic intensity falls below the ridge point are memory-bound; those above it are compute-bound. See also: arithmetic intensity, FLOPs.
RoPE (Rotary Positional Embeddings) (Chapter 10)
A positional encoding method that encodes position information by rotating the query and key vectors. Naturally decays attention with distance and supports extrapolation to longer sequences than seen during training. See also: positional encoding, transformer.

S

Saddle Point (Chapter 2)
A point in the loss landscape where the gradient is zero but the point is neither a local minimum nor a local maximum — the Hessian has both positive and negative eigenvalues. Common in high-dimensional optimization and more problematic than local minima. See also: Hessian, loss landscape.
Scaled Dot-Product Attention (Chapter 10)
The core attention operation: Attention(Q, K, V) = softmax(QK^T / √d_k) V. Scaling by √d_k prevents the dot products from growing too large for high-dimensional keys. See also: multi-head attention, self-attention.
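The formula maps directly to a few lines of NumPy (single head, no masking, shown for illustration):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a row-wise numerically stable softmax.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = [rng.normal(size=(5, 8)) for _ in range(3)]  # 5 positions, d_k = 8
out, w = attention(Q, K, V)   # each row of w is a distribution over the 5 keys
```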
Schema Evolution (Chapter 25)
The process of changing a data schema (adding columns, renaming fields, changing types) while maintaining backward and forward compatibility with existing consumers. See also: data contract, schema registry.
Score Matching (Chapter 12)
A technique for learning the score function (gradient of the log-density) of a data distribution. The foundation of score-based diffusion models: learn the score, then use Langevin dynamics for sampling. See also: diffusion model, DDPM.
Selection Bias (Chapter 16)
Bias in causal estimates arising from systematic differences between treated and untreated groups. In observational studies, treated units are typically selected in ways that correlate with the outcome. See also: confounding, propensity score.
Self-Attention (Chapter 10)
An attention mechanism where queries, keys, and values all come from the same sequence, allowing each position to attend to all other positions. The fundamental operation of the transformer architecture. See also: scaled dot-product attention, multi-head attention.
Sequential Testing (Chapter 33)
Statistical testing methods designed for continuous monitoring, where the analyst can examine results at any point during data collection without inflating the false positive rate. See also: mSPRT, peeking.
Shadow Mode (Chapter 29)
A deployment strategy where a new model receives production traffic and generates predictions, but those predictions are not served to users. Allows evaluation on real traffic without risk. See also: canary deployment.
SHAP (Chapter 35)
SHapley Additive exPlanations. An interpretability method based on Shapley values from cooperative game theory. Assigns each feature an importance value for a particular prediction, with guarantees of local accuracy, missingness, and consistency. See also: LIME, Shapley value.
Shrinkage (Chapter 21)
The statistical phenomenon in hierarchical models where group-level estimates are pulled ("shrunk") toward the global mean. Groups with less data experience more shrinkage. See also: partial pooling, hierarchical model.
Simpson's Paradox (Chapter 15)
A phenomenon where a trend that appears in several subgroups reverses or disappears when the groups are combined. Often caused by confounding variables. A canonical motivation for causal thinking. See also: confounding, collider.
Singular Value Decomposition (SVD) (Chapter 1)
The factorization of any m×n matrix A into A = UΣV^T, where U and V are orthogonal matrices and Σ contains the singular values on its diagonal. The "fundamental theorem of data science" — connects to PCA, low-rank approximation, and matrix completion. See also: eigendecomposition, matrix factorization.
Spectral Graph Theory (Chapter 14)
The study of graph properties through the eigenvalues and eigenvectors of associated matrices (adjacency, Laplacian). Graph convolution in GCNs has a spectral interpretation as filtering in the graph Fourier domain. See also: Laplacian matrix, GCN.
State-Space Model (Chapter 23)
A model with a latent state that evolves over time according to a transition model, observed through a noisy observation model. General framework encompassing Kalman filters, HMMs, and structural time series models. See also: Kalman filter, HMM.
Structural Causal Model (SCM) (Chapter 17)
A formal framework consisting of structural equations, exogenous variables, and a causal graph that defines the data-generating process. Supports reasoning about interventions and counterfactuals. See also: causal DAG, do-operator.
Sufficient Statistic (Chapter 3)
A statistic T(X) that captures all the information in the data X about a parameter θ. Formally, the likelihood conditioned on T(X) does not depend on θ. See also: exponential family, Fisher information.
Surrogate Model (Chapter 22)
A probabilistic model (typically a Gaussian process) that approximates an expensive objective function in Bayesian optimization. Cheap to evaluate and provides uncertainty estimates. See also: Gaussian process, Bayesian optimization.
SUTVA (Chapter 16)
Stable Unit Treatment Value Assumption. Two components: (1) no interference — one unit's treatment does not affect another's outcome; (2) consistency — the treatment is well-defined with no hidden variations. See also: potential outcomes, fundamental problem.
Synthetic Control Method (Chapter 33)
A causal inference method that constructs a "synthetic" comparison unit as a weighted combination of untreated units. Used when only one (or few) treated units exist, such as the effect of a policy change on a single state or country. See also: DiD.

T

Temperature Scaling (Chapter 34)
A post-hoc calibration method that divides model logits by a learned scalar temperature T before the softmax. T > 1 softens probabilities (reducing overconfidence); T < 1 sharpens them. See also: calibration, ECE.
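The effect of T on the softmax in a minimal NumPy sketch (in practice T is fit on a held-out validation set by minimizing negative log-likelihood):

```python
import numpy as np

def softmax_t(logits, T=1.0):
    # Softmax of logits / T, with max-subtraction for numerical stability.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]
p_raw = softmax_t(logits, T=1.0)
p_soft = softmax_t(logits, T=2.0)    # T > 1: flatter, less confident
p_sharp = softmax_t(logits, T=0.5)   # T < 1: peakier, more confident
```

Note that scaling by T never changes the argmax, so accuracy is unaffected; only the confidence changes.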
Temporal Fusion Transformer (TFT) (Chapter 23)
A transformer architecture for multi-horizon time series forecasting with interpretable attention. Features variable selection networks, gated residual connections, and temporal attention that identifies which time steps matter most. See also: transformer, probabilistic forecast.
Tensor Parallelism (Chapter 26)
A distributed training strategy that splits individual operations (e.g., large matrix multiplies) across GPUs. Each GPU computes a portion of the result. Requires high-bandwidth inter-GPU communication. See also: model parallelism, pipeline parallelism.
Thompson Sampling (Chapter 22)
A Bayesian approach to the multi-armed bandit problem: sample a reward estimate from each arm's posterior distribution and select the arm with the highest sample. Naturally balances exploration and exploitation. See also: multi-armed bandit, UCB.
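A self-contained NumPy sketch for Bernoulli arms with Beta(1, 1) priors (the arm rates and round count are illustrative): the sampler concentrates its pulls on the best arm without any explicit exploration schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.3, 0.5, 0.7])   # Bernoulli arms, unknown to the agent
wins = np.ones(3)                        # Beta(1, 1) prior pseudo-counts
losses = np.ones(3)

for _ in range(5000):
    theta = rng.beta(wins, losses)       # one draw from each arm's posterior
    arm = int(np.argmax(theta))          # play the arm whose draw is best
    reward = rng.random() < true_rates[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward

pulls = wins + losses - 2                # prior pseudo-counts removed: actual plays
```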
Tokenizer (Chapter 11)
An algorithm that converts text into a sequence of integer token IDs. Common approaches include Byte-Pair Encoding (BPE) and SentencePiece. Vocabulary size affects model capacity and sequence length. See also: BPE, language model.
Training-Serving Skew (Chapter 24)
Discrepancies between how features are computed during training and during serving, causing model performance to degrade in production despite good offline metrics. The most common source of production ML bugs. See also: online-offline consistency, feature store.
Transfer Learning (Chapter 13)
Leveraging knowledge from a model trained on one task (source) to improve performance on a related task (target). Strategies range from feature extraction (freeze all layers) to full fine-tuning. See also: fine-tuning, foundation model.
Transformer (Chapter 10)
A neural network architecture based entirely on attention mechanisms (no recurrence, no convolution). Processes all positions in parallel using self-attention, enabling efficient training on GPUs. The dominant architecture for NLP, increasingly for vision and other domains. See also: self-attention, multi-head attention.
Tree-Structured Parzen Estimator (TPE) (Chapter 22)
A Bayesian optimization algorithm that models the conditional densities p(x | y &lt; y*) and p(x | y ≥ y*), where y* is a quantile threshold on the observed objective values, rather than modeling p(y|x) directly. The default algorithm in Optuna. See also: Bayesian optimization, Optuna.
Two-Stage Least Squares (2SLS) (Chapter 18)
The standard instrumental variable estimator. Stage 1: regress treatment on instrument. Stage 2: regress outcome on the fitted treatment values from Stage 1. See also: instrumental variable, exclusion restriction.
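A numerical sketch of the two stages on synthetic data (the data-generating process is invented, with the true effect set to 2.0 and a confounder u that biases naive OLS):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                    # instrument
u = rng.normal(size=n)                    # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)      # treatment (confounded by u)
y = 2.0 * x + u + rng.normal(size=n)      # outcome; true effect = 2.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

Z = np.column_stack([np.ones(n), z])
x_hat = Z @ ols(Z, x)                     # stage 1: fitted treatment values
X2 = np.column_stack([np.ones(n), x_hat])
beta_2sls = ols(X2, y)[1]                 # stage 2: causal effect estimate
beta_ols = ols(np.column_stack([np.ones(n), x]), y)[1]  # confounded estimate
```

Here beta_2sls recovers a value near 2.0 while beta_ols is biased upward, since u enters both treatment and outcome with the same sign. (Note: the naive stage-2 standard errors are wrong; real 2SLS software corrects them.)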
Two-Tower Model (Chapter 13)
A neural architecture for retrieval that encodes queries and items with separate "tower" networks into a shared embedding space. Similarity (typically dot product) between embeddings determines relevance. See also: embedding, contrastive learning.
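A toy sketch of the scoring pattern with one random (untrained) linear layer per tower; the dimensions and names are invented, and unit-normalizing the embeddings makes the dot product a cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_emb = 16, 8
W_query = rng.normal(size=(d_model, d_emb)) / np.sqrt(d_model)  # query tower
W_item = rng.normal(size=(d_model, d_emb)) / np.sqrt(d_model)   # item tower

def embed(x, W):
    e = x @ W
    return e / np.linalg.norm(e, axis=-1, keepdims=True)  # unit-norm embeddings

query = rng.normal(size=(1, d_model))
items = rng.normal(size=(100, d_model))
scores = embed(query, W_query) @ embed(items, W_item).T   # dot-product relevance
top_k = np.argsort(-scores[0])[:5]                        # top-5 candidates
```

The key property is that item embeddings depend only on the item tower, so they can be precomputed and indexed for approximate nearest neighbor search at serving time.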

U

Universal Approximation Theorem (Chapter 6)
The theorem stating that a feedforward neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact set to arbitrary accuracy. Does not guarantee learnability or sample efficiency. See also: MLP.
Uplift Modeling (Chapter 19)
A technique for predicting the incremental impact of a treatment (e.g., marketing campaign) on an individual's behavior. Identifies "persuadables" (positive uplift) and "sleeping dogs" (negative uplift). See also: CATE, Qini curve.
Upper Confidence Bound (UCB) (Chapter 22)
A bandit algorithm that selects the arm with the highest upper confidence bound: UCB_i = μ̂_i + c√(ln t / n_i), where μ̂_i is arm i's empirical mean reward, n_i its pull count, and t the total number of plays. Implements "optimism in the face of uncertainty." Also used as an acquisition function in Bayesian optimization. See also: Thompson sampling, multi-armed bandit.
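A sketch of the selection rule (the function name and toy state are our own):

```python
import math

def ucb_select(means, counts, t, c=2.0):
    """Pick the arm maximizing mu_hat + c * sqrt(ln t / n). Sketch only;
    arms that have never been played are selected first."""
    if 0 in counts:
        return counts.index(0)           # force one pull per arm first
    scores = [mu + c * math.sqrt(math.log(t) / n)
              for mu, n in zip(means, counts)]
    return scores.index(max(scores))

# toy state: arm 1 has the highest mean, but arm 2 is barely explored,
# so its confidence bonus wins
arm = ucb_select(means=[0.4, 0.6, 0.5], counts=[50, 50, 2], t=102)
```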

V

Vanishing Gradient (Chapter 6)
A training pathology where gradients shrink exponentially during backpropagation through many layers, causing early layers to learn extremely slowly. Mitigated by residual connections, proper initialization, normalization, and gating mechanisms (LSTM). See also: exploding gradient, residual connection.
Variational Autoencoder (VAE) (Chapter 12)
A generative model that learns a latent representation by jointly training an encoder (maps data to latent distribution) and a decoder (maps latent samples to data), optimizing the ELBO. See also: ELBO, reparameterization trick, latent space.
Variational Inference (Chapter 4)
An optimization-based approach to approximate Bayesian inference that frames posterior computation as an optimization problem: find the distribution q(θ) that minimizes KL(q || p(θ|D)). Faster than MCMC but provides only an approximation. See also: ELBO, KL divergence, MCMC.
Vector Database (Chapter 11)
A database optimized for storing and querying high-dimensional vector embeddings. Supports approximate nearest neighbor search for retrieval-augmented generation. Examples include FAISS, Pinecone, Weaviate, and ChromaDB. See also: RAG, ANN, embedding.

W

Walk-Forward Validation (Chapter 23)
A time series cross-validation strategy that trains on data up to time t, evaluates on data from t to t+h, then advances the training window. Respects temporal ordering and avoids data leakage. See also: backtesting, time series.
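A sketch of the split generator (expanding-window variant; the function and parameter names are our own):

```python
def walk_forward_splits(n, initial_train, horizon, step=None):
    """Yield (train_idx, test_idx) pairs that respect temporal order:
    train on [0, t), test on [t, t + horizon), then advance t."""
    step = step or horizon
    t = initial_train
    while t + horizon <= n:
        yield list(range(t)), list(range(t, t + horizon))
        t += step

splits = list(walk_forward_splits(n=10, initial_train=6, horizon=2))
# first split trains on [0..5] and tests on [6, 7];
# second trains on [0..7] and tests on [8, 9]
```

A rolling-window variant would instead drop the oldest observations as t advances, which can help when the series is non-stationary.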
Wasserstein GAN (WGAN) (Chapter 12)
A GAN variant that uses the Wasserstein distance (Earth Mover's distance) as the training objective instead of the Jensen-Shannon divergence. Provides more stable training gradients and a meaningful loss metric that correlates with sample quality. See also: GAN, mode collapse.
Weak Instrument (Chapter 18)
An instrumental variable with only a weak association with the treatment variable. Weak instruments amplify any violations of the exclusion restriction and produce biased, imprecise estimates. Diagnosed by the first-stage F-statistic (rule of thumb: F > 10). See also: instrumental variable, 2SLS.
Weight Decay (Chapter 7)
A regularization technique that penalizes large parameter values by adding a fraction of the current weight magnitude to the gradient update. In AdamW, weight decay is decoupled from the gradient, which is more correct for adaptive optimizers. See also: L2 regularization, AdamW.
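A single-step sketch of the decoupled update (hyperparameter defaults are illustrative, not prescriptive): the decay term lr · wd · w acts on the weights directly and never enters the adaptive moment estimates, unlike L2 regularization folded into the gradient.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update (sketch). The decay `lr * wd * w` is applied to
    the weights directly, not added to `grad` before the moment updates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)               # bias-corrected first moment
    v_hat = v / (1 - b2**t)               # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# with a zero gradient the update reduces to pure decay: w <- w * (1 - lr * wd)
w, m, v = adamw_step(np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1), t=1)
```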
Weisfeiler-Leman Test (Chapter 14)
A graph isomorphism test based on iterative neighborhood aggregation. Standard message-passing GNNs are at most as expressive as the 1-WL test, meaning they cannot distinguish certain non-isomorphic graphs. See also: GIN, message passing, over-smoothing.
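A sketch of 1-WL color refinement on a classic pair it fails to distinguish: two triangles versus a 6-cycle, both 2-regular on six nodes (the adjacency-dict representation and helper name are our own):

```python
from collections import Counter

def wl_colors(adj, iters=3):
    """1-WL color refinement (sketch): repeatedly re-color each node by
    hashing its own color together with the sorted multiset of its
    neighbors' colors, then return the final color histogram."""
    colors = {v: 1 for v in adj}
    for _ in range(iters):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in adj}
    return Counter(colors.values())

# two non-isomorphic 2-regular graphs on 6 nodes
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

Because every node in both graphs is 2-regular, refinement never splits the color classes, so the two histograms come out identical and 1-WL (and hence any standard message-passing GNN) cannot tell the graphs apart.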

X

X-Learner (Chapter 19)
A meta-learner for CATE estimation that imputes the missing potential outcomes for each unit, then trains a model on the imputed treatment effects. Performs well when treatment and control groups differ in size. See also: meta-learner, T-learner.
Xavier/Glorot Initialization (Chapter 6)
A weight initialization strategy that sets initial weights from a distribution with variance 2 / (n_in + n_out). Maintains constant variance of activations across layers for linear and tanh activations. See also: He initialization, weight initialization.
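A sketch of the uniform variant (the helper name is our own): a Uniform(−a, a) draw has variance a²/3, so a = √(6 / (n_in + n_out)) yields the target variance 2 / (n_in + n_out).

```python
import numpy as np

def xavier_uniform(n_in, n_out, seed=0):
    """Xavier/Glorot uniform init: Var(W) = 2 / (n_in + n_out)."""
    a = np.sqrt(6.0 / (n_in + n_out))   # Uniform(-a, a) has variance a^2 / 3
    return np.random.default_rng(seed).uniform(-a, a, size=(n_in, n_out))

W = xavier_uniform(256, 128)
```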

Z

Zero-Shot (Chapter 13)
Using a model to perform a task it was not explicitly trained for, without providing any task-specific examples. Enabled by foundation models trained on diverse data. See also: few-shot, foundation model.