Appendix F: Notation Guide
This appendix provides a complete reference for the mathematical notation conventions used throughout this textbook. When a symbol appears in multiple contexts, the intended meaning should be clear from the surrounding discussion. Appendix A provides the underlying mathematical definitions for many of these quantities.
F.1 Scalars, Vectors, Matrices, and Tensors
| Convention | Examples | Description |
|---|---|---|
| Lowercase italic | x, y, n, alpha, lambda | Scalars (single numbers) |
| Bold lowercase | x, w, b, h, z | Vectors (assumed column vectors) |
| Bold uppercase | W, X, A, K, V, Q | Matrices |
| Bold uppercase script or calligraphic | T, X | Higher-order tensors (3D and above) |
| Uppercase italic | N, D, T, L | Scalar constants (dataset size, dimensionality, sequence length, number of layers) |
Element Access
| Notation | Meaning |
|---|---|
| x_i | The i-th element of vector x |
| W_{ij} | Element in row i, column j of matrix W |
| X_{b,i,j} | Element of a 3D tensor, indexed by batch b, row i, column j |
| x_{1:t} | The subsequence of elements from index 1 through t |
| X_{i,:} | The i-th row of matrix X (as a row vector) |
| X_{:,j} | The j-th column of matrix X (as a column vector) |
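These indexing conventions correspond directly to array slicing in code. Below is a small illustrative sketch using NumPy (an assumption on our part; note that the book's notation is one-indexed, while NumPy is zero-indexed, so x_i corresponds to x[i - 1]):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])   # a vector x
X = np.arange(12).reshape(3, 4)    # a 3x4 matrix X

x_2 = x[1]         # x_2: the 2nd element of x (0-indexed as x[1])
X_23 = X[1, 2]     # X_{23}: element in row 2, column 3
row_1 = X[0, :]    # X_{1,:}: the first row of X
col_3 = X[:, 2]    # X_{:,3}: the third column of X
prefix = x[0:2]    # x_{1:2}: the subsequence from index 1 through 2
```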
F.2 Sets and Spaces
| Notation | Meaning |
|---|---|
| R | The set of real numbers |
| R^n | n-dimensional real vector space |
| R^{m x n} | The space of m-by-n real matrices |
| Z | The set of integers |
| Z_+ | The set of non-negative integers: {0, 1, 2, ...} |
| N | The set of natural numbers |
| {0, 1} | A finite set specified by listing its elements (here, the binary set) |
| \|S\| | Cardinality (number of elements) of set S |
| x in S | x is an element of set S |
| S subset T | S is a subset of T |
| S union T | Union of sets S and T |
| S intersection T | Intersection of sets S and T |
| ∅ | The empty set |
| D | A dataset: D = {(x_1, y_1), ..., (x_N, y_N)} |
| D_train, D_val, D_test | Training, validation, and test splits of the dataset |
| X | The input space |
| Y | The output (label) space |
| V | Vocabulary set (for language models) |
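For concreteness, here is a minimal sketch of the set and dataset notation in Python; the set contents and the 60/20/20 split are arbitrary illustrative choices:

```python
S = {0, 1, 2}
T = {0, 1, 2, 3, 4}

len(S)     # |S|: cardinality -> 3
1 in S     # x in S -> True
S <= T     # S subset T -> True
S | T      # S union T -> {0, 1, 2, 3, 4}
S & T      # S intersection T -> {0, 1, 2}

# A dataset D of N = 10 (x, y) pairs, split into train/val/test.
D = [(i, i % 2) for i in range(10)]
D_train, D_val, D_test = D[:6], D[6:8], D[8:]
```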
F.3 Probability and Statistics
| Notation | Meaning |
|---|---|
| P(A) | Probability of event A |
| P(A \| B) | Conditional probability of A given B |
| p(x) | Probability density function (continuous) or probability mass function (discrete) |
| p(x \| theta) | Density/mass parameterized by theta |
| p_theta(x) | Alternative notation for parameterized distribution |
| X ~ p | Random variable X follows distribution p |
| X ~ N(mu, sigma^2) | X follows a Gaussian with mean mu and variance sigma^2 |
| E[X] or E_p[X] | Expected value of X (under distribution p) |
| Var(X) | Variance of X |
| Cov(X, Y) | Covariance of X and Y |
| sigma | Standard deviation |
| sigma^2 | Variance |
| mu | Mean |
| rho | Correlation coefficient |
| H(X) | Entropy of X |
| H(p, q) | Cross-entropy between distributions p and q |
| D_KL(p \|\| q) | KL divergence from q to p: E_p[log(p/q)] (asymmetric in p and q) |
| I(X; Y) | Mutual information between X and Y |
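As a worked sketch of the information-theoretic quantities for two discrete distributions (the probability values are made up for illustration), recall that cross-entropy decomposes as H(p, q) = H(p) + D_KL(p || q):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])    # distribution p
q = np.array([0.4, 0.4, 0.2])      # distribution q

H_p  = -np.sum(p * np.log(p))      # H(p): entropy of p
H_pq = -np.sum(p * np.log(q))      # H(p, q): cross-entropy
D_kl =  np.sum(p * np.log(p / q))  # D_KL(p || q)

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds exactly:
assert np.isclose(H_pq, H_p + D_kl)
```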
F.4 Linear Algebra Operations
| Notation | Meaning |
|---|---|
| x^T | Transpose of vector x |
| W^T | Transpose of matrix W |
| W^{-1} | Inverse of matrix W |
| W^+ | Moore-Penrose pseudoinverse of W |
| I or I_n | Identity matrix (of size n if specified) |
| 0 | Zero vector or zero matrix (size inferred from context) |
| x · y or x^T y | Dot product (inner product) of x and y |
| x y^T | Outer product of x and y |
| x ⊙ y | Hadamard (element-wise) product of x and y |
| A ⊗ B | Kronecker product of A and B |
| \|\|x\|\|_2 | L2 (Euclidean) norm: sqrt(sum_i x_i^2) |
| \|\|x\|\|_1 | L1 (Manhattan) norm: sum_i \|x_i\| |
| \|\|x\|\|_p | Lp norm: (sum_i \|x_i\|^p)^{1/p} |
| \|\|A\|\|_F | Frobenius norm: sqrt(sum_{i,j} A_{ij}^2) |
| det(A) | Determinant of A |
| tr(A) | Trace of A: sum of diagonal elements |
| rank(A) | Rank of A |
| diag(v) | Diagonal matrix with entries from vector v |
| diag(A) | Vector of diagonal elements of matrix A |
| lambda_i | Eigenvalue (i-th) |
| sigma_i | Singular value (i-th) |
| U, Sigma, V | Matrices in the singular value decomposition (SVD): A = U Sigma V^T |
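The following sketch maps several of these operations onto NumPy calls (the matrix and vector values are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
x = np.array([3.0, -4.0])

np.linalg.norm(x, 2)       # ||x||_2 -> 5.0
np.linalg.norm(x, 1)       # ||x||_1 -> 7.0
np.linalg.norm(A, 'fro')   # ||A||_F, Frobenius norm
np.linalg.det(A)           # det(A) -> 5.0
np.trace(A)                # tr(A) -> 5.0
np.linalg.matrix_rank(A)   # rank(A) -> 2

# Singular value decomposition A = U Sigma V^T
# (NumPy returns V^T directly, and Sigma as a vector of singular values)
U, S, Vt = np.linalg.svd(A)
assert np.allclose(A, U @ np.diag(S) @ Vt)
```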
F.5 Calculus and Optimization
| Notation | Meaning |
|---|---|
| f'(x) or df/dx | Derivative of f with respect to x |
| partial f / partial x | Partial derivative of f with respect to x |
| nabla f or nabla_x f | Gradient of f with respect to x |
| nabla^2 f or H | Hessian matrix of second derivatives |
| J or J_f | Jacobian matrix of vector-valued function f |
| argmin_x f(x) | Value of x that minimizes f(x) |
| argmax_x f(x) | Value of x that maximizes f(x) |
| x^\* or theta^\* | Optimal value of x or theta |
| eta | Learning rate |
| theta | Model parameters (generic) |
| theta_t | Parameters at optimization step t |
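Tying the optimization notation together, here is a minimal gradient-descent sketch on the toy objective f(theta) = (theta - 3)^2, whose minimizer is theta^* = 3; the objective, learning rate, and step count are all illustrative assumptions:

```python
def grad_f(theta):
    # nabla f for f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

eta = 0.1      # learning rate eta
theta = 0.0    # initial parameters theta_0

# Update rule: theta_{t+1} = theta_t - eta * nabla f(theta_t)
for t in range(100):
    theta = theta - eta * grad_f(theta)

print(theta)   # converges toward the optimum theta^* = 3
```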
F.6 Neural Network Notation
| Notation | Meaning |
|---|---|
| f_theta or f(x; theta) | Neural network function parameterized by theta |
| W^{(l)} | Weight matrix of layer l |
| b^{(l)} | Bias vector of layer l |
| h^{(l)} | Hidden representation (activation) at layer l |
| a^{(l)} | Pre-activation value at layer l (before applying the activation function) |
| sigma(.) | Activation function (generic); also used for sigmoid when context is clear |
| L or J | Loss function |
| L(theta) | Loss as a function of parameters |
| L_CE | Cross-entropy loss |
| L_MSE | Mean squared error loss |
| y_hat or hat{y} | Model prediction |
| y | True label / ground truth |
| e_i | One-hot vector with 1 in position i |
| p_theta(y \| x) | Model's predicted probability of y given input x |
| L | Number of layers |
| d | Dimensionality of representations (generic) |
| d_model | Model dimension (transformer hidden size) |
| d_k | Key/query dimension in attention |
| d_v | Value dimension in attention |
| d_ff | Feed-forward network inner dimension |
| h | Number of attention heads |
| N | Number of training examples |
| T | Sequence length |
| B | Batch size |
| V | Vocabulary size (when used in language modeling context) |
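As an illustrative sketch of the layer notation, the forward pass of a single hidden layer in NumPy; the dimensions and the choice of ReLU as the activation sigma are arbitrary assumptions:

```python
import numpy as np

d_in, d = 8, 16                    # input and hidden dimensionality
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d, d_in))    # W^{(1)}: weight matrix of layer 1
b1 = np.zeros(d)                   # b^{(1)}: bias vector of layer 1
x = rng.normal(size=d_in)          # input vector x

a1 = W1 @ x + b1                   # a^{(1)}: pre-activation at layer 1
h1 = np.maximum(a1, 0.0)           # h^{(1)} = sigma(a^{(1)}), with ReLU as sigma
```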
F.7 Transformer-Specific Notation
| Notation | Meaning |
|---|---|
| Q, K, V | Query, Key, Value matrices in attention |
| W_Q, W_K, W_V | Learned projection matrices for Q, K, V |
| W_O | Output projection matrix in multi-head attention |
| Attn(Q, K, V) | Attention function output |
| MHA(X) | Multi-head attention output |
| FFN(x) | Feed-forward network: typically W_2 * activation(W_1 * x + b_1) + b_2 |
| PE(x, pos) | Positional encoding added to token embedding at position pos |
| [CLS] | Special classification token (BERT-style) |
| [MASK] | Masked token placeholder (BERT-style) |
| [BOS], [EOS] | Beginning/end of sequence tokens |
| [PAD] | Padding token |
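A minimal sketch of single-head scaled dot-product attention, Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with the batch dimension omitted; the sequence length and dimensions below are illustrative assumptions:

```python
import numpy as np

T, d_model, d_k, d_v = 5, 32, 8, 8       # sequence length and dimensions
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))        # one token representation per row
W_Q = rng.normal(size=(d_model, d_k))    # learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)          # (T, T) scaled similarity scores
scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ V                        # Attn(Q, K, V), shape (T, d_v)
```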
F.8 Greek Letters and Their Common Uses
| Letter | Name | Common Use in This Book |
|---|---|---|
| α | alpha | Learning rate, significance level, smoothing parameter, mixture weight |
| β | beta | Momentum coefficient, inverse temperature, Beta distribution parameter |
| γ | gamma | Discount factor, scale parameter in normalization |
| δ | delta | Small perturbation, Kronecker delta, change/difference |
| ε | epsilon | Small positive constant (e.g., for numerical stability), noise term, exploration rate |
| ζ | zeta | (Rarely used standalone) |
| η | eta | Learning rate (primary notation in this book) |
| θ | theta | Model parameters (the most common use), angle |
| ι | iota | (Rarely used) |
| κ | kappa | Condition number, Cohen's kappa |
| λ | lambda | Regularization strength, Poisson rate parameter, eigenvalue |
| μ | mu | Mean of a distribution, learning rate in some contexts |
| ν | nu | Degrees of freedom (t-distribution, chi-square) |
| ξ | xi | Random noise variable, auxiliary variable |
| π | pi | The constant 3.14159...; also policy function in RL |
| ρ | rho | Correlation coefficient, spectral radius |
| σ | sigma | Standard deviation, sigmoid function, singular value, activation function (generic) |
| τ | tau | Temperature parameter, time constant, soft update rate |
| υ | upsilon | (Rarely used) |
| φ | phi | Feature function, angle, Gaussian CDF Φ(x), model parameters (alternative to theta) |
| χ | chi | Chi-square distribution |
| ψ | psi | Auxiliary function, feature map |
| ω | omega | Frequency, angular velocity; Ω for the sample space |
F.9 Subscripts and Superscripts Convention
| Convention | Meaning | Example |
|---|---|---|
| Subscript i, j, k | Index into a vector, matrix, or set | x_i, W_{ij} |
| Subscript t | Time step or token position | h_t, x_t |
| Superscript (l) | Layer index (in parentheses to distinguish from exponent) | W^{(l)}, h^{(l)} |
| Superscript T | Transpose | x^T, W^T |
| Superscript -1 | Inverse | A^{-1} |
| Superscript * | Optimal value | theta^\*, x^\* |
| Hat accent | Predicted / estimated value | y_hat, theta_hat |
| Bar accent | Mean / average | x_bar |
| Tilde accent | Modified or approximate version | theta_tilde, x_tilde |
F.10 Common Abbreviations in Equations
| Abbreviation | Expansion |
|---|---|
| s.t. | Subject to (in optimization constraints) |
| i.i.d. | Independent and identically distributed |
| w.r.t. | With respect to |
| a.s. | Almost surely |
| w.p. | With probability |
| iff | If and only if |
| WLOG | Without loss of generality |
| LHS / RHS | Left-hand side / Right-hand side |