Appendix F: Notation Guide

This appendix provides a complete reference for the mathematical notation conventions used throughout this textbook. When a symbol appears in multiple contexts, the intended meaning should be clear from the surrounding discussion. Appendix A provides the underlying mathematical definitions for many of these quantities.


F.1 Scalars, Vectors, Matrices, and Tensors

Convention Examples Description
Lowercase italic x, y, n, α, λ Scalars (single numbers)
Bold lowercase x, w, b, h, z Vectors (assumed column vectors)
Bold uppercase W, X, A, K, V, Q Matrices
Bold uppercase script or calligraphic 𝒯, 𝒳 Higher-order tensors (3D and above)
Uppercase italic N, D, T, L Scalar constants (dataset size, dimensionality, sequence length, number of layers)

Element Access

Notation Meaning
x_i The i-th element of vector x
W_{ij} Element in row i, column j of matrix W
X_{b,i,j} Element of a 3D tensor, indexed by batch b, row i, column j
x_{1:t} The subsequence of elements from index 1 through t
X_{i,:} The i-th row of matrix X (as a row vector)
X_{:,j} The j-th column of matrix X (as a column vector)
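
In NumPy, these conventions map directly onto array indexing. The following minimal sketch is illustrative only (the arrays and shapes are invented); note that NumPy indices are 0-based while the notation above is 1-based:

    import numpy as np

    x = np.arange(5.0)                    # a vector x with 5 elements
    W = np.arange(12.0).reshape(3, 4)     # a 3-by-4 matrix W
    X = np.arange(24.0).reshape(2, 3, 4)  # a 3D tensor indexed (batch, row, column)

    x_i   = x[1]        # x_2 (0-based index 1)
    W_ij  = W[0, 2]     # W_{13}: row 1, column 3
    X_bij = X[1, 0, 3]  # X_{2,1,4}: batch 2, row 1, column 4
    x_1t  = x[0:3]      # x_{1:3}: elements 1 through 3
    row_i = W[0, :]     # W_{1,:}: the first row
    col_j = W[:, 2]     # W_{:,3}: the third column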

F.2 Sets and Spaces

Notation Meaning
ℝ The set of real numbers
ℝ^n n-dimensional real vector space
ℝ^{m×n} The space of m-by-n real matrices
ℤ The set of integers
ℤ_+ The set of non-negative integers
ℕ The set of natural numbers
{0, 1} A specific finite set
|S| Cardinality (number of elements) of set S
x ∈ S x is an element of set S
S ⊆ T S is a subset of T
S ∪ T Union of sets S and T
S ∩ T Intersection of sets S and T
∅ The empty set
𝒟 A dataset: 𝒟 = {(x_1, y_1), ..., (x_N, y_N)}
𝒟_train, 𝒟_val, 𝒟_test Training, validation, and test splits of the dataset
𝒳 The input space
𝒴 The output (label) space
𝒱 Vocabulary set (for language models)
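
As an illustration, the dataset conventions above might be realized in code as follows; this is a minimal sketch, with invented sizes and an arbitrary 80/10/10 split:

    import numpy as np

    rng = np.random.default_rng(0)

    # 𝒟 = {(x_1, y_1), ..., (x_N, y_N)}: N (input, label) pairs
    N = 100
    D = [(rng.normal(size=3), int(rng.integers(0, 2))) for _ in range(N)]

    # Disjoint train/validation/test splits of the dataset
    perm = rng.permutation(N)
    D_train = [D[i] for i in perm[:80]]
    D_val   = [D[i] for i in perm[80:90]]
    D_test  = [D[i] for i in perm[90:]]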

F.3 Probability and Statistics

Notation Meaning
P(A) Probability of event A
P(A | B) Conditional probability of A given B
p(x) Probability density function (continuous) or probability mass function (discrete)
p(x | θ) Density/mass parameterized by θ
p_θ(x) Alternative notation for parameterized distribution
X ~ p Random variable X follows distribution p
X ~ 𝒩(μ, σ^2) X follows a Gaussian with mean μ and variance σ^2
E[X] or E_p[X] Expected value of X (under distribution p)
Var(X) Variance of X
Cov(X, Y) Covariance of X and Y
σ Standard deviation
σ^2 Variance
μ Mean
ρ Correlation coefficient
H(X) Entropy of X
H(p, q) Cross-entropy between distributions p and q
D_KL(p || q) KL divergence from q to p: E_p[log(p(x)/q(x))]
I(X; Y) Mutual information between X and Y
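
For discrete distributions, the information-theoretic quantities are straightforward to compute. A minimal NumPy sketch, using two invented distributions p and q:

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])   # a discrete distribution p
    q = np.array([0.4, 0.4, 0.2])     # a second distribution q

    H_p  = -np.sum(p * np.log(p))     # entropy H(p)
    H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
    D_kl = np.sum(p * np.log(p / q))  # D_KL(p || q) = H(p, q) - H(p)

    # KL divergence is asymmetric: D_KL(p || q) != D_KL(q || p) in general.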

F.4 Linear Algebra Operations

Notation Meaning
x^T Transpose of vector x
W^T Transpose of matrix W
W^{-1} Inverse of matrix W
W^+ Moore-Penrose pseudoinverse of W
I or I_n Identity matrix (of size n if specified)
0 Zero vector or zero matrix (size inferred from context)
x · y or x^T y Dot product (inner product) of x and y
x y^T Outer product of x and y
x ⊙ y Hadamard (element-wise) product
A ⊗ B Kronecker product of A and B
||x||_2 L2 (Euclidean) norm: sqrt(sum_i x_i^2)
||x||_1 L1 (Manhattan) norm: sum_i |x_i|
||x||_p Lp norm: (sum_i |x_i|^p)^{1/p}
||A||_F Frobenius norm: sqrt(sum_{i,j} A_{ij}^2)
det(A) Determinant of A
tr(A) Trace of A: sum of diagonal elements
rank(A) Rank of A
diag(v) Diagonal matrix with entries from vector v
diag(A) Vector of diagonal elements of matrix A
λ_i The i-th eigenvalue
σ_i The i-th singular value
U, Σ, V Matrices of the singular value decomposition (SVD): A = U Σ V^T
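
Most of these operations have direct NumPy counterparts. A minimal sketch with invented inputs:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    x = np.array([1.0, -2.0])
    y = np.array([0.5, 4.0])

    inner = x @ y                     # x · y, the dot product
    outer = np.outer(x, y)            # x y^T, the outer product
    hada  = x * y                     # x ⊙ y, the Hadamard product
    l2    = np.linalg.norm(x)         # ||x||_2
    l1    = np.linalg.norm(x, 1)      # ||x||_1
    frob  = np.linalg.norm(A, 'fro')  # ||A||_F
    detA  = np.linalg.det(A)          # det(A)
    trA   = np.trace(A)               # tr(A)
    U, S, Vt = np.linalg.svd(A)       # A = U Σ V^T; S holds the singular values σ_i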

F.5 Calculus and Optimization

Notation Meaning
f'(x) or df/dx Derivative of f with respect to x
∂f/∂x Partial derivative of f with respect to x
∇f or ∇_x f Gradient of f with respect to x
∇^2 f or H Hessian matrix of second derivatives
J or J_f Jacobian matrix of vector-valued function f
argmin_x f(x) Value of x that minimizes f(x)
argmax_x f(x) Value of x that maximizes f(x)
x^* or θ^* Optimal value of x or θ
η Learning rate
θ Model parameters (generic)
θ_t Parameters at optimization step t
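
These symbols combine in the basic gradient-descent update θ_{t+1} = θ_t - η ∇L(θ_t). A minimal sketch with an invented quadratic loss:

    import numpy as np

    target = np.array([1.0, -2.0])

    def L(theta):                     # a toy loss, minimized at θ^* = target
        return np.sum((theta - target) ** 2)

    def grad_L(theta):                # its gradient ∇L(θ)
        return 2.0 * (theta - target)

    eta = 0.1                         # learning rate η
    theta = np.zeros(2)               # initial parameters θ_0
    for t in range(100):
        theta = theta - eta * grad_L(theta)   # θ_{t+1} = θ_t - η ∇L(θ_t)
    # theta now approximates θ^* = argmin_θ L(θ)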

F.6 Neural Network Notation

Notation Meaning
f_θ or f(x; θ) Neural network function parameterized by θ
W^{(l)} Weight matrix of layer l
b^{(l)} Bias vector of layer l
h^{(l)} Hidden representation (activation) at layer l
a^{(l)} Pre-activation value at layer l (before applying the activation function)
σ(·) Activation function (generic); also used for sigmoid when context is clear
L or J Loss function
L(θ) Loss as a function of parameters
L_CE Cross-entropy loss
L_MSE Mean squared error loss
ŷ Model prediction
y True label / ground truth
e_i One-hot vector with 1 in position i
p_θ(y | x) Model's predicted probability of y given input x
L Number of layers
d Dimensionality of representations (generic)
d_model Model dimension (transformer hidden size)
d_k Key/query dimension in attention
d_v Value dimension in attention
d_ff Feed-forward network inner dimension
h Number of attention heads
N Number of training examples
T Sequence length
B Batch size
V Vocabulary size (when used in language modeling context)
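
A single fully connected layer ties several of these symbols together: a^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}, followed by h^{(l)} = σ(a^{(l)}). A minimal sketch (the dimensions and the ReLU choice of σ are arbitrary):

    import numpy as np

    def sigma(a):                   # activation function σ(·); ReLU here
        return np.maximum(0.0, a)

    rng = np.random.default_rng(0)
    d_in, d_out = 4, 3              # layer dimensions (invented)
    W_l = rng.normal(size=(d_out, d_in))  # W^{(l)}
    b_l = np.zeros(d_out)                 # b^{(l)}

    h_prev = rng.normal(size=d_in)  # h^{(l-1)}, the previous layer's output
    a_l = W_l @ h_prev + b_l        # a^{(l)}: pre-activation
    h_l = sigma(a_l)                # h^{(l)}: hidden representation at layer l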

F.7 Transformer-Specific Notation

Notation Meaning
Q, K, V Query, Key, Value matrices in attention
W_Q, W_K, W_V Learned projection matrices for Q, K, V
W_O Output projection matrix in multi-head attention
Attn(Q, K, V) Attention function output
MHA(X) Multi-head attention output
FFN(x) Feed-forward network: typically W_2 σ(W_1 x + b_1) + b_2
PE(x, pos) Positional encoding added to token embedding at position pos
[CLS] Special classification token (BERT-style)
[MASK] Masked token placeholder (BERT-style)
[BOS], [EOS] Beginning/end of sequence tokens
[PAD] Padding token
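
As a concrete reference for how Q, K, V, and d_k fit together, single-head scaled dot-product attention, Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, can be sketched as follows (all shapes and values are invented; batching, masking, and multiple heads are omitted):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    T, d_model, d_k, d_v = 5, 8, 4, 4       # sequence length and dimensions (invented)
    X = rng.normal(size=(T, d_model))       # token representations

    W_Q = rng.normal(size=(d_model, d_k))   # learned projection matrices
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # shape (T, d_v)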

F.8 Greek Letters and Their Common Uses

Letter Name Common Use in This Book
α alpha Learning rate, significance level, smoothing parameter, mixture weight
β beta Momentum coefficient, inverse temperature, Beta distribution parameter
γ gamma Discount factor, scale parameter in normalization
δ delta Small perturbation, Kronecker delta, change/difference
ε epsilon Small positive constant (e.g., for numerical stability), noise term, exploration rate
ζ zeta (Rarely used standalone)
η eta Learning rate (primary notation in this book)
θ theta Model parameters (the most common use), angle
ι iota (Rarely used)
κ kappa Condition number, Cohen's kappa
λ lambda Regularization strength, Poisson rate parameter, eigenvalue
μ mu Mean of a distribution, learning rate in some contexts
ν nu Degrees of freedom (t-distribution, chi-square)
ξ xi Random noise variable, auxiliary variable
π pi The constant 3.14159...; also policy function in RL
ρ rho Correlation coefficient, spectral radius
σ sigma Standard deviation, sigmoid function, singular value, activation function (generic)
τ tau Temperature parameter, time constant, soft update rate
υ upsilon (Rarely used)
φ phi Feature function, angle, Gaussian CDF Φ(x), model parameters (alternative to θ)
χ chi Chi-square distribution
ψ psi Auxiliary function, feature map
ω omega Frequency, angular velocity; Ω for sample space

F.9 Subscripts and Superscripts Convention

Convention Meaning Example
Subscript i, j, k Index into a vector, matrix, or set x_i, W_{ij}
Subscript t Time step or token position h_t, x_t
Superscript (l) Layer index (in parentheses to distinguish from exponent) W^{(l)}, h^{(l)}
Superscript T Transpose x^T, W^T
Superscript -1 Inverse A^{-1}
Superscript * Optimal value θ^*, x^*
Hat accent Predicted / estimated value ŷ, θ̂
Bar accent Mean / average x̄
Tilde accent Modified or approximate version θ̃, x̃

F.10 Common Abbreviations in Equations

Abbreviation Expansion
s.t. Subject to (in optimization constraints)
i.i.d. Independent and identically distributed
w.r.t. With respect to
a.s. Almost surely
w.p. With probability
iff If and only if
WLOG Without loss of generality
LHS / RHS Left-hand side / Right-hand side