Appendix F: Notation Guide

This appendix provides a complete reference for the mathematical notation conventions used throughout this textbook. When a symbol appears in multiple contexts, the intended meaning should be clear from the surrounding discussion. Appendix A provides the underlying mathematical definitions for many of these quantities.


F.1 Scalars, Vectors, Matrices, and Tensors

Convention Examples Description
Lowercase italic x, y, n, α, λ Scalars (single numbers)
Bold lowercase x, w, b, h, z Vectors (assumed column vectors)
Bold uppercase W, X, A, K, V, Q Matrices
Bold uppercase script or calligraphic 𝒯, 𝒳 Higher-order tensors (3D and above)
Uppercase italic N, D, T, L Scalar constants (dataset size, dimensionality, sequence length, number of layers)

Element Access

Notation Meaning
x_i The i-th element of vector x
W_{ij} Element in row i, column j of matrix W
X_{b,i,j} Element of a 3D tensor, indexed by batch b, row i, column j
x_{1:t} The subsequence of elements from index 1 through t
X_{i,:} The i-th row of matrix X (as a row vector)
X_{:,j} The j-th column of matrix X (as a column vector)
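
In NumPy, these conventions map directly onto array indexing. The following minimal sketch is illustrative only (the arrays and shapes are invented); note that NumPy indices are 0-based while the notation above is 1-based:

    import numpy as np

    x = np.arange(5.0)                    # a vector x with 5 elements
    W = np.arange(12.0).reshape(3, 4)     # a 3-by-4 matrix W
    X = np.arange(24.0).reshape(2, 3, 4)  # a 3D tensor indexed (batch, row, column)

    x_i   = x[1]        # x_2 (0-based index 1)
    W_ij  = W[0, 2]     # W_{13}: row 1, column 3
    X_bij = X[1, 0, 3]  # X_{2,1,4}: batch 2, row 1, column 4
    x_1t  = x[0:3]      # x_{1:3}: elements 1 through 3
    row_i = W[0, :]     # W_{1,:}: the first row
    col_j = W[:, 2]     # W_{:,3}: the third column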

F.2 Sets and Spaces

Notation Meaning
ℝ The set of real numbers
ℝ^n n-dimensional real vector space
ℝ^{m×n} The space of m-by-n real matrices
ℤ The set of integers
ℤ_+ The set of non-negative integers
ℕ The set of natural numbers
{0, 1} A specific finite set
|S| Cardinality (number of elements) of set S
x ∈ S x is an element of set S
S ⊆ T S is a subset of T
S ∪ T Union of sets S and T
S ∩ T Intersection of sets S and T
∅ The empty set
𝒟 A dataset: 𝒟 = {(x_1, y_1), ..., (x_N, y_N)}
𝒟_train, 𝒟_val, 𝒟_test Training, validation, and test splits of the dataset
𝒳 The input space
𝒴 The output (label) space
𝒱 Vocabulary set (for language models)
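
As an illustration, the dataset conventions above might be realized in code as follows; this is a minimal sketch, with invented sizes and an arbitrary 80/10/10 split:

    import numpy as np

    rng = np.random.default_rng(0)

    # 𝒟 = {(x_1, y_1), ..., (x_N, y_N)}: N (input, label) pairs
    N = 100
    D = [(rng.normal(size=3), int(rng.integers(0, 2))) for _ in range(N)]

    # Disjoint train/validation/test splits of the dataset
    perm = rng.permutation(N)
    D_train = [D[i] for i in perm[:80]]
    D_val   = [D[i] for i in perm[80:90]]
    D_test  = [D[i] for i in perm[90:]]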

F.3 Probability and Statistics

Notation Meaning
P(A) Probability of event A
P(A | B) Conditional probability of A given B
p(x) Probability density function (continuous) or probability mass function (discrete)
p(x | θ) Density/mass parameterized by θ
p_θ(x) Alternative notation for parameterized distribution
X ~ p Random variable X follows distribution p
X ~ 𝒩(μ, σ^2) X follows a Gaussian with mean μ and variance σ^2
E[X] or E_p[X] Expected value of X (under distribution p)
Var(X) Variance of X
Cov(X, Y) Covariance of X and Y
σ Standard deviation
σ^2 Variance
μ Mean
ρ Correlation coefficient
H(X) Entropy of X
H(p, q) Cross-entropy between distributions p and q
D_KL(p || q) KL divergence from q to p: E_p[log(p(x)/q(x))]
I(X; Y) Mutual information between X and Y
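
For discrete distributions, the information-theoretic quantities are straightforward to compute. A minimal NumPy sketch, using two invented distributions p and q:

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])   # a discrete distribution p
    q = np.array([0.4, 0.4, 0.2])     # a second distribution q

    H_p  = -np.sum(p * np.log(p))     # entropy H(p)
    H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
    D_kl = np.sum(p * np.log(p / q))  # D_KL(p || q) = H(p, q) - H(p)

    # KL divergence is asymmetric: D_KL(p || q) != D_KL(q || p) in general.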

F.4 Linear Algebra Operations

Notation Meaning
x^T Transpose of vector x
W^T Transpose of matrix W
W^{-1} Inverse of matrix W
W^+ Moore-Penrose pseudoinverse of W
I or I_n Identity matrix (of size n if specified)
0 Zero vector or zero matrix (size inferred from context)
x · y or x^T y Dot product (inner product) of x and y
x y^T Outer product of x and y
x ⊙ y Hadamard (element-wise) product
A ⊗ B Kronecker product of A and B
||x||_2 L2 (Euclidean) norm: sqrt(sum_i x_i^2)
||x||_1 L1 (Manhattan) norm: sum_i |x_i|
||x||_p Lp norm: (sum_i |x_i|^p)^{1/p}
||A||_F Frobenius norm: sqrt(sum_{i,j} A_{ij}^2)
det(A) Determinant of A
tr(A) Trace of A: sum of diagonal elements
rank(A) Rank of A
diag(v) Diagonal matrix with entries from vector v
diag(A) Vector of diagonal elements of matrix A
λ_i The i-th eigenvalue
σ_i The i-th singular value
U, Σ, V Matrices of the singular value decomposition (SVD): A = U Σ V^T
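
Most of these operations have direct NumPy counterparts. A minimal sketch with invented inputs:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    x = np.array([1.0, -2.0])
    y = np.array([0.5, 4.0])

    inner = x @ y                     # x · y, the dot product
    outer = np.outer(x, y)            # x y^T, the outer product
    hada  = x * y                     # x ⊙ y, the Hadamard product
    l2    = np.linalg.norm(x)         # ||x||_2
    l1    = np.linalg.norm(x, 1)      # ||x||_1
    frob  = np.linalg.norm(A, 'fro')  # ||A||_F
    detA  = np.linalg.det(A)          # det(A)
    trA   = np.trace(A)               # tr(A)
    U, S, Vt = np.linalg.svd(A)       # A = U Σ V^T; S holds the singular values σ_i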

F.5 Calculus and Optimization

Notation Meaning
f'(x) or df/dx Derivative of f with respect to x
∂f/∂x Partial derivative of f with respect to x
∇f or ∇_x f Gradient of f with respect to x
∇^2 f or H Hessian matrix of second derivatives
J or J_f Jacobian matrix of vector-valued function f
argmin_x f(x) Value of x that minimizes f(x)
argmax_x f(x) Value of x that maximizes f(x)
x^* or θ^* Optimal value of x or θ
η Learning rate
θ Model parameters (generic)
θ_t Parameters at optimization step t
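
These symbols combine in the basic gradient-descent update θ_{t+1} = θ_t - η ∇L(θ_t). A minimal sketch with an invented quadratic loss:

    import numpy as np

    target = np.array([1.0, -2.0])

    def L(theta):                     # a toy loss, minimized at θ^* = target
        return np.sum((theta - target) ** 2)

    def grad_L(theta):                # its gradient ∇L(θ)
        return 2.0 * (theta - target)

    eta = 0.1                         # learning rate η
    theta = np.zeros(2)               # initial parameters θ_0
    for t in range(100):
        theta = theta - eta * grad_L(theta)   # θ_{t+1} = θ_t - η ∇L(θ_t)
    # theta now approximates θ^* = argmin_θ L(θ)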

F.6 Neural Network Notation

Notation Meaning
f_θ or f(x; θ) Neural network function parameterized by θ
W^{(l)} Weight matrix of layer l
b^{(l)} Bias vector of layer l
h^{(l)} Hidden representation (activation) at layer l
a^{(l)} Pre-activation value at layer l (before applying the activation function)
σ(·) Activation function (generic); also used for sigmoid when context is clear
L or J Loss function
L(θ) Loss as a function of parameters
L_CE Cross-entropy loss
L_MSE Mean squared error loss
ŷ Model prediction
y True label / ground truth
e_i One-hot vector with 1 in position i
p_θ(y | x) Model's predicted probability of y given input x
L Number of layers
d Dimensionality of representations (generic)
d_model Model dimension (transformer hidden size)
d_k Key/query dimension in attention
d_v Value dimension in attention
d_ff Feed-forward network inner dimension
h Number of attention heads
N Number of training examples
T Sequence length
B Batch size
V Vocabulary size (when used in language modeling context)
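
A single fully connected layer ties several of these symbols together: a^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}, followed by h^{(l)} = σ(a^{(l)}). A minimal sketch (the dimensions and the ReLU choice of σ are arbitrary):

    import numpy as np

    def sigma(a):                   # activation function σ(·); ReLU here
        return np.maximum(0.0, a)

    rng = np.random.default_rng(0)
    d_in, d_out = 4, 3              # layer dimensions (invented)
    W_l = rng.normal(size=(d_out, d_in))  # W^{(l)}
    b_l = np.zeros(d_out)                 # b^{(l)}

    h_prev = rng.normal(size=d_in)  # h^{(l-1)}, the previous layer's output
    a_l = W_l @ h_prev + b_l        # a^{(l)}: pre-activation
    h_l = sigma(a_l)                # h^{(l)}: hidden representation at layer l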

F.7 Transformer-Specific Notation

Notation Meaning
Q, K, V Query, Key, Value matrices in attention
W_Q, W_K, W_V Learned projection matrices for Q, K, V
W_O Output projection matrix in multi-head attention
Attn(Q, K, V) Attention function output
MHA(X) Multi-head attention output
FFN(x) Feed-forward network: typically W_2 σ(W_1 x + b_1) + b_2
PE(x, pos) Positional encoding added to token embedding at position pos
[CLS] Special classification token (BERT-style)
[MASK] Masked token placeholder (BERT-style)
[BOS], [EOS] Beginning/end of sequence tokens
[PAD] Padding token
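
As a concrete reference for how Q, K, V, and d_k fit together, single-head scaled dot-product attention, Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, can be sketched as follows (all shapes and values are invented; batching, masking, and multiple heads are omitted):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    T, d_model, d_k, d_v = 5, 8, 4, 4       # sequence length and dimensions (invented)
    X = rng.normal(size=(T, d_model))       # token representations

    W_Q = rng.normal(size=(d_model, d_k))   # learned projection matrices
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # shape (T, d_v)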

F.8 Greek Letters and Their Common Uses

Letter Name Common Use in This Book
α alpha Learning rate, significance level, smoothing parameter, mixture weight
β beta Momentum coefficient, inverse temperature, Beta distribution parameter
γ gamma Discount factor, scale parameter in normalization
δ delta Small perturbation, Kronecker delta, change/difference
ε epsilon Small positive constant (e.g., for numerical stability), noise term, exploration rate
ζ zeta (Rarely used standalone)
η eta Learning rate (primary notation in this book)
θ theta Model parameters (the most common use), angle
ι iota (Rarely used)
κ kappa Condition number, Cohen's kappa
λ lambda Regularization strength, Poisson rate parameter, eigenvalue
μ mu Mean of a distribution, learning rate in some contexts
ν nu Degrees of freedom (t-distribution, chi-square)
ξ xi Random noise variable, auxiliary variable
π pi The constant 3.14159...; also policy function in RL
ρ rho Correlation coefficient, spectral radius
σ sigma Standard deviation, sigmoid function, singular value, activation function (generic)
τ tau Temperature parameter, time constant, soft update rate
υ upsilon (Rarely used)
φ phi Feature function, angle, Gaussian CDF Φ(x), model parameters (alternative to θ)
χ chi Chi-square distribution
ψ psi Auxiliary function, feature map
ω omega Frequency, angular velocity; Ω for sample space

F.9 Subscripts and Superscripts Convention

Convention Meaning Example
Subscript i, j, k Index into a vector, matrix, or set x_i, W_{ij}
Subscript t Time step or token position h_t, x_t
Superscript (l) Layer index (in parentheses to distinguish from exponent) W^{(l)}, h^{(l)}
Superscript T Transpose x^T, W^T
Superscript -1 Inverse A^{-1}
Superscript * Optimal value θ^*, x^*
Hat accent Predicted / estimated value ŷ, θ̂
Bar accent Mean / average x̄
Tilde accent Modified or approximate version θ̃, x̃

F.10 Common Abbreviations in Equations

Abbreviation Expansion
s.t. Subject to (in optimization constraints)
i.i.d. Independent and identically distributed
w.r.t. With respect to
a.s. Almost surely
w.p. With probability
iff If and only if
WLOG Without loss of generality
LHS / RHS Left-hand side / Right-hand side