Appendix A: Mathematical Foundations

This appendix provides a concise reference for the mathematical concepts used throughout this textbook. It is not intended as a standalone tutorial; rather, it serves as a refresher and a place to look up notation, definitions, and key results referenced in the main chapters.


A.1 Notation Guide

Throughout this book we adopt the following conventions. Scalars are denoted by lowercase italic letters (e.g., x, n, alpha). Vectors are denoted by bold lowercase letters (e.g., x, w, b) and are assumed to be column vectors unless otherwise stated. Matrices are denoted by bold uppercase letters (e.g., W, X, A). Sets are denoted by calligraphic uppercase letters (e.g., S, D, X).

Common Symbols

x_i : The i-th element of vector x
W_{ij} : The element in row i, column j of matrix W
x^T : Transpose of vector x
W^{-1} : Inverse of matrix W
||x||_2 : L2 (Euclidean) norm of x
||x||_1 : L1 (Manhattan) norm of x
sum_{i=1}^{n} x_i : Summation x_1 + x_2 + ... + x_n
prod_{i=1}^{n} x_i : Product x_1 * x_2 * ... * x_n
E[X] : Expected value of random variable X
Var(X) : Variance of random variable X
Cov(X, Y) : Covariance of X and Y
P(A) : Probability of event A
P(A | B) : Conditional probability of A given B
p(x) : Probability density (or mass) function evaluated at x
argmax_x f(x) : Value of x that maximizes f(x)
argmin_x f(x) : Value of x that minimizes f(x)
nabla f : Gradient of f
partial f / partial x : Partial derivative of f with respect to x
O(n) : Big-O notation, an asymptotic upper bound
R^n : n-dimensional real coordinate space
I_n : n x n identity matrix
diag(v) : Diagonal matrix with entries from vector v
det(A) : Determinant of matrix A
tr(A) : Trace of matrix A

A.2 Probability Fundamentals

A.2.1 Axioms of Probability

For a sample space Omega and an event A that is a subset of Omega:

  1. Non-negativity: P(A) >= 0 for all events A.
  2. Normalization: P(Omega) = 1.
  3. Additivity: For mutually exclusive events A and B, P(A union B) = P(A) + P(B).

A.2.2 Basic Rules

Complement Rule: P(A^c) = 1 - P(A).

Addition Rule: P(A union B) = P(A) + P(B) - P(A intersection B).

Multiplication Rule: P(A intersection B) = P(A | B) * P(B) = P(B | A) * P(A).

Law of Total Probability: If B_1, B_2, ..., B_n form a partition of Omega, then P(A) = sum_{i=1}^{n} P(A | B_i) * P(B_i).

Independence: Events A and B are independent if and only if P(A intersection B) = P(A) * P(B), which equivalently means P(A | B) = P(A).

A.2.3 Bayes' Theorem

Bayes' theorem is arguably the most important result in probabilistic machine learning. It provides a principled way to update beliefs in light of new evidence:

P(H | D) = P(D | H) * P(H) / P(D)

where:

  • P(H | D) is the posterior probability of hypothesis H given data D.
  • P(D | H) is the likelihood of observing data D if H is true.
  • P(H) is the prior probability of H before observing data.
  • P(D) is the evidence (or marginal likelihood), which serves as a normalizing constant.

In practice, P(D) is often computed via the law of total probability: P(D) = sum_i P(D | H_i) * P(H_i).
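To make the mechanics concrete, here is a minimal numpy sketch of a discrete Bayesian update; the three hypotheses and their priors and likelihoods are purely illustrative:

    import numpy as np

    # Illustrative priors P(H_i) and likelihoods P(D | H_i) for three hypotheses.
    prior = np.array([0.5, 0.3, 0.2])
    likelihood = np.array([0.1, 0.4, 0.8])

    # P(D) via the law of total probability, then the posterior P(H_i | D).
    evidence = np.sum(likelihood * prior)
    posterior = likelihood * prior / evidence

    print(posterior)        # approx [0.152 0.364 0.485]
    print(posterior.sum())  # 1.0 -- the evidence acts as a normalizing constant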

Extended form for continuous parameters: Given a parameter theta and observed data D:

p(theta | D) = p(D | theta) * p(theta) / p(D)

This formulation is the backbone of Bayesian inference discussed in Chapters 5 and 18.

A.2.4 Expected Value and Variance

For a discrete random variable X with probability mass function p(x):

  • Expected value: E[X] = sum_x x * p(x)
  • Variance: Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
  • Standard deviation: sigma = sqrt(Var(X))

For a continuous random variable X with probability density function f(x):

  • Expected value: E[X] = integral from -inf to +inf of x * f(x) dx
  • Variance: Var(X) = integral from -inf to +inf of (x - E[X])^2 * f(x) dx

Properties of expectation:

  • E[aX + b] = a * E[X] + b (linearity)
  • E[X + Y] = E[X] + E[Y] (always, even if X and Y are dependent)
  • E[XY] = E[X] * E[Y] if X and Y are independent (independence is sufficient but not necessary: uncorrelated dependent variables also satisfy it)

Properties of variance:

  • Var(aX + b) = a^2 * Var(X)
  • Var(X + Y) = Var(X) + Var(Y) + 2 * Cov(X, Y)
  • If X and Y are independent, Var(X + Y) = Var(X) + Var(Y)
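As a quick check of the discrete definitions and the properties above, the following numpy sketch uses an illustrative four-point pmf:

    import numpy as np

    # Illustrative discrete pmf on the support {0, 1, 2, 3}.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    p = np.array([0.1, 0.2, 0.3, 0.4])

    mean = np.sum(x * p)                      # E[X]
    var = np.sum((x - mean) ** 2 * p)         # Var(X), definitional form
    var_alt = np.sum(x ** 2 * p) - mean ** 2  # E[X^2] - (E[X])^2

    a, b = 2.0, 5.0
    assert np.isclose(var, var_alt)
    assert np.isclose(np.sum((a * x + b) * p), a * mean + b)  # linearity
    assert np.isclose(np.sum(((a * x + b) - (a * mean + b)) ** 2 * p),
                      a ** 2 * var)                           # Var(aX+b) = a^2 Var(X)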

A.2.5 Information-Theoretic Quantities

Entropy of a discrete random variable X: H(X) = -sum_x p(x) * log p(x).

Cross-entropy between distributions p and q: H(p, q) = -sum_x p(x) * log q(x).

KL Divergence from q to p: D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)) = H(p, q) - H(p).

KL divergence is always non-negative and equals zero if and only if p = q (Gibbs' inequality). It is not symmetric: D_KL(p || q) != D_KL(q || p) in general.
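These quantities take only a few lines to compute; a minimal sketch with two illustrative pmfs, verifying the decomposition, Gibbs' inequality, and the asymmetry:

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.5, 0.3, 0.2])

    H_p = -np.sum(p * np.log(p))       # entropy H(p)
    H_pq = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
    kl_pq = np.sum(p * np.log(p / q))  # D_KL(p || q)
    kl_qp = np.sum(q * np.log(q / p))  # D_KL(q || p)

    assert np.isclose(kl_pq, H_pq - H_p)  # D_KL(p || q) = H(p, q) - H(p)
    assert kl_pq >= 0                     # Gibbs' inequality
    print(kl_pq, kl_qp)                   # generally unequal: KL is asymmetric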


A.3 Common Probability Distributions

The following entries summarize the distributions most frequently encountered in this textbook.

Discrete Distributions

Bernoulli(p): PMF P(X=1) = p, P(X=0) = 1-p; mean p; variance p(1-p). Binary classification (Ch. 4).
Binomial(n, p): PMF C(n,k) * p^k * (1-p)^{n-k}; mean np; variance np(1-p). Counting successes (Ch. 5).
Poisson(lambda): PMF (lambda^k * e^{-lambda}) / k!; mean lambda; variance lambda. Count data, rare events (Ch. 5).
Categorical(p_1, ..., p_K): PMF P(X=k) = p_k. Multi-class classification (Ch. 4).
Geometric(p): PMF (1-p)^{k-1} * p; mean 1/p; variance (1-p)/p^2. Waiting times (Ch. 5).

Continuous Distributions

Uniform(a, b): PDF 1/(b-a) for a <= x <= b; mean (a+b)/2; variance (b-a)^2/12. Random initialization (Ch. 7).
Gaussian(mu, sigma^2): PDF (1/sqrt(2*pi*sigma^2)) * exp(-(x-mu)^2 / (2*sigma^2)); mean mu; variance sigma^2. Ubiquitous (Ch. 3-40).
Exponential(lambda): PDF lambda * exp(-lambda*x) for x >= 0; mean 1/lambda; variance 1/lambda^2. Waiting times (Ch. 5).
Beta(alpha, beta): PDF x^{alpha-1} * (1-x)^{beta-1} / B(alpha, beta); mean alpha/(alpha+beta); variance alpha*beta / ((alpha+beta)^2 * (alpha+beta+1)). Bayesian priors (Ch. 18).
Gamma(alpha, beta): PDF (beta^alpha / Gamma(alpha)) * x^{alpha-1} * exp(-beta*x); mean alpha/beta; variance alpha/beta^2. Bayesian priors (Ch. 18).

The multivariate Gaussian distribution in R^d with mean vector mu and covariance matrix Sigma has the density:

p(x) = (1 / ((2*pi)^{d/2} * det(Sigma)^{1/2})) * exp(-0.5 * (x - mu)^T * Sigma^{-1} * (x - mu))

This distribution is central to Gaussian processes (Chapter 18) and variational autoencoders (Chapter 14).
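A direct numpy transcription of this density is sketched below; mu and Sigma are illustrative, and production code typically prefers a Cholesky-based routine (e.g., scipy.stats.multivariate_normal) for numerical stability:

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        """Multivariate Gaussian density, evaluated directly from the formula."""
        d = mu.shape[0]
        diff = x - mu
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        return norm_const * np.exp(-0.5 * quad)

    mu = np.zeros(2)
    Sigma = np.array([[1.0, 0.5],
                      [0.5, 2.0]])
    print(mvn_pdf(np.array([0.3, -0.2]), mu, Sigma))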


A.4 Linear Algebra Essentials

A.4.1 Vectors and Vector Operations

A vector x in R^n is an ordered tuple of n real numbers. Key operations:

Dot product: x . y = sum_{i=1}^{n} x_i * y_i = x^T y. The dot product is a scalar.

Euclidean norm: ||x||_2 = sqrt(sum_{i=1}^{n} x_i^2) = sqrt(x^T x).

Cosine similarity: cos(theta) = (x . y) / (||x||_2 * ||y||_2). This measures the angle between two vectors and is widely used in embedding similarity (Chapters 10, 20, 26).

Outer product: x y^T produces an n x m matrix when x is in R^n and y is in R^m.

Hadamard (element-wise) product: (x circle y)_i = x_i * y_i.
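The operations above in numpy, for two illustrative vectors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 5.0, 6.0])

    dot = x @ y                    # x . y = x^T y, a scalar
    norm = np.linalg.norm(x)       # ||x||_2
    cos_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity
    outer = np.outer(x, y)         # x y^T, here a 3 x 3 matrix
    hadamard = x * y               # element-wise (Hadamard) product

    print(cos_sim)                 # approx 0.975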

A.4.2 Matrices

An m x n matrix A maps vectors from R^n to R^m via left-multiplication: y = Ax.

Matrix multiplication: If A is m x n and B is n x p, then C = AB is m x p, with C_{ij} = sum_{k=1}^{n} A_{ik} * B_{kj}.

Transpose: (A^T)_{ij} = A_{ji}. Properties: (A^T)^T = A and (AB)^T = B^T A^T.

Symmetric matrix: A = A^T. Real symmetric matrices have real eigenvalues and orthogonal eigenvectors.

Positive definite matrix: A symmetric matrix A is positive definite if x^T A x > 0 for all non-zero x, and positive semi-definite if x^T A x >= 0 for all x. Covariance matrices are always positive semi-definite.

Orthogonal matrix: Q^T Q = Q Q^T = I. Columns of Q form an orthonormal basis.

A.4.3 Eigenvalues and Eigenvectors

A non-zero vector v is an eigenvector of square matrix A if Av = lambda * v for some scalar lambda (the eigenvalue). Key facts:

  • An n x n matrix has exactly n complex eigenvalues (counted with multiplicity); not all need be real in general.
  • For symmetric matrices, all eigenvalues are real and eigenvectors corresponding to distinct eigenvalues are orthogonal.
  • tr(A) = sum of eigenvalues. det(A) = product of eigenvalues.
  • A positive definite matrix has all positive eigenvalues.
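A quick numerical check of the trace and determinant facts on a random symmetric matrix (a minimal sketch; numpy's eigh is the routine intended for symmetric matrices):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = (B + B.T) / 2  # symmetrize

    eigvals, eigvecs = np.linalg.eigh(A)
    assert np.allclose(np.trace(A), eigvals.sum())        # tr(A) = sum of eigenvalues
    assert np.allclose(np.linalg.det(A), eigvals.prod())  # det(A) = product of eigenvalues
    assert np.allclose(eigvecs.T @ eigvecs, np.eye(4))    # eigenvectors are orthonormal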

A.4.4 Singular Value Decomposition (SVD)

Any m x n real matrix A can be decomposed as A = U Sigma V^T, where:

  • U is m x m orthogonal (left singular vectors),
  • Sigma is m x n diagonal with non-negative entries (singular values sigma_1 >= sigma_2 >= ... >= 0),
  • V is n x n orthogonal (right singular vectors).

The SVD is fundamental to dimensionality reduction (PCA, Chapter 6), low-rank approximations, and understanding the geometry of linear transformations. The best rank-k approximation to A (in the Frobenius or spectral norm) is obtained by keeping only the top k singular values and their corresponding singular vectors.
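A minimal numpy sketch of the rank-k truncation, checking the Eckart-Young property that the Frobenius error equals the root-sum-square of the discarded singular values (the matrix is random and illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # keep the top-k singular triplets

    err = np.linalg.norm(A - A_k, ord="fro")
    assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))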

A.4.5 Matrix Calculus

For a function f: R^{m x n} -> R, the gradient with respect to matrix W is the matrix of partial derivatives: (nabla_W f)_{ij} = partial f / partial W_{ij}.

Key identities used in neural network backpropagation:

  • nabla_x (a^T x) = a
  • nabla_x (x^T A x) = (A + A^T) x (which equals 2Ax if A is symmetric)
  • nabla_W tr(AWB) = A^T B^T
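Identities like these are easy to sanity-check against finite differences; a sketch for nabla_x (x^T A x), with a random illustrative A:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    x = rng.standard_normal(3)

    f = lambda v: v @ A @ v
    analytic = (A + A.T) @ x  # the identity above

    # Central finite differences along each coordinate direction.
    eps = 1e-6
    numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(3)])
    assert np.allclose(analytic, numeric, atol=1e-5)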


A.5 Calculus Reference

A.5.1 Common Derivatives

d/dx [x^n] = n * x^{n-1}
d/dx [e^x] = e^x
d/dx [ln(x)] = 1/x
d/dx [sin(x)] = cos(x)
d/dx [cos(x)] = -sin(x)
d/dx [sigmoid(x)] = sigmoid(x) * (1 - sigmoid(x)), where sigmoid(x) = 1/(1 + e^{-x})
d/dx [tanh(x)] = 1 - tanh^2(x)
d/dx [ReLU(x)] = 1 if x > 0, 0 if x < 0 (undefined at 0), where ReLU(x) = max(0, x)
partial softmax(x)_i / partial x_j = softmax(x)_i * (delta_{ij} - softmax(x)_j)
partial logsumexp(x) / partial x_i = softmax(x)_i, where logsumexp(x) = log(sum_j exp(x_j))

A.5.2 Chain Rule

If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x).

For multivariate functions, if y = f(g(x)) where g: R^n -> R^m and f: R^m -> R, then:

partial f / partial x_j = sum_{i=1}^{m} (partial f / partial g_i) * (partial g_i / partial x_j)

In matrix notation: nabla_x f = J_g^T * nabla_g f, where J_g is the m x n Jacobian matrix of g.

The chain rule is the mathematical foundation of the backpropagation algorithm (Chapter 7). In deep networks with L layers, the gradient of the loss with respect to parameters in layer l involves a chain of L - l + 1 matrix multiplications.
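The Jacobian form of the chain rule can likewise be verified numerically; a minimal sketch with the illustrative choices g(x) = tanh(Wx) and f(u) = sum(u^2):

    import numpy as np

    W = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0]])
    g = lambda x: np.tanh(W @ x)   # g: R^3 -> R^2
    f = lambda u: np.sum(u ** 2)   # f: R^2 -> R

    x = np.array([0.5, -1.0, 0.25])
    u = g(x)
    grad_f = 2 * u                    # nabla_g f evaluated at g(x)
    J_g = (1 - u ** 2)[:, None] * W   # Jacobian of tanh(Wx) = diag(1 - u^2) W
    analytic = J_g.T @ grad_f         # nabla_x f = J_g^T nabla_g f

    eps = 1e-6
    numeric = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                        for e in np.eye(3)])
    assert np.allclose(analytic, numeric, atol=1e-5)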

A.5.3 Gradient and Optimization

The gradient nabla f(x) of a scalar function f: R^n -> R points in the direction of steepest ascent. Gradient descent updates parameters by moving in the opposite direction:

x_{t+1} = x_t - eta * nabla f(x_t)

where eta > 0 is the learning rate.
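A minimal sketch of this update rule on the convex quadratic f(x) = 0.5 * x^T A x - b^T x, whose gradient is Ax - b and whose minimizer solves Ax = b (A, b, and the learning rate are illustrative):

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])  # symmetric positive definite
    b = np.array([1.0, -1.0])

    x = np.zeros(2)
    eta = 0.1  # learning rate
    for _ in range(200):
        x = x - eta * (A @ x - b)  # x_{t+1} = x_t - eta * nabla f(x_t)

    assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)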

Second-order information: The Hessian matrix H has entries H_{ij} = partial^2 f / (partial x_i * partial x_j). Newton's method uses the Hessian: x_{t+1} = x_t - H^{-1} * nabla f(x_t). This converges faster but is impractical for large-scale neural networks due to the cost of computing and inverting the Hessian.

Convexity: A function f is convex if f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y) for all lambda in [0, 1]. For convex functions, any local minimum is a global minimum. Neural network loss functions are generally non-convex.

A.5.4 Common Integrals

integral of x^n dx = x^{n+1}/(n+1) + C (for n != -1)
integral of e^{ax} dx = e^{ax}/a + C
integral of 1/x dx = ln|x| + C
integral from -inf to +inf of e^{-x^2/2} dx = sqrt(2*pi)
integral from 0 to inf of x^{n-1} * e^{-x} dx = Gamma(n)

The Gaussian integral (second-to-last line above) is fundamental to normalizing the Gaussian distribution.
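The Gaussian integral is easy to confirm numerically on a truncated grid (a sketch; the tails beyond |x| = 10 are negligible):

    import numpy as np

    x = np.linspace(-10.0, 10.0, 100_001)
    dx = x[1] - x[0]
    integral = np.sum(np.exp(-x ** 2 / 2)) * dx  # simple Riemann sum
    assert np.isclose(integral, np.sqrt(2 * np.pi))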


A.6 Key Proofs Referenced in the Main Text

A.6.1 Cross-Entropy Loss Derives from Maximum Likelihood (Chapter 4)

Claim: Minimizing cross-entropy loss for classification is equivalent to maximum likelihood estimation.

Proof: For a dataset of N samples with true labels y_i in {1, ..., K} and model predictions p(y | x_i; theta), the log-likelihood is:

log L(theta) = sum_{i=1}^{N} log p(y_i | x_i; theta)

The negative log-likelihood is:

-log L(theta) = -sum_{i=1}^{N} log p(y_i | x_i; theta) = sum_{i=1}^{N} H(delta_{y_i}, p(. | x_i; theta))

where delta_{y_i} is the one-hot distribution placing all mass on y_i, and H(p, q) is the cross-entropy. Therefore minimizing cross-entropy over the training set is exactly maximum likelihood estimation.
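The equivalence is visible in a few lines of numpy; the predicted probabilities below are illustrative:

    import numpy as np

    probs = np.array([[0.7, 0.2, 0.1],   # p(y | x_i; theta), rows sum to 1
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
    y = np.array([0, 1, 2])              # true labels y_i

    # Mean negative log-likelihood of the observed labels.
    nll = -np.mean(np.log(probs[np.arange(len(y)), y]))

    # Mean cross-entropy against one-hot (delta_{y_i}) targets.
    onehot = np.eye(3)[y]
    ce = -np.mean(np.sum(onehot * np.log(probs), axis=1))

    assert np.isclose(nll, ce)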

A.6.2 Universal Approximation Theorem (Chapter 7)

Theorem (Cybenko, 1989; Hornik, 1991): Let sigma be any continuous, non-constant, bounded activation function. Then for any continuous function f on a compact subset of R^n and any epsilon > 0, there exists a single-hidden-layer neural network g(x) = sum_{j=1}^{M} alpha_j * sigma(w_j^T x + b_j) such that |f(x) - g(x)| < epsilon for all x in the compact set.

Significance: This theorem guarantees that neural networks are universal function approximators. However, it says nothing about how large M must be or how easy it is to find the weights by optimization. Modern deep learning succeeds in practice because depth provides exponentially more efficient representations than width for many function classes.

A.6.3 Softmax Is Invariant to Constant Shifts (Chapter 8)

Claim: softmax(z + c * 1) = softmax(z) for any scalar c.

Proof: softmax(z + c * 1)_i = exp(z_i + c) / sum_j exp(z_j + c) = (exp(c) * exp(z_i)) / (exp(c) * sum_j exp(z_j)) = exp(z_i) / sum_j exp(z_j) = softmax(z)_i.

This property is exploited numerically: before computing softmax, subtract max(z) to prevent overflow.
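This is exactly the trick used in a numerically stable softmax implementation; a minimal sketch:

    import numpy as np

    def softmax(z):
        z = z - np.max(z)  # shift invariance: the output is unchanged
        e = np.exp(z)
        return e / e.sum()

    z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
    print(softmax(z))                        # approx [0.090 0.245 0.665]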

A.6.4 The Reparameterization Trick (Chapter 14)

Problem: In variational autoencoders (VAEs), we need to backpropagate through a stochastic sampling step z ~ N(mu, sigma^2).

Solution: Write z = mu + sigma * epsilon where epsilon ~ N(0, 1). Now z is a deterministic, differentiable function of mu and sigma (given epsilon), so standard backpropagation applies.

Why it works: The transformation preserves the distribution: if epsilon ~ N(0, 1), then mu + sigma * epsilon ~ N(mu, sigma^2). The gradient nabla_{mu, sigma} f(z) can be computed via the chain rule through z = mu + sigma * epsilon, and averaged over samples of epsilon to estimate E[nabla f(z)].
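A minimal numpy sketch of the resulting gradient estimator, for the illustrative choice f(z) = z^2, where E[f(z)] = mu^2 + sigma^2 gives known gradients 2*mu and 2*sigma:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 0.8
    eps = rng.standard_normal(1_000_000)
    z = mu + sigma * eps               # deterministic in (mu, sigma) given eps

    df_dz = 2 * z                      # f'(z) for f(z) = z^2
    grad_mu = np.mean(df_dz)           # chain rule with dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)  # chain rule with dz/dsigma = eps

    print(grad_mu, grad_sigma)         # approx 2*mu = 3.0 and 2*sigma = 1.6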

A.6.5 Attention Is Permutation-Equivariant (Chapter 11)

Claim: The self-attention operation is equivariant to permutations of the input sequence.

Proof: Let X be an n x d input matrix (n tokens, d dimensions) and let P be a permutation matrix. Self-attention computes:

Attn(X) = softmax(X W_Q (X W_K)^T / sqrt(d_k)) * X W_V

Substituting PX:

Attn(P X) = softmax(P X W_Q (P X W_K)^T / sqrt(d_k)) * P X W_V
    = softmax(P * X W_Q W_K^T X^T * P^T / sqrt(d_k)) * P * X W_V
    = P * softmax(X W_Q W_K^T X^T / sqrt(d_k)) * P^T * P * X W_V
    = P * softmax(X W_Q W_K^T X^T / sqrt(d_k)) * X W_V    (since P^T P = I)
    = P * Attn(X)

The key step is that softmax(P A P^T) = P softmax(A) P^T for any permutation matrix P: softmax is applied row-wise, P merely reorders the rows, and P^T reorders the entries within each row, neither of which changes the normalization. The trailing P^T then cancels against P because P^T P = I. This equivariance is why transformers require positional encodings to distinguish token order.
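The claim is also easy to verify numerically; a minimal single-head sketch with random illustrative weights:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, d_k = 5, 8, 4
    X = rng.standard_normal((n, d))
    W_Q, W_K = rng.standard_normal((2, d, d_k))
    W_V = rng.standard_normal((d, d))

    def softmax_rows(A):
        A = A - A.max(axis=1, keepdims=True)  # stable row-wise softmax
        E = np.exp(A)
        return E / E.sum(axis=1, keepdims=True)

    def attn(X):
        scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
        return softmax_rows(scores) @ (X @ W_V)

    P = np.eye(n)[rng.permutation(n)]             # random permutation matrix
    assert np.allclose(attn(P @ X), P @ attn(X))  # permutation equivariance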

A.6.6 KL Divergence Is Non-Negative (Gibbs' Inequality)

Claim: D_KL(p || q) >= 0 with equality if and only if p = q.

Proof: By Jensen's inequality applied to the concave function log:

-D_KL(p || q) = sum_x p(x) * log(q(x) / p(x)) <= log(sum_x p(x) * q(x) / p(x)) = log(sum_x q(x)) = log(1) = 0

Therefore D_KL(p || q) >= 0. Equality holds if and only if q(x)/p(x) is constant for all x in the support of p, which (given both are distributions) requires p = q.


A.7 Useful Inequalities

Cauchy-Schwarz: |x^T y| <= ||x||_2 * ||y||_2. Bounding dot products (Ch. 10).
Jensen's: f(E[X]) <= E[f(X)] for convex f. Variational inference (Ch. 14, 18).
Markov's: P(X >= a) <= E[X]/a for X >= 0 and a > 0. Concentration bounds (Ch. 5).
Chebyshev's: P(|X - mu| >= k*sigma) <= 1/k^2. Outlier analysis (Ch. 5).
Hoeffding's: P(|sample_mean - mu| >= t) <= 2*exp(-2*n*t^2/(b-a)^2) for n i.i.d. samples bounded in [a, b]. Generalization bounds (Ch. 6).
Triangle inequality: ||x + y|| <= ||x|| + ||y||. Metric properties (Ch. 10).

A.8 Approximations and Series Expansions

Taylor series of f(x) about x = a:

f(x) = f(a) + f'(a)(x-a) + f''(a)(x-a)^2/2! + ...

Key expansions:

  • e^x = 1 + x + x^2/2! + x^3/3! + ...
  • ln(1 + x) = x - x^2/2 + x^3/3 - ... for |x| < 1
  • (1 + x)^n approx 1 + nx for small |x|
  • sigmoid(x) approx 0.5 + x/4 for small x

Stirling's approximation: n! approx sqrt(2*pi*n) * (n/e)^n; taking logs and dropping the lower-order term gives ln(n!) approx n*ln(n) - n. This is used when analyzing combinatorial quantities in information theory (Chapter 3) and statistical mechanics interpretations of learning.
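A quick numerical comparison (ln(n!) computed exactly via the log-gamma function, since n! itself overflows for large n):

    import numpy as np
    from math import lgamma

    for n in [10, 100, 1000]:
        exact = lgamma(n + 1)  # ln(n!)
        stirling = 0.5 * np.log(2 * np.pi * n) + n * np.log(n) - n
        print(n, exact, stirling, exact - stirling)  # the gap shrinks like 1/(12n)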