Appendix A: Mathematical Notation Reference

A unified guide to every symbol, convention, and notational pattern used in this textbook.

This appendix consolidates the mathematical notation used across all 39 chapters into a single reference. When a symbol carries different meanings in different subfields (e.g., $p$ for a probability density vs. $p$ for the order of an $\ell_p$ norm), the intended context is noted explicitly. Section references in the index (A.11) point to where each symbol is first introduced.


A.1 General Conventions

Convention Meaning
Lowercase italic ($x, y, \alpha$) Scalars
Lowercase bold ($\mathbf{x}, \mathbf{w}, \boldsymbol{\mu}$) Column vectors
Uppercase bold ($\mathbf{A}, \mathbf{W}, \boldsymbol{\Sigma}$) Matrices
Uppercase calligraphic ($\mathcal{D}, \mathcal{L}, \mathcal{X}$) Sets, spaces, loss functions
Uppercase blackboard bold ($\mathbb{R}, \mathbb{E}, \mathbb{P}$) Number sets, expectation, probability
Superscript $(i)$ ($x^{(i)}, \mathbf{x}^{(i)}$) Index over data points (sample $i$)
Subscript $j$ ($x_j, w_j$) Index over features or dimensions (component $j$)
Superscript $[l]$ ($\mathbf{W}^{[l]}$) Layer index in a neural network
Hat ($\hat{y}, \hat{\theta}$) Estimated or predicted quantity
Star ($\theta^*, \mathbf{x}^*$) Optimal value
Bar ($\bar{x}$) Sample mean
Tilde ($\tilde{x}, \tilde{\mathbf{x}}$) Noisy, corrupted, or alternative version
$:=$ Defined as

A.2 Sets and Number Systems

Symbol Meaning
$\mathbb{R}$ Real numbers
$\mathbb{R}^n$ $n$-dimensional real vector space
$\mathbb{R}^{m \times n}$ Space of $m \times n$ real matrices
$\mathbb{R}_{\geq 0}$ Non-negative reals
$\mathbb{Z}$ Integers
$\mathbb{N}$ Natural numbers $\{0, 1, 2, \ldots\}$
$\emptyset$ Empty set
$\lvert S \rvert$ Cardinality of set $S$
$S^c$ Complement of set $S$
$\in, \notin$ Element of, not an element of
$\subset, \subseteq$ Proper subset, subset
$\cup, \cap$ Union, intersection
$\setminus$ Set difference: $A \setminus B = \{x \in A : x \notin B\}$
$\times$ Cartesian product
$2^S$ Power set of $S$

A.3 Linear Algebra

Vectors

Symbol Meaning
$\mathbf{x} \in \mathbb{R}^n$ Column vector with $n$ components
$\mathbf{x}^\top$ Row vector (transpose of $\mathbf{x}$)
$x_j$ or $[\mathbf{x}]_j$ $j$-th component of $\mathbf{x}$
$\mathbf{e}_j$ Standard basis vector ($1$ in position $j$, $0$ elsewhere)
$\mathbf{0}$ Zero vector
$\mathbf{1}$ All-ones vector
$\mathbf{x} \cdot \mathbf{y}$ or $\mathbf{x}^\top \mathbf{y}$ Dot product $\sum_j x_j y_j$
$\mathbf{x} \odot \mathbf{y}$ Hadamard (element-wise) product
$\lVert \mathbf{x} \rVert_2$ Euclidean ($\ell_2$) norm: $\sqrt{\sum_j x_j^2}$
$\lVert \mathbf{x} \rVert_1$ $\ell_1$ norm: $\sum_j \lvert x_j \rvert$
$\lVert \mathbf{x} \rVert_\infty$ $\ell_\infty$ norm: $\max_j \lvert x_j \rvert$
$\lVert \mathbf{x} \rVert_p$ $\ell_p$ norm: $\left(\sum_j \lvert x_j \rvert^p\right)^{1/p}$
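The vector operations and norms above map directly onto NumPy calls. A minimal sketch, assuming NumPy is available; the array values are illustrative, not from the text:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
y = np.array([1.0, 2.0, 5.0])

dot = x @ y                              # x . y = 3*1 + (-4)*2 + 0*5 = -5
had = x * y                              # Hadamard (element-wise) product
l2 = np.linalg.norm(x)                   # sqrt(3^2 + 4^2) = 5
l1 = np.linalg.norm(x, ord=1)            # |3| + |-4| + |0| = 7
linf = np.linalg.norm(x, ord=np.inf)     # max_j |x_j| = 4
l3 = np.sum(np.abs(x) ** 3) ** (1 / 3)   # general ell_p norm with p = 3

assert dot == -5.0 and l2 == 5.0 and l1 == 7.0 and linf == 4.0
```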

Matrices

Symbol Meaning
$\mathbf{A} \in \mathbb{R}^{m \times n}$ Matrix with $m$ rows and $n$ columns
$A_{ij}$ or $[\mathbf{A}]_{ij}$ Entry in row $i$, column $j$
$\mathbf{A}^\top$ Transpose: $[\mathbf{A}^\top]_{ij} = A_{ji}$
$\mathbf{A}^{-1}$ Matrix inverse (when $\mathbf{A}$ is square and non-singular)
$\mathbf{A}^{\dagger}$ Moore--Penrose pseudoinverse
$\mathbf{I}$ or $\mathbf{I}_n$ $n \times n$ identity matrix
$\operatorname{diag}(\mathbf{x})$ Diagonal matrix with $\mathbf{x}$ on its diagonal
$\operatorname{tr}(\mathbf{A})$ Trace: $\sum_i A_{ii}$
$\det(\mathbf{A})$ or $\lvert \mathbf{A} \rvert$ Determinant
$\operatorname{rank}(\mathbf{A})$ Rank
$\lVert \mathbf{A} \rVert_F$ Frobenius norm: $\sqrt{\sum_{i,j} A_{ij}^2}$
$\lVert \mathbf{A} \rVert_2$ Spectral norm: largest singular value
$\lVert \mathbf{A} \rVert_*$ Nuclear norm: sum of singular values
$\mathbf{A} \succeq 0$ $\mathbf{A}$ is positive semidefinite
$\mathbf{A} \succ 0$ $\mathbf{A}$ is positive definite
$\otimes$ Kronecker product

Decompositions

Decomposition Form Notes
Eigendecomposition $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{-1}$ $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$; requires $\mathbf{A}$ square
Symmetric eigen $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top$ $\mathbf{Q}$ orthogonal when $\mathbf{A} = \mathbf{A}^\top$
SVD $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$ $\mathbf{U} \in \mathbb{R}^{m \times m}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$, $\mathbf{V} \in \mathbb{R}^{n \times n}$; singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
Truncated SVD $\mathbf{A} \approx \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^\top$ Best rank-$k$ approximation (Eckart--Young theorem)
Cholesky $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ $\mathbf{L}$ lower triangular; requires $\mathbf{A} \succ 0$
QR $\mathbf{A} = \mathbf{Q}\mathbf{R}$ $\mathbf{Q}$ orthogonal, $\mathbf{R}$ upper triangular
LU $\mathbf{A} = \mathbf{L}\mathbf{U}$ $\mathbf{L}$ lower triangular, $\mathbf{U}$ upper triangular (with pivoting)
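Each factorization in the table can be verified numerically by reconstructing the original matrix. A sketch, assuming NumPy; the random matrix is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# SVD: A = U Sigma V^T, singular values in descending order
U, s, Vt = np.linalg.svd(A)
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)
assert np.all(np.diff(s) <= 0)          # sigma_1 >= sigma_2 >= ...

# Cholesky requires a positive definite matrix, e.g. A^T A + I
S = A.T @ A + np.eye(3)
L = np.linalg.cholesky(S)               # lower triangular factor
assert np.allclose(S, L @ L.T)

# QR: Q has orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(A)
assert np.allclose(A, Q @ R)
```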

A.4 Calculus and Analysis

Derivatives

Symbol Meaning
$\frac{df}{dx}$ or $f'(x)$ Derivative of scalar function $f$ with respect to scalar $x$
$\frac{\partial f}{\partial x_j}$ Partial derivative of $f$ with respect to $x_j$
$\nabla f(\mathbf{x})$ or $\nabla_{\mathbf{x}} f$ Gradient: $\left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top \in \mathbb{R}^n$
$\nabla^2 f(\mathbf{x})$ or $\mathbf{H}$ Hessian: $[\mathbf{H}]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \in \mathbb{R}^{n \times n}$
$\mathbf{J}$ or $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ Jacobian of $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$: $[\mathbf{J}]_{ij} = \frac{\partial f_i}{\partial x_j} \in \mathbb{R}^{m \times n}$
$\frac{\partial \mathcal{L}}{\partial \mathbf{W}}$ Matrix derivative: $\left[\frac{\partial \mathcal{L}}{\partial W_{ij}}\right] \in \mathbb{R}^{m \times n}$
$\nabla_{\mathbf{x}} f \big\rvert_{\mathbf{x}=\mathbf{a}}$ Gradient evaluated at $\mathbf{x} = \mathbf{a}$

Common Identities Used in This Textbook

$$ \nabla_{\mathbf{x}} (\mathbf{a}^\top \mathbf{x}) = \mathbf{a} $$

$$ \nabla_{\mathbf{x}} (\mathbf{x}^\top \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x} $$

$$ \nabla_{\mathbf{x}} \lVert \mathbf{x} \rVert_2^2 = 2\mathbf{x} $$

$$ \frac{\partial}{\partial \mathbf{X}} \operatorname{tr}(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}^\top \mathbf{B}^\top $$
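These identities can be checked against central finite differences. A sketch, assuming NumPy; the matrices and the test point are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

assert np.allclose(num_grad(lambda x: a @ x, x), a, atol=1e-4)
assert np.allclose(num_grad(lambda x: x @ A @ x, x), (A + A.T) @ x, atol=1e-4)
assert np.allclose(num_grad(lambda x: x @ x, x), 2 * x, atol=1e-4)

# The trace identity, via an element-wise matrix finite difference
B = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
G = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1e-6
        G[i, j] = (np.trace(A @ (X + E) @ B)
                   - np.trace(A @ (X - E) @ B)) / 2e-6
assert np.allclose(G, A.T @ B.T, atol=1e-4)
```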

Chain Rule (Multivariable)

For $f: \mathbb{R}^n \to \mathbb{R}$ and $\mathbf{g}: \mathbb{R}^m \to \mathbb{R}^n$, the gradient of the composition $f \circ \mathbf{g}$ with respect to $\mathbf{x} \in \mathbb{R}^m$ is

$$ \nabla_{\mathbf{x}} (f \circ \mathbf{g}) = \mathbf{J}_{\mathbf{g}}^\top \, \nabla_{\mathbf{g}} f $$

where $\mathbf{J}_{\mathbf{g}} \in \mathbb{R}^{n \times m}$ is the Jacobian of $\mathbf{g}$.

This is the mathematical foundation of backpropagation (Chapters 4--5).
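The chain rule can be checked concretely. A sketch, assuming NumPy; the functions $f$ and $\mathbf{g}$ below are invented for illustration:

```python
import numpy as np

def g(x):                        # g: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):                        # f: R^3 -> R
    return u[0] + 2 * u[1] * u[2]

x = np.array([0.5, -1.2])

# Jacobian of g at x (rows = outputs, columns = inputs)
Jg = np.array([[x[1],         x[0]],
               [np.cos(x[0]), 0.0],
               [0.0,          2 * x[1]]])
u = g(x)
grad_f = np.array([1.0, 2 * u[2], 2 * u[1]])   # gradient of f at g(x)

analytic = Jg.T @ grad_f                       # J_g^T grad_g f

# Finite-difference check on the composition f(g(x))
eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(analytic, numeric, atol=1e-4)
```

Backpropagation applies this same Jacobian-transpose product layer by layer, from the loss back to the inputs.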


A.5 Probability and Statistics

Core Notation

Symbol Meaning
$\Omega$ Sample space
$P(A)$ or $\mathbb{P}(A)$ Probability of event $A$
$p(x)$ Probability density function (continuous) or probability mass function (discrete)
$p(x, y)$ Joint distribution of $x$ and $y$
$p(x \mid y)$ Conditional distribution of $x$ given $y$
$p(x; \theta)$ or $p_\theta(x)$ Distribution of $x$ parameterized by $\theta$
$X \sim p$ Random variable $X$ distributed according to $p$
$X \perp\!\!\!\perp Y$ $X$ and $Y$ are independent
$X \perp\!\!\!\perp Y \mid Z$ $X$ and $Y$ are conditionally independent given $Z$
$F(x) = P(X \leq x)$ Cumulative distribution function (CDF)
$F^{-1}(q)$ Quantile function (inverse CDF)

Expectations and Moments

Symbol Meaning
$\mathbb{E}[X]$ or $\mathbb{E}_{p}[X]$ Expected value of $X$ under distribution $p$
$\mathbb{E}[X \mid Y]$ Conditional expectation
$\operatorname{Var}(X)$ Variance: $\mathbb{E}[(X - \mathbb{E}[X])^2]$
$\operatorname{Std}(X)$ or $\sigma_X$ Standard deviation: $\sqrt{\operatorname{Var}(X)}$
$\operatorname{Cov}(X, Y)$ Covariance: $\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$
$\operatorname{Corr}(X, Y)$ or $\rho_{XY}$ Correlation: $\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$
$\boldsymbol{\Sigma}$ Covariance matrix: $[\boldsymbol{\Sigma}]_{ij} = \operatorname{Cov}(X_i, X_j)$
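The moment definitions above can be recovered empirically from samples. A sketch, assuming NumPy; the sample size and distribution parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=1.0, scale=2.0, size=200_000)

mean = X.mean()                          # estimates E[X] = 1
var = np.mean((X - mean) ** 2)           # estimates Var(X) = 4
std = np.sqrt(var)                       # estimates Std(X) = 2

# Y = 0.5 X + noise, so Cov(X, Y) = 0.5 Var(X) = 2
Y = 0.5 * X + rng.normal(size=X.shape)
cov = np.mean((X - X.mean()) * (Y - Y.mean()))
corr = cov / (X.std() * Y.std())         # = 2 / (2 sqrt(2)) ~ 0.707

assert abs(mean - 1.0) < 0.05 and abs(var - 4.0) < 0.2
assert abs(cov - 2.0) < 0.1 and abs(corr - 0.7071) < 0.02
```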

Named Distributions

Notation Distribution Parameters
$\mathcal{N}(\mu, \sigma^2)$ Univariate normal Mean $\mu$, variance $\sigma^2$
$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ Multivariate normal Mean vector $\boldsymbol{\mu}$, covariance matrix $\boldsymbol{\Sigma}$
$\operatorname{Bern}(p)$ Bernoulli Success probability $p$
$\operatorname{Bin}(n, p)$ Binomial Number of trials $n$, success probability $p$
$\operatorname{Cat}(\boldsymbol{\pi})$ Categorical Probability vector $\boldsymbol{\pi}$
$\operatorname{Multi}(n, \boldsymbol{\pi})$ Multinomial Count $n$, probability vector $\boldsymbol{\pi}$
$\operatorname{Pois}(\lambda)$ Poisson Rate $\lambda$
$\operatorname{Exp}(\lambda)$ Exponential Rate $\lambda$
$\operatorname{Gamma}(\alpha, \beta)$ Gamma Shape $\alpha$, rate $\beta$
$\operatorname{Beta}(\alpha, \beta)$ Beta Shape parameters $\alpha, \beta$
$\operatorname{Dir}(\boldsymbol{\alpha})$ Dirichlet Concentration $\boldsymbol{\alpha}$
$\operatorname{Unif}(a, b)$ Uniform Endpoints $a, b$
$\operatorname{Laplace}(\mu, b)$ Laplace Location $\mu$, scale $b$
$t_\nu$ Student's $t$ Degrees of freedom $\nu$
$\chi^2_k$ Chi-squared Degrees of freedom $k$

Bayesian Inference

Symbol Meaning
$p(\theta)$ Prior distribution over parameters
$p(\mathcal{D} \mid \theta)$ or $\mathcal{L}(\theta; \mathcal{D})$ Likelihood function
$p(\theta \mid \mathcal{D})$ Posterior distribution
$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) p(\theta) \, d\theta$ Marginal likelihood (evidence)
$q(\theta)$ Variational approximation to the posterior
$\hat{\theta}_{\text{MAP}}$ Maximum a posteriori estimate: $\arg\max_\theta p(\theta \mid \mathcal{D})$
$\hat{\theta}_{\text{MLE}}$ Maximum likelihood estimate: $\arg\max_\theta p(\mathcal{D} \mid \theta)$

A.6 Information Theory

Symbol Definition Notes
$H(X)$ $-\sum_x p(x) \log p(x)$ Shannon entropy (discrete); use $\log_2$ for bits, $\ln$ for nats
$h(X)$ $-\int p(x) \log p(x) \, dx$ Differential entropy (continuous)
$H(X \mid Y)$ $-\sum_{x,y} p(x,y) \log p(x \mid y)$ Conditional entropy
$D_{\text{KL}}(p \,\|\, q)$ $\sum_x p(x) \log \frac{p(x)}{q(x)}$ Kullback--Leibler divergence; $\geq 0$, not symmetric
$I(X; Y)$ $D_{\text{KL}}\bigl(p(x,y) \,\|\, p(x)p(y)\bigr)$ Mutual information: $H(X) - H(X \mid Y)$
$H_{\text{cross}}(p, q)$ $-\sum_x p(x) \log q(x)$ Cross-entropy; equals $H(p) + D_{\text{KL}}(p \,\|\, q)$
$\text{ELBO}$ $\mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$ Evidence lower bound (variational inference)

Key identity. Cross-entropy loss for classification is the empirical cross-entropy between the true label distribution $p$ and the model's predicted distribution $q$:

$$ \mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)} $$
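The entropy, KL, and cross-entropy definitions, together with the identity $H_{\text{cross}}(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)$, can be checked directly. A sketch, assuming NumPy and natural logarithms (nats); the two distributions are illustrative:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))             # Shannon entropy of p
kl = np.sum(p * np.log(p / q))           # D_KL(p || q), always >= 0
cross = -np.sum(p * np.log(q))           # cross-entropy H_cross(p, q)

assert kl >= 0                           # non-negativity of KL
assert np.isclose(cross, H_p + kl)       # H(p,q) = H(p) + D_KL(p||q)
```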


A.7 Optimization

Objective Functions

Symbol Meaning
$\mathcal{L}(\theta)$ or $J(\theta)$ Loss (objective) function to be minimized
$\mathcal{R}(\theta)$ or $\Omega(\theta)$ Regularization term
$\lambda$ Regularization strength (hyperparameter)
$\arg\min_\theta f(\theta)$ Value of $\theta$ that minimizes $f$
$\arg\max_\theta f(\theta)$ Value of $\theta$ that maximizes $f$

Common Loss Functions

Name Formula
Mean squared error (MSE) $\frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2$
Binary cross-entropy $-\frac{1}{N}\sum_{i=1}^{N}\bigl[y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\bigr]$
Categorical cross-entropy $-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)}$
Huber loss $\begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } \lvert y - \hat{y} \rvert \leq \delta \\ \delta(\lvert y - \hat{y}\rvert - \frac{1}{2}\delta) & \text{otherwise}\end{cases}$
$\ell_2$ regularization $\frac{\lambda}{2}\lVert \theta \rVert_2^2$
$\ell_1$ regularization $\lambda \lVert \theta \rVert_1$
Elastic net $\lambda_1 \lVert \theta \rVert_1 + \frac{\lambda_2}{2} \lVert \theta \rVert_2^2$

Gradient Descent Variants

Update Rule Formula
SGD $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$
SGD with momentum $\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla \mathcal{L}(\theta_t); \quad \theta_{t+1} = \theta_t - \eta \mathbf{v}_{t+1}$
Adam $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t; \quad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2; \quad \theta_t = \theta_{t-1} - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$

where $\eta$ is the learning rate, $\mathbf{g}_t = \nabla_\theta \mathcal{L}(\theta_t)$, and $\hat{\mathbf{m}}_t, \hat{\mathbf{v}}_t$ are bias-corrected estimates.
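The Adam update above, including the bias correction, can be sketched as follows (assuming NumPy; the hyperparameter values are the common defaults, used here purely for illustration):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and optimizer state."""
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m_hat_t
    v_hat = v / (1 - beta2 ** t)              # bias-corrected v_hat_t
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta||^2 (gradient 2*theta) for a few steps
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
assert np.linalg.norm(theta) < np.linalg.norm([1.0, -2.0])
```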

Constrained Optimization

Symbol Meaning
$\min_\theta f(\theta) \text{ s.t. } g_i(\theta) \leq 0$ Inequality-constrained problem
$\min_\theta f(\theta) \text{ s.t. } h_j(\theta) = 0$ Equality-constrained problem
$\mathcal{L}(\theta, \boldsymbol{\lambda}, \boldsymbol{\nu})$ Lagrangian: $f(\theta) + \sum_i \lambda_i g_i(\theta) + \sum_j \nu_j h_j(\theta)$
$\lambda_i \geq 0$ Lagrange multiplier (inequality constraint)
$\nu_j$ Lagrange multiplier (equality constraint; unconstrained in sign)
KKT conditions $\nabla_\theta \mathcal{L} = 0$, $\lambda_i g_i(\theta) = 0$, $g_i(\theta) \leq 0$, $\lambda_i \geq 0$
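The KKT conditions can be verified on a small worked example (invented for illustration, not from the text): minimize $f(\theta) = \theta^2$ subject to $g(\theta) = 1 - \theta \leq 0$. The minimizer is $\theta^* = 1$ with multiplier $\lambda^* = 2$:

```python
# Lagrangian: L(theta, lam) = theta^2 + lam * (1 - theta)
theta_star, lam = 1.0, 2.0
g = 1 - theta_star                  # inequality constraint value

stationarity = 2 * theta_star - lam # dL/dtheta = 2*theta - lam
assert stationarity == 0.0          # grad_theta L = 0
assert lam * g == 0.0               # complementary slackness
assert g <= 0                       # primal feasibility
assert lam >= 0                     # dual feasibility
```

Here the constraint is active ($g(\theta^*) = 0$), so the multiplier is strictly positive; for an inactive constraint, complementary slackness would instead force $\lambda = 0$.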

A.8 Causal Inference

Potential Outcomes Framework (Rubin)

Symbol Meaning
$Y_i(1)$ Potential outcome for unit $i$ under treatment
$Y_i(0)$ Potential outcome for unit $i$ under control
$T_i$ or $W_i$ Treatment indicator ($1$ = treated, $0$ = control)
$Y_i^{\text{obs}} = T_i Y_i(1) + (1 - T_i) Y_i(0)$ Observed outcome (switching equation)
$\tau_i = Y_i(1) - Y_i(0)$ Individual treatment effect (never observable, since only one potential outcome is realized per unit)
$\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$ Average treatment effect
$\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]$ Average treatment effect on the treated
$\text{CATE}(\mathbf{x}) = \mathbb{E}[Y(1) - Y(0) \mid \mathbf{X} = \mathbf{x}]$ Conditional average treatment effect
$\text{LATE}$ Local average treatment effect (complier ATE in IV)
$e(\mathbf{x}) = P(T = 1 \mid \mathbf{X} = \mathbf{x})$ Propensity score
$(Y(0), Y(1)) \perp\!\!\!\perp T \mid \mathbf{X}$ Unconfoundedness (ignorability, selection on observables)
$0 < e(\mathbf{x}) < 1$ Overlap (positivity) assumption
SUTVA Stable unit treatment value assumption (no interference, no hidden variations of treatment)

Structural Causal Models (Pearl)

Symbol Meaning
$\mathcal{G} = (\mathbf{V}, \mathbf{E})$ Directed acyclic graph (DAG): vertices $\mathbf{V}$, edges $\mathbf{E}$
$\operatorname{Pa}(X)$ Parents of node $X$ in $\mathcal{G}$
$\operatorname{De}(X)$ Descendants of $X$ in $\mathcal{G}$
$X_j := f_j(\operatorname{Pa}(X_j), U_j)$ Structural equation for variable $X_j$
$U_j$ Exogenous (noise) variable for $X_j$
$do(X = x)$ Intervention: set $X$ to value $x$, severing incoming edges
$P(Y \mid do(X = x))$ Interventional distribution (causal effect of $X$ on $Y$)
$P(Y_x = y)$ Counterfactual: probability that $Y$ would be $y$ had $X$ been $x$
$X \to Y$ Direct edge (direct causal relationship) in DAG
$X \leftarrow Z \rightarrow Y$ Fork (common cause / confounder $Z$)
$X \rightarrow Z \rightarrow Y$ Chain (mediator $Z$)
$X \rightarrow Z \leftarrow Y$ Collider at $Z$
$X \perp_d Y \mid \mathbf{Z}$ $d$-separation: $\mathbf{Z}$ blocks all paths between $X$ and $Y$

Key Identification Formulas

Backdoor adjustment (when $\mathbf{Z}$ satisfies the backdoor criterion relative to $(X, Y)$):

$$ P(Y \mid do(X = x)) = \sum_{\mathbf{z}} P(Y \mid X = x, \mathbf{Z} = \mathbf{z}) \, P(\mathbf{Z} = \mathbf{z}) $$

Frontdoor adjustment (when $M$ satisfies the frontdoor criterion):

$$ P(Y \mid do(X = x)) = \sum_m P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') P(X = x') $$

Inverse probability weighting (IPW):

$$ \hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{T_i Y_i}{e(\mathbf{X}_i)} - \frac{(1 - T_i) Y_i}{1 - e(\mathbf{X}_i)}\right] $$

Doubly robust (augmented IPW):

$$ \hat{\tau}_{\text{DR}} = \frac{1}{N} \sum_{i=1}^{N} \left[\hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i) + \frac{T_i(Y_i - \hat{\mu}_1(\mathbf{X}_i))}{e(\mathbf{X}_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(\mathbf{X}_i))}{1 - e(\mathbf{X}_i)}\right] $$
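These estimators can be exercised on synthetic data. A sketch of the IPW estimator above, assuming NumPy and using the true propensity score; the data-generating process (a constant effect $\tau = 2$ with a logistic propensity) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
X = rng.uniform(-1, 1, size=N)
e = 1 / (1 + np.exp(-X))             # true propensity score e(x)
T = rng.binomial(1, e)               # treatment assignment
tau = 2.0                            # true constant treatment effect
Y = 1.0 + X + tau * T + rng.normal(size=N)

# Inverse probability weighting estimate of the ATE
ipw = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
assert abs(ipw - tau) < 0.1
```

In practice $e(\mathbf{x})$ must itself be estimated, which is where the doubly robust form earns its name: it remains consistent if either the propensity model or the outcome models $\hat{\mu}_0, \hat{\mu}_1$ are correctly specified.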


A.9 Deep Learning Notation

Symbol Meaning
$\mathbf{x}^{(i)} \in \mathbb{R}^d$ Input feature vector for sample $i$
$y^{(i)}$ Target label for sample $i$
$\mathbf{W}^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$ Weight matrix for layer $l$
$\mathbf{b}^{[l]} \in \mathbb{R}^{n_l}$ Bias vector for layer $l$
$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$ Pre-activation at layer $l$
$\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$ Post-activation at layer $l$
$\sigma(\cdot)$ Activation function (context-dependent: ReLU, sigmoid, tanh, etc.)
$\operatorname{softmax}(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{k} e^{z_k}}$ Softmax function
$\theta$ All trainable parameters collectively
$B$ Mini-batch size
$\eta$ or $\alpha$ Learning rate
$\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ $\operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$
$d_k$ Dimension of key vectors
$\mathbf{Q}, \mathbf{K}, \mathbf{V}$ Query, key, and value matrices
$\operatorname{BN}(\mathbf{z})$ Batch normalization
$\operatorname{LN}(\mathbf{z})$ Layer normalization
$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}}$ Gradient of loss with respect to weights at layer $l$ (computed via backprop)
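The layer notation above can be traced through a concrete forward pass. A sketch, assuming NumPy; the layer sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(5)                           # input a^[0] in R^5

W1, b1 = rng.standard_normal((8, 5)), np.zeros(8)    # layer 1: 5 -> 8
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)    # layer 2: 8 -> 3

z1 = W1 @ x + b1                  # pre-activation  z^[1] = W^[1] a^[0] + b^[1]
a1 = np.maximum(0, z1)            # post-activation a^[1] (sigma = ReLU)
z2 = W2 @ a1 + b2                 # pre-activation  z^[2]

# Numerically stable softmax over the output logits
y_hat = np.exp(z2 - z2.max())
y_hat /= y_hat.sum()

assert np.isclose(y_hat.sum(), 1.0) and np.all(y_hat > 0)
```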

A.10 Commonly Used Activation Functions

Name Formula Range
Sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ $(0, 1)$
Tanh $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ $(-1, 1)$
ReLU $\operatorname{ReLU}(z) = \max(0, z)$ $[0, \infty)$
Leaky ReLU $\max(\alpha z, z)$, $\alpha \ll 1$ $(-\infty, \infty)$
GELU $z \cdot \Phi(z)$ where $\Phi$ is the standard normal CDF $\approx (-0.17, \infty)$
SiLU / Swish $z \cdot \sigma(z)$ $\approx (-0.28, \infty)$
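The activations above are one-liners in code. A sketch, assuming NumPy, with GELU written via the error function since $\Phi(z) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(z/\sqrt{2})\bigr)$; the test values are illustrative:

```python
import numpy as np
from math import erf

def sigmoid(z):          return 1 / (1 + np.exp(-z))
def relu(z):             return np.maximum(0, z)
def leaky_relu(z, a=0.01): return np.where(z > 0, z, a * z)
def gelu(z):             return z * 0.5 * (1 + erf(z / np.sqrt(2)))
def silu(z):             return z * sigmoid(z)

z = 1.5
assert 0 < sigmoid(z) < 1                   # range (0, 1)
assert relu(-2.0) == 0.0 and relu(z) == z   # range [0, inf)
assert np.isclose(leaky_relu(-2.0), -0.02)  # small negative slope
assert 0 < gelu(z) < z                      # shrinks toward Phi(z) * z
assert np.isclose(silu(z), z * sigmoid(z))
```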

A.11 Notation Index

For quick lookup, symbols are listed alphabetically by their LaTeX command name.

Symbol Section Meaning
$\text{ATE}$ A.8 Average treatment effect
$\text{CATE}(\mathbf{x})$ A.8 Conditional average treatment effect
$D_{\text{KL}}$ A.6 Kullback--Leibler divergence
$do(\cdot)$ A.8 Intervention operator (Pearl)
$e(\mathbf{x})$ A.8 Propensity score
$\text{ELBO}$ A.6 Evidence lower bound
$H(X)$ A.6 Shannon entropy
$\mathbf{H}$ A.4 Hessian matrix
$I(X; Y)$ A.6 Mutual information
$\mathbf{J}$ A.4 Jacobian matrix
$\mathcal{L}$ A.5 / A.7 Likelihood, loss function, or Lagrangian (context-dependent)
$\nabla$ A.4 Gradient operator
$\boldsymbol{\Sigma}$ A.3 / A.5 Diagonal matrix of singular values / covariance matrix
$Y(0), Y(1)$ A.8 Potential outcomes

When notation appears ambiguous in the main text, the intended meaning is always clarified locally. This appendix serves as the canonical reference for the default interpretation of each symbol.