Appendix A: Mathematical Notation Reference
A unified guide to every symbol, convention, and notational pattern used in this textbook.
This appendix consolidates the mathematical notation used across all 39 chapters into a single reference. When a symbol carries different meanings in different subfields (e.g., $p$ for a probability density vs. $p$ for a prime number), the context is noted explicitly. The index in Section A.11 points to the section where each symbol is defined.
A.1 General Conventions
| Convention | Meaning |
|---|---|
| Lowercase italic ($x, y, \alpha$) | Scalars |
| Lowercase bold ($\mathbf{x}, \mathbf{w}, \boldsymbol{\mu}$) | Column vectors |
| Uppercase bold ($\mathbf{A}, \mathbf{W}, \boldsymbol{\Sigma}$) | Matrices |
| Uppercase calligraphic ($\mathcal{D}, \mathcal{L}, \mathcal{X}$) | Sets, spaces, loss functions |
| Uppercase blackboard bold ($\mathbb{R}, \mathbb{E}, \mathbb{P}$) | Number sets, expectation, probability |
| Superscript $(i)$ ($x^{(i)}, \mathbf{x}^{(i)}$) | Index over data points (sample $i$) |
| Subscript $j$ ($x_j, w_j$) | Index over features or dimensions (component $j$) |
| Superscript $[l]$ ($\mathbf{W}^{[l]}$) | Layer index in a neural network |
| Hat ($\hat{y}, \hat{\theta}$) | Estimated or predicted quantity |
| Star ($\theta^*, \mathbf{x}^*$) | Optimal value |
| Bar ($\bar{x}$) | Sample mean |
| Tilde ($\tilde{x}, \tilde{\mathbf{x}}$) | Noisy, corrupted, or alternative version |
| $:=$ | Defined as |
A.2 Sets and Number Systems
| Symbol | Meaning |
|---|---|
| $\mathbb{R}$ | Real numbers |
| $\mathbb{R}^n$ | $n$-dimensional real vector space |
| $\mathbb{R}^{m \times n}$ | Space of $m \times n$ real matrices |
| $\mathbb{R}_{\geq 0}$ | Non-negative reals |
| $\mathbb{Z}$ | Integers |
| $\mathbb{N}$ | Natural numbers $\{0, 1, 2, \ldots\}$ |
| $\emptyset$ | Empty set |
| $\lvert S \rvert$ | Cardinality of set $S$ |
| $S^c$ | Complement of set $S$ |
| $\in, \notin$ | Element of, not an element of |
| $\subset, \subseteq$ | Proper subset, subset |
| $\cup, \cap$ | Union, intersection |
| $\setminus$ | Set difference: $A \setminus B = \{x \in A : x \notin B\}$ |
| $\times$ | Cartesian product |
| $2^S$ | Power set of $S$ |
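These set operations map directly onto Python's built-in `set` type; a minimal illustrative sketch (the variable names and the `power_set` helper are our own, not from the main text):

```python
from itertools import chain, combinations

A = {1, 2, 3}
B = {3, 4}

union = A | B                               # A ∪ B
intersection = A & B                        # A ∩ B
difference = A - B                          # A \ B = {x ∈ A : x ∉ B}
cartesian = {(a, b) for a in A for b in B}  # Cartesian product A × B

def power_set(S):
    """All subsets of S, i.e. the power set 2^S (it has 2**|S| elements)."""
    items = list(S)
    return [set(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]
```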
A.3 Linear Algebra
Vectors
| Symbol | Meaning |
|---|---|
| $\mathbf{x} \in \mathbb{R}^n$ | Column vector with $n$ components |
| $\mathbf{x}^\top$ | Row vector (transpose of $\mathbf{x}$) |
| $x_j$ or $[\mathbf{x}]_j$ | $j$-th component of $\mathbf{x}$ |
| $\mathbf{e}_j$ | Standard basis vector ($1$ in position $j$, $0$ elsewhere) |
| $\mathbf{0}$ | Zero vector |
| $\mathbf{1}$ | All-ones vector |
| $\mathbf{x} \cdot \mathbf{y}$ or $\mathbf{x}^\top \mathbf{y}$ | Dot product $\sum_j x_j y_j$ |
| $\mathbf{x} \odot \mathbf{y}$ | Hadamard (element-wise) product |
| $\lVert \mathbf{x} \rVert_2$ | Euclidean ($\ell_2$) norm: $\sqrt{\sum_j x_j^2}$ |
| $\lVert \mathbf{x} \rVert_1$ | $\ell_1$ norm: $\sum_j \lvert x_j \rvert$ |
| $\lVert \mathbf{x} \rVert_\infty$ | $\ell_\infty$ norm: $\max_j \lvert x_j \rvert$ |
| $\lVert \mathbf{x} \rVert_p$ | $\ell_p$ norm: $\left(\sum_j \lvert x_j \rvert^p\right)^{1/p}$ |
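The vector norms above can be cross-checked numerically; a small NumPy sketch (`np.linalg.norm` covers the $\ell_1$, $\ell_2$, and $\ell_\infty$ cases directly):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # Euclidean norm: sqrt(3² + 4²) = 5
l1 = np.linalg.norm(x, ord=1)         # Σ_j |x_j| = 7
linf = np.linalg.norm(x, ord=np.inf)  # max_j |x_j| = 4
p = 3
lp = np.sum(np.abs(x) ** p) ** (1 / p)  # general ℓ_p norm

dot = x @ x       # xᵀx, which equals ‖x‖₂²
hadamard = x * x  # element-wise (Hadamard) product
```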
Matrices
| Symbol | Meaning |
|---|---|
| $\mathbf{A} \in \mathbb{R}^{m \times n}$ | Matrix with $m$ rows and $n$ columns |
| $A_{ij}$ or $[\mathbf{A}]_{ij}$ | Entry in row $i$, column $j$ |
| $\mathbf{A}^\top$ | Transpose: $[\mathbf{A}^\top]_{ij} = A_{ji}$ |
| $\mathbf{A}^{-1}$ | Matrix inverse (when $\mathbf{A}$ is square and non-singular) |
| $\mathbf{A}^{\dagger}$ | Moore--Penrose pseudoinverse |
| $\mathbf{I}$ or $\mathbf{I}_n$ | $n \times n$ identity matrix |
| $\operatorname{diag}(\mathbf{x})$ | Diagonal matrix with $\mathbf{x}$ on its diagonal |
| $\operatorname{tr}(\mathbf{A})$ | Trace: $\sum_i A_{ii}$ |
| $\det(\mathbf{A})$ or $\lvert \mathbf{A} \rvert$ | Determinant |
| $\operatorname{rank}(\mathbf{A})$ | Rank |
| $\lVert \mathbf{A} \rVert_F$ | Frobenius norm: $\sqrt{\sum_{i,j} A_{ij}^2}$ |
| $\lVert \mathbf{A} \rVert_2$ | Spectral norm: largest singular value |
| $\lVert \mathbf{A} \rVert_*$ | Nuclear norm: sum of singular values |
| $\mathbf{A} \succeq 0$ | $\mathbf{A}$ is positive semidefinite |
| $\mathbf{A} \succ 0$ | $\mathbf{A}$ is positive definite |
| $\otimes$ | Kronecker product |
Decompositions
| Decomposition | Form | Notes |
|---|---|---|
| Eigendecomposition | $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{-1}$ | $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$; requires $\mathbf{A}$ square and diagonalizable |
| Symmetric eigen | $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top$ | $\mathbf{Q}$ orthogonal when $\mathbf{A} = \mathbf{A}^\top$ |
| SVD | $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$ | $\mathbf{U} \in \mathbb{R}^{m \times m}$, $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$, $\mathbf{V} \in \mathbb{R}^{n \times n}$; singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ |
| Truncated SVD | $\mathbf{A} \approx \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^\top$ | Best rank-$k$ approximation (Eckart--Young theorem) |
| Cholesky | $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ | $\mathbf{L}$ lower triangular; requires $\mathbf{A} \succ 0$ |
| QR | $\mathbf{A} = \mathbf{Q}\mathbf{R}$ | $\mathbf{Q}$ orthogonal, $\mathbf{R}$ upper triangular |
| LU | $\mathbf{A} = \mathbf{L}\mathbf{U}$ | $\mathbf{L}$ lower triangular, $\mathbf{U}$ upper triangular; in practice computed with pivoting as $\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}$ |
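The SVD and its truncation can be verified in a few lines of NumPy; this sketch checks the Eckart--Young property that the Frobenius error of the best rank-$k$ approximation is the root-sum-square of the discarded singular values (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Thin SVD: A = U Σ Vᵀ with singular values in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated SVD: best rank-k approximation (Eckart–Young)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error equals sqrt(σ_{k+1}² + ... + σ_r²)
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
```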
A.4 Calculus and Analysis
Derivatives
| Symbol | Meaning |
|---|---|
| $\frac{df}{dx}$ or $f'(x)$ | Derivative of scalar function $f$ with respect to scalar $x$ |
| $\frac{\partial f}{\partial x_j}$ | Partial derivative of $f$ with respect to $x_j$ |
| $\nabla f(\mathbf{x})$ or $\nabla_{\mathbf{x}} f$ | Gradient: $\left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top \in \mathbb{R}^n$ |
| $\nabla^2 f(\mathbf{x})$ or $\mathbf{H}$ | Hessian: $[\mathbf{H}]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \in \mathbb{R}^{n \times n}$ |
| $\mathbf{J}$ or $\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ | Jacobian of $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$: $[\mathbf{J}]_{ij} = \frac{\partial f_i}{\partial x_j} \in \mathbb{R}^{m \times n}$ |
| $\frac{\partial \mathcal{L}}{\partial \mathbf{W}}$ | Matrix derivative: $\left[\frac{\partial \mathcal{L}}{\partial W_{ij}}\right] \in \mathbb{R}^{m \times n}$ |
| $\nabla_{\mathbf{x}} f \big\rvert_{\mathbf{x}=\mathbf{a}}$ | Gradient evaluated at $\mathbf{x} = \mathbf{a}$ |
Common Identities Used in This Textbook
$$ \nabla_{\mathbf{x}} (\mathbf{a}^\top \mathbf{x}) = \mathbf{a} $$
$$ \nabla_{\mathbf{x}} (\mathbf{x}^\top \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x} $$
$$ \nabla_{\mathbf{x}} \lVert \mathbf{x} \rVert_2^2 = 2\mathbf{x} $$
$$ \frac{\partial}{\partial \mathbf{X}} \operatorname{tr}(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}^\top \mathbf{B}^\top $$
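The first three identities can be checked against central finite differences; a minimal sketch (the `num_grad` helper is our own, not notation from the text):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of ∇f(x) for scalar-valued f."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n = 4
a = rng.standard_normal(n)
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

g_linear = num_grad(lambda v: a @ v, x)    # should equal a
g_quad = num_grad(lambda v: v @ A @ v, x)  # should equal (A + Aᵀ)x
g_norm = num_grad(lambda v: v @ v, x)      # should equal 2x
```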
Chain Rule (Multivariable)
For $f: \mathbb{R}^n \to \mathbb{R}$ and $\mathbf{g}: \mathbb{R}^m \to \mathbb{R}^n$:
$$ \frac{\partial f}{\partial \mathbf{x}} = \frac{\partial f}{\partial \mathbf{g}} \cdot \frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \mathbf{J}_{\mathbf{g}}^\top \nabla_{\mathbf{g}} f $$
This is the mathematical foundation of backpropagation (Chapters 4--5).
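A concrete instance of $\mathbf{J}_{\mathbf{g}}^\top \nabla_{\mathbf{g}} f$: take $\mathbf{g}(\mathbf{x}) = \mathbf{W}\mathbf{x}$ (so $\mathbf{J}_{\mathbf{g}} = \mathbf{W}$) and $f(\mathbf{u}) = \lVert \mathbf{u} \rVert_2^2$; the specific matrix and vector below are illustrative:

```python
import numpy as np

# Composite h(x) = f(g(x)) with g(x) = Wx and f(u) = ‖u‖²
W = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])  # g: R² → R³, so J_g = W
x = np.array([0.5, -2.0])

u = W @ x              # forward pass through g
grad_f = 2 * u         # ∇_g f for f(u) = uᵀu
grad_x = W.T @ grad_f  # chain rule: J_gᵀ ∇_g f  (one backprop step)

# Analytic gradient of the composite for comparison: ∇h(x) = 2 WᵀW x
analytic = 2 * W.T @ W @ x
```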
A.5 Probability and Statistics
Core Notation
| Symbol | Meaning |
|---|---|
| $\Omega$ | Sample space |
| $P(A)$ or $\mathbb{P}(A)$ | Probability of event $A$ |
| $p(x)$ | Probability density function (continuous) or probability mass function (discrete) |
| $p(x, y)$ | Joint distribution of $x$ and $y$ |
| $p(x \mid y)$ | Conditional distribution of $x$ given $y$ |
| $p(x; \theta)$ or $p_\theta(x)$ | Distribution of $x$ parameterized by $\theta$ |
| $X \sim p$ | Random variable $X$ distributed according to $p$ |
| $X \perp\!\!\!\perp Y$ | $X$ and $Y$ are independent |
| $X \perp\!\!\!\perp Y \mid Z$ | $X$ and $Y$ are conditionally independent given $Z$ |
| $F(x) = P(X \leq x)$ | Cumulative distribution function (CDF) |
| $F^{-1}(q)$ | Quantile function (inverse CDF) |
Expectations and Moments
| Symbol | Meaning |
|---|---|
| $\mathbb{E}[X]$ or $\mathbb{E}_{p}[X]$ | Expected value of $X$ under distribution $p$ |
| $\mathbb{E}[X \mid Y]$ | Conditional expectation |
| $\operatorname{Var}(X)$ | Variance: $\mathbb{E}[(X - \mathbb{E}[X])^2]$ |
| $\operatorname{Std}(X)$ or $\sigma_X$ | Standard deviation: $\sqrt{\operatorname{Var}(X)}$ |
| $\operatorname{Cov}(X, Y)$ | Covariance: $\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$ |
| $\operatorname{Corr}(X, Y)$ or $\rho_{XY}$ | Correlation: $\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$ |
| $\boldsymbol{\Sigma}$ | Covariance matrix: $[\boldsymbol{\Sigma}]_{ij} = \operatorname{Cov}(X_i, X_j)$ |
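The sample versions of these moments are one call each in NumPy; this sketch draws from a multivariate normal with a known covariance so the estimates can be compared against the truth (the specific matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
true_cov = np.array([[2.0, 0.6],
                     [0.6, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=100_000)

Sigma = np.cov(X, rowvar=False)   # sample covariance matrix
R = np.corrcoef(X, rowvar=False)  # Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)
std = X.std(axis=0, ddof=1)       # sample standard deviations
```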
Named Distributions
| Notation | Distribution | Parameters |
|---|---|---|
| $\mathcal{N}(\mu, \sigma^2)$ | Univariate normal | Mean $\mu$, variance $\sigma^2$ |
| $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ | Multivariate normal | Mean vector $\boldsymbol{\mu}$, covariance matrix $\boldsymbol{\Sigma}$ |
| $\operatorname{Bern}(p)$ | Bernoulli | Success probability $p$ |
| $\operatorname{Bin}(n, p)$ | Binomial | Number of trials $n$, success probability $p$ |
| $\operatorname{Cat}(\boldsymbol{\pi})$ | Categorical | Probability vector $\boldsymbol{\pi}$ |
| $\operatorname{Multi}(n, \boldsymbol{\pi})$ | Multinomial | Count $n$, probability vector $\boldsymbol{\pi}$ |
| $\operatorname{Pois}(\lambda)$ | Poisson | Rate $\lambda$ |
| $\operatorname{Exp}(\lambda)$ | Exponential | Rate $\lambda$ |
| $\operatorname{Gamma}(\alpha, \beta)$ | Gamma | Shape $\alpha$, rate $\beta$ |
| $\operatorname{Beta}(\alpha, \beta)$ | Beta | Shape parameters $\alpha, \beta$ |
| $\operatorname{Dir}(\boldsymbol{\alpha})$ | Dirichlet | Concentration $\boldsymbol{\alpha}$ |
| $\operatorname{Unif}(a, b)$ | Uniform | Endpoints $a, b$ |
| $\operatorname{Laplace}(\mu, b)$ | Laplace | Location $\mu$, scale $b$ |
| $t_\nu$ | Student's $t$ | Degrees of freedom $\nu$ |
| $\chi^2_k$ | Chi-squared | Degrees of freedom $k$ |
Bayesian Inference
| Symbol | Meaning |
|---|---|
| $p(\theta)$ | Prior distribution over parameters |
| $p(\mathcal{D} \mid \theta)$ or $\mathcal{L}(\theta; \mathcal{D})$ | Likelihood function |
| $p(\theta \mid \mathcal{D})$ | Posterior distribution |
| $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) p(\theta) \, d\theta$ | Marginal likelihood (evidence) |
| $q(\theta)$ | Variational approximation to the posterior |
| $\hat{\theta}_{\text{MAP}}$ | Maximum a posteriori estimate: $\arg\max_\theta p(\theta \mid \mathcal{D})$ |
| $\hat{\theta}_{\text{MLE}}$ | Maximum likelihood estimate: $\arg\max_\theta p(\mathcal{D} \mid \theta)$ |
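A worked Beta--Bernoulli example makes the prior/likelihood/posterior pipeline and the MLE-vs-MAP distinction concrete (the coin-flip data and prior hyperparameters are illustrative):

```python
# Coin-flip data: 7 heads in 10 Bernoulli(θ) trials
heads, n = 7, 10
tails = n - heads

# A Beta(α, β) prior is conjugate to the Bernoulli likelihood,
# so the posterior is Beta(α + heads, β + tails) in closed form.
alpha, beta = 2.0, 2.0
post_a, post_b = alpha + heads, beta + tails

theta_mle = heads / n                             # arg max_θ p(D | θ)
theta_map = (post_a - 1) / (post_a + post_b - 2)  # posterior mode
post_mean = post_a / (post_a + post_b)            # posterior mean
```

Note how the MAP estimate is pulled from the MLE toward the prior mean $1/2$; with more data the two coincide.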
A.6 Information Theory
| Symbol | Definition | Notes |
|---|---|---|
| $H(X)$ | $-\sum_x p(x) \log p(x)$ | Shannon entropy (discrete); use $\log_2$ for bits, $\ln$ for nats |
| $h(X)$ | $-\int p(x) \log p(x) \, dx$ | Differential entropy (continuous) |
| $H(X \mid Y)$ | $-\sum_{x,y} p(x,y) \log p(x \mid y)$ | Conditional entropy |
| $D_{\text{KL}}(p \,\|\, q)$ | $\sum_x p(x) \log \frac{p(x)}{q(x)}$ | Kullback--Leibler divergence; $\geq 0$, not symmetric |
| $I(X; Y)$ | $D_{\text{KL}}\bigl(p(x,y) \,\|\, p(x)p(y)\bigr)$ | Mutual information: $H(X) - H(X \mid Y)$ |
| $H_{\text{cross}}(p, q)$ | $-\sum_x p(x) \log q(x)$ | Cross-entropy; equals $H(p) + D_{\text{KL}}(p \,\|\, q)$ |
| $\text{ELBO}$ | $\mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})]$ | Evidence lower bound (variational inference) |
Key identity. Cross-entropy loss for classification is the empirical cross-entropy between the true label distribution $p$ and the model's predicted distribution $q$:
$$ \mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)} $$
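The identity $H_{\text{cross}}(p, q) = H(p) + D_{\text{KL}}(p \,\|\, q)$, together with the non-negativity and asymmetry of KL, can be verified directly (the two distributions below are illustrative):

```python
import numpy as np

p = np.array([0.8, 0.1, 0.1])        # "true" distribution
q = np.ones(3) / 3                   # model distribution (uniform)

H_p = -np.sum(p * np.log(p))         # Shannon entropy (nats)
kl = np.sum(p * np.log(p / q))       # D_KL(p ‖ q), always ≥ 0
cross = -np.sum(p * np.log(q))       # cross-entropy H_cross(p, q)
kl_rev = np.sum(q * np.log(q / p))   # D_KL(q ‖ p): generally different
```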
A.7 Optimization
Objective Functions
| Symbol | Meaning |
|---|---|
| $\mathcal{L}(\theta)$ or $J(\theta)$ | Loss (objective) function to be minimized |
| $\mathcal{R}(\theta)$ or $\Omega(\theta)$ | Regularization term |
| $\lambda$ | Regularization strength (hyperparameter) |
| $\arg\min_\theta f(\theta)$ | Value of $\theta$ that minimizes $f$ |
| $\arg\max_\theta f(\theta)$ | Value of $\theta$ that maximizes $f$ |
Common Loss Functions
| Name | Formula |
|---|---|
| Mean squared error (MSE) | $\frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2$ |
| Binary cross-entropy | $-\frac{1}{N}\sum_{i=1}^{N}\bigl[y^{(i)}\log\hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\bigr]$ |
| Categorical cross-entropy | $-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_c^{(i)} \log \hat{y}_c^{(i)}$ |
| Huber loss | $\begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } \lvert y - \hat{y} \rvert \leq \delta \\ \delta(\lvert y - \hat{y}\rvert - \frac{1}{2}\delta) & \text{otherwise}\end{cases}$ |
| $\ell_2$ regularization | $\frac{\lambda}{2}\lVert \theta \rVert_2^2$ |
| $\ell_1$ regularization | $\lambda \lVert \theta \rVert_1$ |
| Elastic net | $\lambda_1 \lVert \theta \rVert_1 + \frac{\lambda_2}{2} \lVert \theta \rVert_2^2$ |
Gradient Descent Variants
| Update Rule | Formula |
|---|---|
| SGD | $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$ |
| SGD with momentum | $\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla \mathcal{L}(\theta_t); \quad \theta_{t+1} = \theta_t - \eta \mathbf{v}_{t+1}$ |
| Adam | $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t; \quad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2; \quad \theta_t = \theta_{t-1} - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$ |
where $\eta$ is the learning rate, $\mathbf{g}_t = \nabla_\theta \mathcal{L}(\theta_t)$, $\mathbf{g}_t^2$ denotes the element-wise square, and $\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t)$, $\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t)$ are the bias-corrected moment estimates.
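The Adam update row translates line by line into code; this sketch applies it to a simple quadratic whose minimizer is known (the hyperparameter defaults and the test objective are illustrative, not a reference implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on parameters theta given gradient g."""
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment (element-wise g²)
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize L(θ) = ‖θ - c‖², whose gradient is 2(θ - c)
c = np.array([1.0, -2.0])
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 1001):
    g = 2 * (theta - c)
    theta, m, v = adam_step(theta, g, m, v, t)
```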
Constrained Optimization
| Symbol | Meaning |
|---|---|
| $\min_\theta f(\theta) \text{ s.t. } g_i(\theta) \leq 0$ | Inequality-constrained problem |
| $\min_\theta f(\theta) \text{ s.t. } h_j(\theta) = 0$ | Equality-constrained problem |
| $\mathcal{L}(\theta, \boldsymbol{\lambda}, \boldsymbol{\nu})$ | Lagrangian: $f(\theta) + \sum_i \lambda_i g_i(\theta) + \sum_j \nu_j h_j(\theta)$ |
| $\lambda_i \geq 0$ | Lagrange multiplier (inequality constraint) |
| $\nu_j$ | Lagrange multiplier (equality constraint; unconstrained in sign) |
| KKT conditions | $\nabla_\theta \mathcal{L} = 0$, $\lambda_i g_i(\theta) = 0$, $g_i(\theta) \leq 0$, $h_j(\theta) = 0$, $\lambda_i \geq 0$ |
A.8 Causal Inference
Potential Outcomes Framework (Rubin)
| Symbol | Meaning |
|---|---|
| $Y_i(1)$ | Potential outcome for unit $i$ under treatment |
| $Y_i(0)$ | Potential outcome for unit $i$ under control |
| $T_i$ or $W_i$ | Treatment indicator ($1$ = treated, $0$ = control) |
| $Y_i^{\text{obs}} = T_i Y_i(1) + (1 - T_i) Y_i(0)$ | Observed outcome (switching equation) |
| $\tau_i = Y_i(1) - Y_i(0)$ | Individual treatment effect (never observed directly: only one potential outcome is realized per unit) |
| $\text{ATE} = \mathbb{E}[Y(1) - Y(0)]$ | Average treatment effect |
| $\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]$ | Average treatment effect on the treated |
| $\text{CATE}(\mathbf{x}) = \mathbb{E}[Y(1) - Y(0) \mid \mathbf{X} = \mathbf{x}]$ | Conditional average treatment effect |
| $\text{LATE}$ | Local average treatment effect (complier ATE in IV) |
| $e(\mathbf{x}) = P(T = 1 \mid \mathbf{X} = \mathbf{x})$ | Propensity score |
| $(Y(0), Y(1)) \perp\!\!\!\perp T \mid \mathbf{X}$ | Unconfoundedness (ignorability, selection on observables) |
| $0 < e(\mathbf{x}) < 1$ | Overlap (positivity) assumption |
| SUTVA | Stable unit treatment value assumption (no interference, no hidden variations of treatment) |
Structural Causal Models (Pearl)
| Symbol | Meaning |
|---|---|
| $\mathcal{G} = (\mathbf{V}, \mathbf{E})$ | Directed acyclic graph (DAG): vertices $\mathbf{V}$, edges $\mathbf{E}$ |
| $\operatorname{Pa}(X)$ | Parents of node $X$ in $\mathcal{G}$ |
| $\operatorname{De}(X)$ | Descendants of $X$ in $\mathcal{G}$ |
| $X_j := f_j(\operatorname{Pa}(X_j), U_j)$ | Structural equation for variable $X_j$ |
| $U_j$ | Exogenous (noise) variable for $X_j$ |
| $do(X = x)$ | Intervention: set $X$ to value $x$, severing incoming edges |
| $P(Y \mid do(X = x))$ | Interventional distribution (causal effect of $X$ on $Y$) |
| $P(Y_x = y)$ | Counterfactual: probability that $Y$ would be $y$ had $X$ been $x$ |
| $X \to Y$ | Direct edge (direct causal relationship) in DAG |
| $X \leftarrow Z \rightarrow Y$ | Fork (common cause / confounder $Z$) |
| $X \rightarrow Z \rightarrow Y$ | Chain (mediator $Z$) |
| $X \rightarrow Z \leftarrow Y$ | Collider at $Z$ |
| $X \perp_d Y \mid \mathbf{Z}$ | $d$-separation: $\mathbf{Z}$ blocks all paths between $X$ and $Y$ |
Key Identification Formulas
Backdoor adjustment (when $\mathbf{Z}$ satisfies the backdoor criterion relative to $(X, Y)$):
$$ P(Y \mid do(X = x)) = \sum_{\mathbf{z}} P(Y \mid X = x, \mathbf{Z} = \mathbf{z}) \, P(\mathbf{Z} = \mathbf{z}) $$
Frontdoor adjustment (when $M$ satisfies the frontdoor criterion):
$$ P(Y \mid do(X = x)) = \sum_m P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') P(X = x') $$
Inverse probability weighting (IPW):
$$ \hat{\tau}_{\text{IPW}} = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{T_i Y_i}{e(\mathbf{X}_i)} - \frac{(1 - T_i) Y_i}{1 - e(\mathbf{X}_i)}\right] $$
Doubly robust (augmented IPW):
$$ \hat{\tau}_{\text{DR}} = \frac{1}{N} \sum_{i=1}^{N} \left[\hat{\mu}_1(\mathbf{X}_i) - \hat{\mu}_0(\mathbf{X}_i) + \frac{T_i(Y_i - \hat{\mu}_1(\mathbf{X}_i))}{e(\mathbf{X}_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(\mathbf{X}_i))}{1 - e(\mathbf{X}_i)}\right] $$
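A simulation makes the IPW estimator concrete: generate data with a confounder, and compare the confounded difference in means against the IPW estimate computed with the true propensity score (the data-generating process below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Confounder X drives both treatment assignment and the outcome
X = rng.standard_normal(N)
e = 1.0 / (1.0 + np.exp(-X))  # true propensity score e(X)
T = rng.binomial(1, e)
tau = 2.0                     # true (constant) treatment effect
Y = 3.0 * X + tau * T + rng.standard_normal(N)

# Naive difference in means is biased by the confounder
naive = Y[T == 1].mean() - Y[T == 0].mean()

# IPW with the true propensity score recovers the ATE
tau_ipw = np.mean(T * Y / e - (1 - T) * Y / (1 - e))
```

In practice $e(\mathbf{x})$ must itself be estimated, which is what motivates the doubly robust form above: it stays consistent if either the propensity model or the outcome model is correct.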
A.9 Deep Learning Notation
| Symbol | Meaning |
|---|---|
| $\mathbf{x}^{(i)} \in \mathbb{R}^d$ | Input feature vector for sample $i$ |
| $y^{(i)}$ | Target label for sample $i$ |
| $\mathbf{W}^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$ | Weight matrix for layer $l$ |
| $\mathbf{b}^{[l]} \in \mathbb{R}^{n_l}$ | Bias vector for layer $l$ |
| $\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$ | Pre-activation at layer $l$ |
| $\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$ | Post-activation at layer $l$ |
| $\sigma(\cdot)$ | Activation function (context-dependent: ReLU, sigmoid, tanh, etc.) |
| $\operatorname{softmax}(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{k} e^{z_k}}$ | Softmax function |
| $\theta$ | All trainable parameters collectively |
| $B$ | Mini-batch size |
| $\eta$ or $\alpha$ | Learning rate |
| $\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ | $\operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$ |
| $d_k$ | Dimension of key vectors |
| $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ | Query, key, and value matrices |
| $\operatorname{BN}(\mathbf{z})$ | Batch normalization |
| $\operatorname{LN}(\mathbf{z})$ | Layer normalization |
| $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}}$ | Gradient of loss with respect to weights at layer $l$ (computed via backprop) |
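The attention formula in the table can be written out in a few lines; this single-head, unbatched NumPy sketch checks the key invariant that each row of the attention weights is a probability distribution over keys (shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # shape (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(4)
Q = rng.standard_normal((5, 8))   # 5 queries, d_k = 8
K = rng.standard_normal((7, 8))   # 7 keys
V = rng.standard_normal((7, 16))  # 7 values, d_v = 16
out, w = attention(Q, K, V)
```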
A.10 Commonly Used Activation Functions
| Name | Formula | Range |
|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | $(0, 1)$ |
| Tanh | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $(-1, 1)$ |
| ReLU | $\operatorname{ReLU}(z) = \max(0, z)$ | $[0, \infty)$ |
| Leaky ReLU | $\max(\alpha z, z)$, $\alpha \ll 1$ | $(-\infty, \infty)$ |
| GELU | $z \cdot \Phi(z)$ where $\Phi$ is the standard normal CDF | $\approx (-0.17, \infty)$ |
| SiLU / Swish | $z \cdot \sigma(z)$ | $\approx (-0.28, \infty)$ |
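The formulas and ranges in this table can be verified numerically; the sketch below implements each activation directly from its definition (using the exact GELU via the error function rather than the common tanh approximation) and probes the minima over a grid:

```python
import numpy as np
from math import erf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def gelu(z):
    """Exact GELU: z·Φ(z), with Φ the standard normal CDF."""
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    return z * Phi

def silu(z):
    return z * sigmoid(z)

z = np.linspace(-6.0, 6.0, 10_001)
```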
A.11 Notation Index
For quick lookup, symbols are listed alphabetically.
| Symbol | Section | Meaning |
|---|---|---|
| $\text{ATE}$ | A.8 | Average treatment effect |
| $\text{CATE}(\mathbf{x})$ | A.8 | Conditional average treatment effect |
| $D_{\text{KL}}$ | A.6 | Kullback--Leibler divergence |
| $do(\cdot)$ | A.8 | Intervention operator (Pearl) |
| $e(\mathbf{x})$ | A.8 | Propensity score |
| $\text{ELBO}$ | A.6 | Evidence lower bound |
| $H(X)$ | A.6 | Shannon entropy |
| $\mathbf{H}$ | A.4 | Hessian matrix |
| $I(X; Y)$ | A.6 | Mutual information |
| $\mathbf{J}$ | A.4 | Jacobian matrix |
| $\mathcal{L}$ | A.7 | Loss function or Lagrangian (context-dependent) |
| $\nabla$ | A.4 | Gradient operator |
| $\boldsymbol{\Sigma}$ | A.3 / A.5 | Diagonal matrix of singular values / covariance matrix |
| $Y(0), Y(1)$ | A.8 | Potential outcomes |
When notation appears ambiguous in the main text, the intended meaning is always clarified locally. This appendix serves as the canonical reference for the default interpretation of each symbol.