Appendix A: Mathematical Foundations

This appendix provides a concise review of the mathematical concepts used throughout the book. It is intended as a reference rather than a comprehensive tutorial. Readers seeking deeper treatment should consult the resources listed in Appendix H.


A.1 Notation Guide

Throughout this text, we adopt the following notational conventions. A more detailed symbol-by-symbol reference is provided in Appendix F.

Scalars, Vectors, and Matrices

  • Scalars are denoted by lowercase italic letters: $x$, $y$, $\theta$.
  • Vectors are denoted by lowercase bold letters: $\mathbf{x}$, $\mathbf{w}$, $\mathbf{v}$.
  • Matrices are denoted by uppercase bold letters: $\mathbf{A}$, $\mathbf{X}$, $\mathbf{\Sigma}$.
  • The $i$-th element of vector $\mathbf{x}$ is written $x_i$.
  • The element in row $i$ and column $j$ of matrix $\mathbf{A}$ is written $a_{ij}$ or $[\mathbf{A}]_{ij}$.

Sets and Indices

  • Sets are denoted by calligraphic uppercase letters: $\mathcal{S}$, $\mathcal{T}$, $\mathcal{P}$.
  • The set of real numbers is $\mathbb{R}$; the set of positive integers is $\mathbb{Z}^+$.
  • We use $i \in \{1, 2, \ldots, n\}$ to index observations and $j \in \{1, 2, \ldots, p\}$ to index features.

Common Operators

Symbol | Meaning
$\sum_{i=1}^{n}$ | Summation over index $i$ from 1 to $n$
$\prod_{i=1}^{n}$ | Product over index $i$ from 1 to $n$
$\frac{\partial f}{\partial x}$ | Partial derivative of $f$ with respect to $x$
$\nabla f$ | Gradient of $f$
$\|\mathbf{x}\|$ | Euclidean norm of vector $\mathbf{x}$
$\mathbf{x}^\top$ | Transpose of vector or matrix
$\mathbb{E}[X]$ | Expected value of random variable $X$
$\text{Var}(X)$ | Variance of random variable $X$
$P(A)$ | Probability of event $A$
$P(A \mid B)$ | Conditional probability of $A$ given $B$

Soccer-Specific Conventions

  • Pitch coordinates: origin at the bottom-left corner of the pitch, $x$-axis running along the length (0 to 120 yards or 0 to 105 meters), $y$-axis running along the width (0 to 80 yards or 0 to 68 meters).
  • Time: match time $t$ measured in minutes from kickoff, with $t \in [0, 90]$ for regulation time, extended as needed for stoppage time and extra time.
  • Player $k$: the $k$-th player in a squad or lineup, with $k \in \{1, 2, \ldots, 11\}$ for the starting eleven.
  • Match $m$: the $m$-th match in a dataset or season.
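These conventions can be made concrete in code. Below is a minimal sketch (not from the book) of converting between the two pitch coordinate systems above by independent linear scaling of each axis; the function name `rescale_coords` is hypothetical, and real pitch dimensions vary by stadium, so this is an approximation.

```python
# Sketch: rescale pitch coordinates from a 120x80-yard system to a
# 105x68-meter system by independent linear scaling of each axis.
# This is a common approximation, not an exact geometric conversion.

def rescale_coords(x, y, from_dims=(120.0, 80.0), to_dims=(105.0, 68.0)):
    """Map (x, y) from one pitch coordinate system to another."""
    sx = to_dims[0] / from_dims[0]
    sy = to_dims[1] / from_dims[1]
    return x * sx, y * sy

# The penalty spot on a 120x80 pitch (x = 120 - 12 = 108, y = 40)
# maps to roughly (94.5, 34.0) on a 105x68 pitch.
print(rescale_coords(108, 40))
```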

A.2 Probability Fundamentals

A.2.1 Sample Spaces and Events

A sample space $\Omega$ is the set of all possible outcomes of a random experiment. An event $A$ is a subset of $\Omega$. In soccer analytics, a sample space might be the set of all possible scorelines for a match, or the set of all locations where a shot can originate.

Example: For the number of goals scored by the home team in a match, $\Omega = \{0, 1, 2, 3, \ldots\}$, and the event "home team scores at least 2 goals" is $A = \{2, 3, 4, \ldots\}$.

A.2.2 Axioms of Probability

For any event $A \subseteq \Omega$:

  1. $P(A) \geq 0$ (non-negativity)
  2. $P(\Omega) = 1$ (normalization)
  3. For mutually exclusive events $A_1, A_2, \ldots$: $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ (countable additivity)

From these axioms, we derive:

  • $P(A^c) = 1 - P(A)$, where $A^c$ is the complement of $A$.
  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (inclusion-exclusion).
  • If $A \subseteq B$, then $P(A) \leq P(B)$ (monotonicity).

A.2.3 Conditional Probability and Bayes' Theorem

The conditional probability of $A$ given $B$ is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Bayes' Theorem allows us to update probabilities as new evidence arrives:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

More generally, using the law of total probability with a partition $\{A_1, A_2, \ldots, A_k\}$ of $\Omega$:

$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j) \, P(A_j)}$$

Application: In Chapter 20, we use Bayes' theorem to update the probability that a team will win the league given their results through matchday $t$. The prior $P(A_i)$ comes from pre-season expectations; the likelihood $P(B \mid A_i)$ quantifies how probable the observed results are under each hypothesis.
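The discrete form of Bayes' theorem above can be sketched in a few lines of Python; the priors and likelihoods below are illustrative numbers, not values from Chapter 20.

```python
# Minimal sketch of a discrete Bayes update: three hypotheses about a
# team's strength with prior beliefs, updated after observing results
# whose likelihood differs under each hypothesis.

def bayes_update(priors, likelihoods):
    """Posterior over a discrete partition via Bayes' theorem."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnorm)               # law of total probability: P(B)
    return [u / total for u in unnorm]

priors = [0.5, 0.3, 0.2]             # P(A_i): pre-season expectations
likelihoods = [0.10, 0.40, 0.70]     # P(B | A_i): prob. of observed results
posterior = bayes_update(priors, likelihoods)
print([round(p, 3) for p in posterior])  # posterior sums to 1
```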

A.2.4 Independence

Events $A$ and $B$ are independent if:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently, $P(A \mid B) = P(A)$. Independence is a strong assumption. In soccer, successive shots in a match are generally not independent because game state (score, time remaining, tactical adjustments) influences subsequent events.

A.2.5 Random Variables and Distributions

A random variable $X$ is a function from the sample space to the real numbers: $X: \Omega \to \mathbb{R}$.

Discrete random variables take countable values. The probability mass function (PMF) is:

$$p(x) = P(X = x)$$

Continuous random variables take values in an interval. The probability density function (PDF) $f(x)$ satisfies:

$$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$

The cumulative distribution function (CDF) is:

$$F(x) = P(X \leq x)$$

A.2.6 Expectation and Variance

For a discrete random variable:

$$\mathbb{E}[X] = \sum_x x \cdot p(x)$$

For a continuous random variable:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

The variance measures spread:

$$\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.

Properties of Expectation:

  • $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$ (linearity)
  • $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ (always, even if dependent)
  • $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ only if $X$ and $Y$ are independent

Properties of Variance:

  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
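The definitions and identities above can be checked numerically for a small discrete PMF (the distribution below is illustrative):

```python
# Sketch: expectation and variance of a discrete random variable from its
# PMF, using the identity Var(X) = E[X^2] - (E[X])^2.

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    ex = expectation(pmf)
    ex2 = sum(x * x * p for x, p in pmf.items())
    return ex2 - ex ** 2

pmf = {0: 0.3, 1: 0.4, 2: 0.2, 3: 0.1}   # e.g., goals in a match
mu = expectation(pmf)    # 0.4 + 0.4 + 0.3 = 1.1
var = variance(pmf)      # E[X^2] = 0.4 + 0.8 + 0.9 = 2.1; Var = 2.1 - 1.21 = 0.89
print(mu, var)
```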

A.2.7 Key Distributions for Soccer Analytics

Poisson Distribution ($X \sim \text{Pois}(\lambda)$):

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

  • Mean: $\mathbb{E}[X] = \lambda$; Variance: $\text{Var}(X) = \lambda$.
  • Models the number of goals scored by a team in a match (see Chapter 7).

Binomial Distribution ($X \sim \text{Bin}(n, p)$):

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$

  • Mean: $np$; Variance: $np(1-p)$.
  • Models the number of successful passes in $n$ attempts with success probability $p$.

Normal (Gaussian) Distribution ($X \sim \mathcal{N}(\mu, \sigma^2)$):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

  • By the Central Limit Theorem, sums and means of large samples are approximately normal.

Beta Distribution ($X \sim \text{Beta}(\alpha, \beta)$):

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1]$$

  • Useful for modeling probabilities (e.g., shot conversion rates as in Chapter 7).
  • Conjugate prior for the binomial likelihood in Bayesian analysis.

Negative Binomial Distribution ($X \sim \text{NegBin}(r, p)$):

$$P(X = k) = \binom{k + r - 1}{k} p^r (1-p)^k, \quad k = 0, 1, 2, \ldots$$

  • Generalizes the Poisson to allow overdispersion ($\text{Var}(X) > \mathbb{E}[X]$), useful when goals exhibit extra variability beyond what the Poisson predicts.
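The PMFs above are straightforward to implement with the standard library; the sketch below (illustrative parameter values) also verifies the stated means by direct summation:

```python
# Sketch: PMFs for the Poisson and binomial distributions using only the
# standard library, with numeric checks of the stated means.

from math import comb, exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 1.4                                   # e.g., a team's goal rate
# Truncating the Poisson sum at 50 loses negligible mass for small lambda.
mean_pois = sum(k * poisson_pmf(k, lam) for k in range(50))
mean_bin = sum(k * binomial_pmf(k, 20, 0.8) for k in range(21))
print(round(mean_pois, 6), round(mean_bin, 6))  # close to lambda and n*p
```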

A.3 Statistics Review

A.3.1 Descriptive Statistics

Given a sample $x_1, x_2, \ldots, x_n$:

Sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Sample variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Sample standard deviation: $s = \sqrt{s^2}$

Median: The middle value when observations are sorted. More robust to outliers than the mean, which is relevant when analyzing player performance metrics that may have extreme values.

Percentiles and Quantiles: The $p$-th percentile $Q(p)$ is the value below which $100p\%$ of observations fall. The interquartile range $\text{IQR} = Q(0.75) - Q(0.25)$ is a robust measure of spread.
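Python's `statistics` module computes all of these directly. The sample below is illustrative and includes one extreme value to show the median's robustness relative to the mean:

```python
# Sketch: the descriptive statistics above on an illustrative sample
# (e.g., shots per match), computed with the standard library.

import statistics

sample = [8, 12, 15, 9, 11, 30, 10, 13]   # one extreme value (30)

mean = statistics.mean(sample)
var = statistics.variance(sample)         # n-1 denominator (sample variance)
sd = statistics.stdev(sample)
med = statistics.median(sample)

# Quartiles: statistics.quantiles with n=4 returns 3 cut points
# (default 'exclusive' method).
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

# The outlier pulls the mean well above the median.
print(mean, med, round(sd, 2), round(iqr, 2))
```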

A.3.2 Covariance and Correlation

The sample covariance between variables $X$ and $Y$:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

The Pearson correlation coefficient:

$$r = \frac{s_{xy}}{s_x \cdot s_y}, \quad -1 \leq r \leq 1$$

Correlation measures linear association. In soccer analytics, we frequently examine the correlation between xG and actual goals, or between possession percentage and points earned.

Spearman's rank correlation is a non-parametric alternative based on the ranks of observations, robust to outliers and nonlinear relationships.
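Both coefficients can be sketched from the formulas above; the rank-based Spearman version here ignores ties for simplicity, and the data are illustrative:

```python
# Sketch: Pearson correlation from the covariance formula above, plus
# Spearman's rank correlation as the Pearson correlation of the ranks.

import statistics

def pearson(xs, ys):
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return sxy / (statistics.stdev(xs) * statistics.stdev(ys))

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)        # no tie handling in this sketch
    return r

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

xg = [0.8, 1.2, 2.1, 0.5, 1.7]    # illustrative per-match values
shots = [9, 14, 18, 6, 12]
print(round(pearson(xg, shots), 3), round(spearman(xg, shots), 3))
```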

A.3.3 Hypothesis Testing

A hypothesis test evaluates evidence against a null hypothesis $H_0$.

  1. State $H_0$ and the alternative $H_1$.
  2. Choose a significance level $\alpha$ (commonly 0.05).
  3. Compute a test statistic from the data.
  4. Determine the $p$-value: the probability of observing a test statistic as extreme as (or more extreme than) the observed value, assuming $H_0$ is true.
  5. Reject $H_0$ if $p \leq \alpha$.

Common tests used in this book:

Test | Purpose | Test Statistic
One-sample $t$-test | Compare a mean to a known value | $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Two-sample $t$-test | Compare two group means | $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
Chi-squared test | Test independence in contingency tables | $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
Likelihood ratio test | Compare nested models | $\Lambda = -2 \ln \frac{\mathcal{L}_0}{\mathcal{L}_1}$

Caution on multiple testing: When testing many hypotheses simultaneously (e.g., evaluating 500 players), the probability of at least one false positive increases dramatically. Corrections such as Bonferroni ($\alpha' = \alpha / m$ for $m$ tests) or the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR) should be applied. See Chapter 15 for a detailed discussion.
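The two corrections can be sketched as follows (illustrative p-values; this is a teaching sketch, not a production implementation):

```python
# Sketch: Bonferroni rejects at alpha/m; Benjamini-Hochberg finds the
# largest i such that the i-th smallest p-value is <= (i/m) * alpha and
# rejects everything up to that rank.

def bonferroni(pvals, alpha=0.05):
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0                           # largest rank with p_(i) <= (i/m) alpha
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.045, 0.200]
print(bonferroni(pvals))           # only p <= 0.01 survive alpha/5
print(benjamini_hochberg(pvals))   # less conservative than Bonferroni
```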

A.3.4 Confidence Intervals

A $100(1 - \alpha)\%$ confidence interval for a population mean $\mu$ (large sample):

$$\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

For small samples with unknown variance, replace $z_{\alpha/2}$ with $t_{\alpha/2, n-1}$ from the $t$-distribution.

Interpretation: If we were to repeat the experiment many times and construct a confidence interval each time, approximately $100(1 - \alpha)\%$ of those intervals would contain the true parameter.
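A minimal large-sample interval, with $z_{0.025} \approx 1.96$ hard-coded and illustrative data (e.g., a player's match ratings):

```python
# Sketch: a large-sample 95% confidence interval for a mean,
# xbar +/- z * s / sqrt(n), with z_{alpha/2} = 1.96.

import statistics
from math import sqrt

def mean_ci(sample, z=1.96):
    n = len(sample)
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / sqrt(n)   # standard error of the mean
    return xbar - z * se, xbar + z * se

sample = [6.8, 7.1, 7.4, 6.5, 7.9, 7.2, 6.9, 7.6, 7.0, 7.3]
lo, hi = mean_ci(sample)
print(round(lo, 2), round(hi, 2))   # interval centered on the sample mean
```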

A.3.5 Regression Analysis

Simple Linear Regression:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

The ordinary least squares (OLS) estimates minimize $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
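The closed-form estimates can be computed directly (illustrative data, e.g., xG per match vs. goals per match):

```python
# Sketch: OLS slope and intercept from the closed-form formulas above.

import statistics

def ols_fit(xs, ys):
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0.6, 0.9, 1.6, 2.1, 2.3]
b0, b1 = ols_fit(xs, ys)
print(round(b0, 3), round(b1, 3))
```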

Multiple Linear Regression:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

The OLS solution is:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

Logistic Regression (for binary outcomes such as goal/no-goal):

$$P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x})}}$$

The log-odds (logit) is linear in the features:

$$\log \frac{P(Y=1 \mid \mathbf{x})}{1 - P(Y=1 \mid \mathbf{x})} = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x}$$

Parameters are estimated by maximum likelihood (see Section A.6).
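The model's probability and log-odds can be sketched with hypothetical, unfitted coefficients (the numbers below are made up, not estimates from a shot model):

```python
# Sketch: logistic model probability and log-odds. The coefficients are
# hypothetical: an intercept plus a single feature whose effect is negative
# (e.g., probability falling with distance).

from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_prob(x, b0, b1):
    return sigmoid(b0 + b1 * x)

b0, b1 = 0.5, -0.2
p = predict_prob(10.0, b0, b1)        # linear predictor: 0.5 - 2.0 = -1.5
logit = log(p / (1 - p))              # recovers the linear predictor
print(round(p, 4), round(logit, 4))
```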

A.3.6 Model Evaluation

$R^2$ (Coefficient of Determination):

$$R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Adjusted $R^2$ penalizes for the number of predictors $p$:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

AIC (Akaike Information Criterion):

$$\text{AIC} = 2k - 2\ln(\hat{\mathcal{L}})$$

where $k$ is the number of parameters and $\hat{\mathcal{L}}$ is the maximized likelihood. Lower AIC indicates better model fit with appropriate complexity penalization.

BIC (Bayesian Information Criterion):

$$\text{BIC} = k \ln(n) - 2\ln(\hat{\mathcal{L}})$$

BIC penalizes model complexity more heavily than AIC for $n \geq 8$.
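Both criteria are one-liners given the maximized log-likelihood (the values below are hypothetical):

```python
# Sketch: AIC and BIC from a maximized log-likelihood, following the
# formulas above. With n = 380 > e^2, the BIC penalty exceeds the AIC's.

from math import log

def aic(k, loglik):
    return 2 * k - 2 * loglik

def bic(k, n, loglik):
    return k * log(n) - 2 * loglik

loglik = -520.3      # hypothetical maximized log-likelihood
k, n = 4, 380        # 4 parameters, 380 matches
print(round(aic(k, loglik), 1), round(bic(k, n, loglik), 1))
```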


A.4 Linear Algebra Essentials

A.4.1 Vectors and Vector Operations

A vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered list of $n$ real numbers:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Dot product (inner product):

$$\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$$

Euclidean norm:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x}^\top \mathbf{x}} = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Application: In Chapter 17, we compute distances between players on the pitch. If player $A$ is at position $\mathbf{p}_A = (x_A, y_A)$ and player $B$ at $\mathbf{p}_B = (x_B, y_B)$, the Euclidean distance is:

$$d(A, B) = \|\mathbf{p}_A - \mathbf{p}_B\| = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$$
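In Python 3.8+, this is a single standard-library call (positions below are illustrative):

```python
# Sketch: Euclidean distance between two player positions via math.dist.

from math import dist

p_a = (100.2, 35.4)   # illustrative pitch coordinates for player A
p_b = (97.1, 39.0)    # and player B
print(round(dist(p_a, p_b), 2))
```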

A.4.2 Matrices and Matrix Operations

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns.

Matrix multiplication: If $\mathbf{A} \in \mathbb{R}^{m \times p}$ and $\mathbf{B} \in \mathbb{R}^{p \times n}$, then $\mathbf{C} = \mathbf{AB} \in \mathbb{R}^{m \times n}$ with:

$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$$

Transpose: $(\mathbf{AB})^\top = \mathbf{B}^\top \mathbf{A}^\top$

Identity matrix: $\mathbf{I}_n$ is the $n \times n$ matrix with ones on the diagonal and zeros elsewhere. $\mathbf{AI} = \mathbf{IA} = \mathbf{A}$.

Inverse: If $\mathbf{A}$ is square and non-singular, $\mathbf{A}^{-1}$ satisfies $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$.

A.4.3 Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, a scalar $\lambda$ and non-zero vector $\mathbf{v}$ satisfying:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

are called an eigenvalue and eigenvector of $\mathbf{A}$, respectively.

Application: Principal Component Analysis (PCA), used in Chapter 21 for player profiling and dimensionality reduction, relies on computing the eigenvalues and eigenvectors of the covariance matrix $\mathbf{\Sigma}$. The eigenvectors define the principal component directions, and the eigenvalues quantify the variance explained by each component.
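For intuition, the leading eigenpair of a small covariance matrix can be found by power iteration; this is a teaching sketch, not the full eigendecomposition a PCA library would perform, and the matrix values are illustrative:

```python
# Sketch: leading eigenvalue/eigenvector of a symmetric 2x2 matrix by
# power iteration: repeatedly apply A and renormalize until the direction
# converges to the dominant eigenvector.

from math import sqrt

def power_iteration(A, iters=200):
    v = [1.0, 1.0]
    for _ in range(iters):
        w = [A[0][0] * v[0] + A[0][1] * v[1],
             A[1][0] * v[0] + A[1][1] * v[1]]
        norm = sqrt(w[0] ** 2 + w[1] ** 2)
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v^T A v gives the eigenvalue for the converged v.
    Av = [A[0][0] * v[0] + A[0][1] * v[1],
          A[1][0] * v[0] + A[1][1] * v[1]]
    lam = v[0] * Av[0] + v[1] * Av[1]
    return lam, v

cov = [[4.0, 1.5], [1.5, 2.0]]     # illustrative covariance matrix
lam, v = power_iteration(cov)
print(round(lam, 3))               # variance explained by PC1
```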

A.4.4 Positive Definite Matrices

A symmetric matrix $\mathbf{A}$ is positive definite if $\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0$ for all non-zero $\mathbf{x}$. Covariance matrices are always positive semi-definite. Positive definiteness of $\mathbf{X}^\top \mathbf{X}$ ensures that the inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$, and hence the OLS solution, exists.

A.4.5 Matrix Decompositions

Singular Value Decomposition (SVD): Any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be factored as:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is diagonal with non-negative singular values.

Cholesky Decomposition: For a positive definite matrix $\mathbf{A}$:

$$\mathbf{A} = \mathbf{L}\mathbf{L}^\top$$

where $\mathbf{L}$ is lower triangular. Used for efficient computation in multivariate normal sampling and Gaussian process models (Chapter 24).
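A plain, unoptimized Cholesky sketch for a small matrix, with a check that $\mathbf{L}\mathbf{L}^\top$ reproduces $\mathbf{A}$ (matrix values illustrative):

```python
# Sketch: Cholesky decomposition A = L L^T for a small positive definite
# matrix, computed column by column.

from math import sqrt

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(A[i][i] - s)      # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
# Verify: L L^T should reproduce A.
prod = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)]
print(L, prod)
```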


A.5 Calculus Reference

A.5.1 Derivatives

The derivative of $f(x)$ at point $x$ measures the instantaneous rate of change:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Common derivatives used in this book:

Function $f(x)$ | Derivative $f'(x)$
$x^n$ | $nx^{n-1}$
$e^x$ | $e^x$
$\ln x$ | $1/x$
$\frac{1}{1 + e^{-x}}$ (sigmoid) | $f(x)(1 - f(x))$
$\log(1 + e^x)$ (softplus) | $\frac{1}{1 + e^{-x}}$

Chain rule: If $y = f(g(x))$, then $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$.

Product rule: $(fg)' = f'g + fg'$.
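Derivative identities like the sigmoid row in the table above can be sanity-checked with a central finite difference:

```python
# Sketch: verify f'(x) = f(x)(1 - f(x)) for the sigmoid numerically,
# using a central difference (f(x+h) - f(x-h)) / (2h).

from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(round(numeric, 6), round(analytic, 6))   # the two agree closely
```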

A.5.2 Partial Derivatives and Gradients

For a multivariate function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$, the partial derivative with respect to $x_j$ is:

$$\frac{\partial f}{\partial x_j} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_j + h, \ldots, x_n) - f(\mathbf{x})}{h}$$

The gradient collects all partial derivatives:

$$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

The gradient points in the direction of steepest ascent.

A.5.3 The Hessian

The Hessian matrix $\mathbf{H}$ contains second-order partial derivatives:

$$[\mathbf{H}]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

The Hessian characterizes the curvature of $f$. At a critical point where $\nabla f = \mathbf{0}$:

  • If $\mathbf{H}$ is positive definite, the point is a local minimum.
  • If $\mathbf{H}$ is negative definite, the point is a local maximum.
  • If $\mathbf{H}$ is indefinite, the point is a saddle point.

A.5.4 Integration

The definite integral of $f(x)$ from $a$ to $b$:

$$\int_a^b f(x) \, dx$$

represents the signed area under the curve. Key results:

  • $\int_a^b c \, dx = c(b - a)$
  • $\int_a^b x^n \, dx = \frac{x^{n+1}}{n+1} \Big|_a^b$, for $n \neq -1$
  • $\int_a^b e^{-x} \, dx = -e^{-x} \Big|_a^b$

Application: In computing expected goals (Chapter 7), we integrate the xG probability surface over a region of the pitch to find the expected number of goals from shots originating in that region:

$$\text{xG}_{\text{region}} = \iint_{\mathcal{R}} p(x, y) \, dA$$


A.6 Optimization Basics

A.6.1 Unconstrained Optimization

We seek to minimize (or maximize) an objective function $f(\boldsymbol{\theta})$ over parameters $\boldsymbol{\theta} \in \mathbb{R}^p$.

Necessary condition for a minimum: $\nabla f(\boldsymbol{\theta}^*) = \mathbf{0}$.

Sufficient condition: $\nabla f(\boldsymbol{\theta}^*) = \mathbf{0}$ and $\mathbf{H}(\boldsymbol{\theta}^*)$ is positive definite.

A.6.2 Maximum Likelihood Estimation

Given observations $\mathbf{x} = (x_1, \ldots, x_n)$ and a parametric model with parameter $\boldsymbol{\theta}$, the likelihood function is:

$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(x_i \mid \boldsymbol{\theta})$$

The log-likelihood is often easier to work with:

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \ln f(x_i \mid \boldsymbol{\theta})$$

The maximum likelihood estimate (MLE) is:

$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$$

Example: For Poisson-distributed goal counts $x_1, \ldots, x_n$ with parameter $\lambda$:

$$\ell(\lambda) = \sum_{i=1}^{n} \left[ x_i \ln \lambda - \lambda - \ln(x_i!) \right]$$

Setting $\frac{d\ell}{d\lambda} = 0$ yields $\hat{\lambda}_{\text{MLE}} = \bar{x}$, the sample mean.
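This result can be verified numerically by scanning the log-likelihood over a grid of $\lambda$ values (goal counts below are illustrative):

```python
# Sketch: the Poisson log-likelihood from above, maximized over a grid of
# lambda values; the grid maximizer should land at (or next to) the
# sample mean.

from math import log, factorial
import statistics

def poisson_loglik(lam, data):
    return sum(x * log(lam) - lam - log(factorial(x)) for x in data)

goals = [0, 1, 1, 2, 0, 3, 1, 2]              # goals in 8 matches; mean 1.25
grid = [0.5 + 0.01 * i for i in range(200)]   # lambda from 0.5 to 2.49
best = max(grid, key=lambda lam: poisson_loglik(lam, goals))
print(best, statistics.mean(goals))
```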

A.6.3 Gradient Descent

When the MLE or other optimization problem lacks a closed-form solution, we use iterative methods. Gradient descent updates parameters by moving in the direction opposite to the gradient:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla f(\boldsymbol{\theta}^{(t)})$$

where $\eta > 0$ is the learning rate.
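The update rule can be sketched on a simple convex quadratic, $f(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
# Sketch: gradient descent theta <- theta - eta * grad(theta) on a
# one-dimensional convex quadratic with minimizer theta = 3.

def gradient_descent(grad, theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

grad = lambda t: 2.0 * (t - 3.0)
theta_hat = gradient_descent(grad, theta0=0.0)
print(round(theta_hat, 6))   # converges toward the minimizer theta = 3
```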

Stochastic Gradient Descent (SGD) approximates the gradient using a random subset (mini-batch) of data, which is essential for training neural networks on large tracking datasets (Chapter 24).

Variants:

  • Momentum: $\boldsymbol{v}^{(t+1)} = \gamma \boldsymbol{v}^{(t)} + \eta \nabla f$; $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \boldsymbol{v}^{(t+1)}$
  • Adam: Adaptive learning rates using first and second moment estimates. The default optimizer for most deep learning models in this text.

A.6.4 Regularization

To prevent overfitting, we add a penalty term to the objective:

Ridge Regression (L2):

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$

Lasso (L1):

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}$$

Elastic Net combines L1 and L2:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$$

The regularization parameter $\lambda$ controls the trade-off between fitting the data and keeping model complexity low. Cross-validation is typically used to select $\lambda$ (see Chapter 19).
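For a single centered feature with no intercept, the ridge estimate has the closed form $\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \lambda)$, which makes the shrinkage effect directly visible (illustrative data):

```python
# Sketch: one-dimensional ridge regression. Setting the derivative of
# sum (y - beta*x)^2 + lambda*beta^2 to zero gives the closed form below;
# larger lambda shrinks the coefficient toward zero.

def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]    # centered feature
ys = [-1.9, -1.1, 0.1, 0.8, 2.1]

for lam in (0.0, 1.0, 10.0):
    print(lam, round(ridge_1d(xs, ys, lam), 3))   # coefficient shrinks
```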

A.6.5 Constrained Optimization and Lagrange Multipliers

To minimize $f(\boldsymbol{\theta})$ subject to constraint $g(\boldsymbol{\theta}) = 0$, we form the Lagrangian:

$$\mathcal{L}(\boldsymbol{\theta}, \mu) = f(\boldsymbol{\theta}) + \mu \, g(\boldsymbol{\theta})$$

The optimal solution satisfies $\nabla_{\boldsymbol{\theta}} \mathcal{L} = \mathbf{0}$ and $g(\boldsymbol{\theta}) = 0$.

Application: In Chapter 25, we formulate squad optimization as a constrained problem: maximize expected performance subject to budget constraints and squad size limits.

A.6.6 Convexity

A function $f$ is convex if for all $\mathbf{x}, \mathbf{y}$ and $\lambda \in [0,1]$:

$$f(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y})$$

For convex functions, any local minimum is also a global minimum. OLS regression and logistic regression are convex optimization problems, guaranteeing that gradient-based methods find the global optimum. Neural networks are generally non-convex, so we rely on good initialization and adaptive optimizers.


This appendix provides the mathematical scaffolding for the methods developed throughout the text. For extended treatments, we recommend Strang (2019) for linear algebra, Casella and Berger (2002) for statistical theory, and Boyd and Vandenberghe (2004) for optimization. Full citations appear in Appendix H.