Appendix A: Mathematical Foundations

This appendix provides the mathematical background required to fully engage with the material in this textbook. Readers with strong quantitative training may use it as a quick reference; those newer to mathematical modeling should study the relevant sections before tackling the corresponding chapters.


A.1 Notation Guide

The following table summarizes the principal symbols and conventions used throughout the book.

Symbol            Meaning                                             First Used
----------------  --------------------------------------------------  ----------
P(A)              Probability of event A                              Ch. 1
P(A|B)            Conditional probability of A given B                Ch. 2
E[X]              Expected value of random variable X                 Ch. 1
Var(X)            Variance of X                                       Ch. 3
SD(X) or sigma    Standard deviation of X                             Ch. 3
Cov(X,Y)          Covariance between X and Y                          Ch. 6
rho(X,Y)          Pearson correlation coefficient                     Ch. 6
n                 Sample size                                         Ch. 3
x-bar             Sample mean                                         Ch. 3
s                 Sample standard deviation                           Ch. 3
mu                Population mean                                     Ch. 3
sigma^2           Population variance                                 Ch. 3
H_0               Null hypothesis                                     Ch. 5
H_1 or H_a        Alternative hypothesis                              Ch. 5
alpha             Significance level (also: Type I error rate)       Ch. 5
beta              Type II error rate (also: regression coefficient)  Ch. 5
p-hat             Sample proportion                                   Ch. 4
f*                Optimal Kelly fraction                              Ch. 8
b                 Decimal odds minus 1 (net payout per unit)          Ch. 8
EV                Expected value of a bet                             Ch. 1
ROI               Return on investment                                Ch. 1
CLV               Closing line value                                  Ch. 10
theta             Model parameter (general)                           Ch. 12
L(theta)          Likelihood function                                 Ch. 12
ell(theta)        Log-likelihood function                             Ch. 12
nabla             Gradient operator                                   Ch. 14
w                 Weight vector in ML models                          Ch. 14
eta               Learning rate                                       Ch. 14
lambda            Regularization parameter (also: Poisson rate)       Ch. 13
sigmoid(z)        Logistic sigmoid function 1/(1+e^{-z})              Ch. 13
1_A               Indicator function: 1 if A is true, 0 otherwise     Ch. 2
Binom(n,p)        Binomial distribution                               Ch. 3
Pois(lambda)      Poisson distribution                                Ch. 7
N(mu, sigma^2)    Normal distribution                                 Ch. 3
Beta(a,b)         Beta distribution                                   Ch. 9
t_n               Student's t-distribution with n degrees of freedom  Ch. 5
chi^2_k           Chi-squared distribution with k degrees of freedom  Ch. 5
O(.)              Big-O asymptotic notation                           Ch. 14
argmax            Argument that maximizes a function                  Ch. 12
argmin            Argument that minimizes a function                  Ch. 14

Convention notes:

  • Bold lowercase letters (x, w) denote column vectors.
  • Bold uppercase letters (X, A) denote matrices.
  • Subscripts index elements: x_i is the i-th element of vector x; X_{ij} is the element in row i, column j of matrix X.
  • Hats (^) over parameters denote estimators or fitted values: theta-hat is an estimator of theta; y-hat is a predicted value of y.
  • A tilde (~) denotes "is distributed as": X ~ N(0,1) means X follows a standard normal distribution.

A.2 Probability Fundamentals

A.2.1 Axioms of Probability

A probability measure P defined on a sample space Omega satisfies three axioms (Kolmogorov, 1933):

  1. Non-negativity. For every event A in the sample space, P(A) >= 0.
  2. Normalization. P(Omega) = 1; some outcome in the sample space occurs with certainty.
  3. Countable Additivity. For any countable sequence of mutually exclusive events A_1, A_2, ..., the probability of their union equals the sum of their individual probabilities: P(A_1 union A_2 union ...) = P(A_1) + P(A_2) + ...

From these axioms, several results follow immediately:

  • P(A^c) = 1 - P(A), where A^c is the complement of A.
  • P(empty set) = 0.
  • If A is a subset of B, then P(A) <= P(B).
  • P(A union B) = P(A) + P(B) - P(A intersection B) (inclusion-exclusion).

A.2.2 Conditional Probability

The conditional probability of A given B is defined as:

P(A|B) = P(A intersection B) / P(B),    provided P(B) > 0.

This is the foundation of Bayesian updating and is used throughout the book when we condition on partial game information (e.g., halftime scores, injury reports).

The Law of Total Probability. If B_1, B_2, ..., B_n form a partition of the sample space, then:

P(A) = sum_{i=1}^{n} P(A|B_i) * P(B_i).

This law is essential in Chapter 9 for computing marginal probabilities from conditional models.

A.2.3 Bayes' Theorem

Combining the definition of conditional probability with the law of total probability yields Bayes' theorem:

P(B_j | A) = P(A | B_j) * P(B_j) / sum_{i=1}^{n} P(A | B_i) * P(B_i).

In betting contexts:

  • P(B_j) is the prior probability (our belief before seeing data).
  • P(A | B_j) is the likelihood (probability of the data given hypothesis B_j).
  • P(B_j | A) is the posterior probability (updated belief after the data).

Bayes' theorem is the engine behind the Bayesian rating systems (Chapter 9), in-play probability updates (Chapter 25), and model calibration (Chapter 17).
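
To make the mechanics concrete, here is a minimal Python sketch of a Bayesian update over a discrete set of hypotheses about a team's true win probability. All numbers are illustrative, not drawn from any chapter; the denominator is the law of total probability from A.2.2.

    from math import comb

    def posterior(priors, likelihoods):
        """Bayes' theorem over a discrete set of hypotheses B_1, ..., B_n."""
        # Numerators: P(A | B_j) * P(B_j).
        joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
        # Denominator via the law of total probability: P(A) = sum_i P(A | B_i) P(B_i).
        total = sum(joint)
        return [j / total for j in joint]

    # Hypotheses about a team's true win probability, with illustrative priors.
    win_probs = [0.45, 0.55, 0.65]
    priors = [0.25, 0.50, 0.25]

    # Observed data A: the team won 4 of its last 5 games.
    # Likelihood under each hypothesis is the Binomial(5, p) pmf at k = 4.
    likelihoods = [comb(5, 4) * p**4 * (1 - p) for p in win_probs]

    for p, post in zip(win_probs, posterior(priors, likelihoods)):
        print(f"P(true win prob = {p} | data) = {post:.3f}")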

A.2.4 Independence

Events A and B are independent if and only if P(A intersection B) = P(A) * P(B). Equivalently, P(A|B) = P(A). Independence is a strong assumption. In sports betting, game outcomes are often conditionally dependent (through shared variables like weather, roster changes, or market movements), even when they are marginally close to independent. Chapter 29 discusses correlation structures in parlay construction.

A.2.5 Random Variables and Expectation

A random variable X maps outcomes in the sample space to real numbers. For a discrete random variable:

E[X] = sum_x x * P(X = x).

For a continuous random variable with density f(x):

E[X] = integral from -infinity to infinity of x * f(x) dx.

Key properties of expectation:

  • Linearity: E[aX + bY] = aE[X] + bE[Y] (always holds, even without independence).
  • Product rule under independence: If X and Y are independent, E[XY] = E[X]E[Y].
  • Variance formula: Var(X) = E[X^2] - (E[X])^2.
  • Variance of a sum: Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y).
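
These definitions translate directly into code. The sketch below computes the expected value and variance of a single bet from first principles; the win probability and payout are hypothetical.

    # Profit X on a one-unit bet: +b with probability p, -1 with probability q.
    p = 0.55      # win probability (hypothetical)
    b = 0.91      # net payout per unit staked (roughly American odds of -110)
    q = 1 - p

    outcomes = [(b, p), (-1.0, q)]

    ev = sum(x * prob for x, prob in outcomes)                # E[X] = sum_x x P(X = x)
    second_moment = sum(x**2 * prob for x, prob in outcomes)  # E[X^2]
    variance = second_moment - ev**2                          # Var(X) = E[X^2] - E[X]^2

    print(f"EV per unit staked: {ev:+.4f}")   # about +0.0505
    print(f"Variance:           {variance:.4f}")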


A.3 Statistics Review

A.3.1 Key Distributions

Bernoulli(p). Takes value 1 with probability p and 0 with probability 1-p. Mean = p, Variance = p(1-p). Models a single win/loss outcome.

Binomial(n, p). Sum of n independent Bernoulli trials. P(X = k) = C(n,k) * p^k * (1-p)^{n-k}. Mean = np, Variance = np(1-p). Models win counts over a fixed number of bets.

Poisson(lambda). P(X = k) = e^{-lambda} * lambda^k / k!. Mean = Variance = lambda. Models goal/run counts in soccer, hockey, and baseball (Chapters 7, 33, 35).

Normal(mu, sigma^2). The continuous bell curve: f(x) = (1 / (sigma * sqrt(2pi))) * exp(-(x - mu)^2 / (2sigma^2)). The central limit theorem guarantees that sample means approach normality for large n. Point spread models rely heavily on normality assumptions (Chapter 6).

Student's t(nu). Arises when estimating a normal mean with unknown variance. Heavier tails than the normal. Critical for small-sample hypothesis testing.

Beta(a, b). Defined on [0,1]. The conjugate prior for the Bernoulli likelihood, making it the natural distribution for modeling win probabilities in Bayesian analyses (Chapter 9). Mean = a/(a+b).

Gamma(alpha, beta) and Exponential(lambda). The exponential models inter-event times (e.g., time between goals). The gamma generalizes it to the sum of independent exponentials. Used in in-play models (Chapter 25).
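
For quick numerical work with these distributions, scipy.stats provides pmf/pdf, cdf, and moment routines. The parameter values below are arbitrary examples chosen to echo the sports contexts above.

    from scipy import stats

    # Binomial: probability of exactly 55 wins in 100 bets at p = 0.5.
    print(stats.binom.pmf(55, n=100, p=0.5))       # ~ 0.0485

    # Poisson: probability of exactly 3 goals at a rate of 2.7 per game.
    print(stats.poisson.pmf(3, mu=2.7))            # ~ 0.2205

    # Normal: probability a margin modeled as N(3, 13^2) exceeds 0.
    print(stats.norm.sf(0, loc=3, scale=13))       # ~ 0.59

    # Beta: mean win probability under a Beta(12, 8) posterior.
    print(stats.beta.mean(a=12, b=8))              # = 0.6

    # Exponential: probability the next goal arrives within 15 minutes
    # when goals occur at rate 2.7 per 90 minutes (scale = 1/rate).
    print(stats.expon.cdf(15, scale=90 / 2.7))     # ~ 0.36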

A.3.2 Hypothesis Testing

A hypothesis test evaluates whether observed data are consistent with a null hypothesis H_0.

  1. State H_0 and H_a.
  2. Choose significance level alpha (commonly 0.05).
  3. Compute the test statistic from the data.
  4. Determine the p-value: the probability, under H_0, of obtaining a test statistic at least as extreme as the one observed.
  5. Reject H_0 if p-value < alpha.

Common tests used in this book:

  • One-sample z-test: Test whether a bettor's win rate p-hat differs from the break-even rate p_0. z = (p-hat - p_0) / sqrt(p_0*(1-p_0)/n).
  • Two-sample t-test: Compare means of two groups (e.g., model A performance vs. model B).
  • Chi-squared goodness-of-fit: Test whether observed frequency distributions match expected distributions (e.g., calibration checks in Chapter 17).
  • Paired t-test: Compare two models evaluated on the same games.

Multiple testing. When running many simultaneous tests (e.g., evaluating dozens of betting systems), the probability of at least one false positive inflates dramatically. The Bonferroni correction adjusts alpha to alpha/m for m tests. The Benjamini-Hochberg procedure controls the false discovery rate and is preferred in exploratory analyses (Chapter 5).
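
The following sketch implements the one-sample z-test from the list above on a hypothetical betting record, then applies a Bonferroni adjustment as if the record were the best of 20 systems screened simultaneously. All figures are invented for illustration.

    from math import sqrt
    from scipy.stats import norm

    wins, n = 570, 1060
    p0 = 1 / 1.91               # break-even win rate at -110 odds (~0.524)
    p_hat = wins / n

    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = norm.sf(z)        # one-sided test: H_a is p > p0

    print(f"p-hat = {p_hat:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")

    # Multiple testing: with m = 20 systems screened, Bonferroni shrinks the
    # per-test threshold from alpha to alpha / m.
    alpha, m = 0.05, 20
    print(f"Reject at alpha = 0.05?          {p_value < alpha}")
    print(f"Reject after Bonferroni (m=20)?  {p_value < alpha / m}")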

A.3.3 Regression

Simple Linear Regression. y = beta_0 + beta_1 * x + epsilon, where epsilon ~ N(0, sigma^2). The least-squares estimates minimize the sum of squared residuals: beta_1-hat = Cov(x,y)/Var(x); beta_0-hat = y-bar - beta_1-hat * x-bar.

Multiple Linear Regression. y = X * beta + epsilon, solved in matrix form as beta-hat = (X^T X)^{-1} X^T y. Assumptions: linearity, independence of errors, homoscedasticity, normality of errors.
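
The closed-form estimator can be computed in a few lines of numpy on synthetic data. This sketch solves the normal equations (X^T X) beta = X^T y directly; production code typically prefers the QR-based routines discussed in A.4.3 for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n),            # intercept column
                         rng.normal(size=n),    # feature 1
                         rng.normal(size=n)])   # feature 2
    true_beta = np.array([1.0, 2.0, -0.5])
    y = X @ true_beta + rng.normal(scale=0.3, size=n)   # epsilon ~ N(0, 0.3^2)

    # Solve (X^T X) beta = X^T y rather than forming an explicit inverse.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat.round(3))    # close to [1.0, 2.0, -0.5]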

Logistic Regression. For binary outcomes (win/loss), we model the log-odds: log(p/(1-p)) = X * beta. The coefficients are estimated via maximum likelihood. This is the workhorse model for game outcome prediction (Chapters 13, 20-23).

Regularized Regression. Ridge regression adds a penalty lambda * ||beta||^2 to the loss function; Lasso uses lambda * ||beta||_1. Elastic net combines both. These techniques prevent overfitting when the number of predictors is large relative to the sample size (Chapter 13).

A.3.4 Confidence Intervals

A (1 - alpha) confidence interval for a parameter theta is an interval [L, U] constructed from data such that, across repeated samples, the interval covers the true theta a fraction 1 - alpha of the time. It is the interval endpoints, not theta, that vary from sample to sample.

For the mean of a normal population with known variance: x-bar +/- z_{alpha/2} * (sigma / sqrt(n)).

For the mean with unknown variance (small n): x-bar +/- t_{alpha/2, n-1} * (s / sqrt(n)).

For a proportion: p-hat +/- z_{alpha/2} * sqrt(p-hat * (1 - p-hat) / n).
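
Both interval formulas are one-liners in code. The records below (a 540-win sample of 1000 bets, and a 25-bet profit series) are hypothetical.

    from math import sqrt
    from scipy.stats import norm, t

    alpha = 0.05

    # 95% CI for a win-rate proportion: 540 wins in 1000 bets.
    wins, n = 540, 1000
    p_hat = wins / n
    half = norm.ppf(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
    print(f"Proportion CI: [{p_hat - half:.4f}, {p_hat + half:.4f}]")

    # 95% CI for a mean with unknown variance (small n): profit per bet, 25 bets.
    n2, xbar, s = 25, 0.031, 0.95    # sample size, sample mean, sample SD
    half2 = t.ppf(1 - alpha / 2, df=n2 - 1) * s / sqrt(n2)
    print(f"Mean CI:       [{xbar - half2:.4f}, {xbar + half2:.4f}]")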


A.4 Linear Algebra Essentials

A.4.1 Vectors

A vector x in R^n is an ordered n-tuple of real numbers: x = (x_1, x_2, ..., x_n)^T. Operations:

  • Addition: x + y = (x_1 + y_1, ..., x_n + y_n)^T.
  • Scalar multiplication: c * x = (cx_1, ..., cx_n)^T.
  • Dot product: x . y = sum_{i=1}^{n} x_i * y_i. This measures similarity between vectors.
  • Norm: ||x|| = sqrt(x . x). The Euclidean length of the vector.

In machine learning contexts, feature vectors represent games (Chapter 14). The dot product between a feature vector and a weight vector produces a prediction score.

A.4.2 Matrices

A matrix A of dimension m x n has m rows and n columns. Key operations:

  • Multiplication: If A is m x n and B is n x p, then C = A * B is m x p with C_{ij} = sum_{k=1}^{n} A_{ik} * B_{kj}.
  • Transpose: (A^T)_{ij} = A_{ji}. For any conformable matrices, (AB)^T = B^T * A^T.
  • Inverse: If A is square and non-singular, A^{-1} exists such that A * A^{-1} = I (the identity matrix).
  • Determinant: det(A) is a scalar that is zero if and only if A is singular.

A.4.3 Solving Linear Systems

The system A * x = b has a unique solution x = A^{-1} * b when A is invertible. In practice, direct inversion is avoided in favor of more numerically stable methods:

  • LU decomposition: Factor A = L * U (lower and upper triangular), then solve by forward and back substitution.
  • QR decomposition: Factor A = Q * R (orthogonal and upper triangular). Used in least-squares regression.
  • Singular Value Decomposition (SVD): A = U * Sigma * V^T. Provides the best low-rank approximation and is used in dimensionality reduction (Chapter 15).
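
In numpy these methods are a single call each; the sketch below solves a square system (LU with pivoting under the hood) and an overdetermined least-squares system (SVD-based). The matrices are random placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    # Square, invertible system: np.linalg.solve uses an LU factorization.
    A = rng.normal(size=(4, 4))
    b = rng.normal(size=4)
    x = np.linalg.solve(A, b)
    print(np.allclose(A @ x, b))     # True

    # Tall (overdetermined) system: lstsq returns the least-squares solution.
    A_tall = rng.normal(size=(100, 3))
    b_tall = rng.normal(size=100)
    x_ls, residuals, rank, sv = np.linalg.lstsq(A_tall, b_tall, rcond=None)
    print(x_ls)                      # minimizes ||A_tall x - b_tall||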

A.4.4 Eigenvalues and Eigenvectors

If A * v = lambda * v for a nonzero vector v, then lambda is an eigenvalue and v is the corresponding eigenvector. The eigendecomposition of a symmetric matrix A = V * Lambda * V^T is the foundation of Principal Component Analysis (PCA), used in Chapter 15 for feature reduction.
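
A minimal PCA can be built from exactly this decomposition. The sketch below uses synthetic game features; np.linalg.eigh is the appropriate routine because a covariance matrix is symmetric.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 5))           # 500 games x 5 features (synthetic)
    Xc = X - X.mean(axis=0)                 # center each feature

    cov = (Xc.T @ Xc) / (len(Xc) - 1)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh assumes a symmetric matrix

    # eigh returns eigenvalues in ascending order; reverse for largest-first.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:2]]      # top 2 principal directions
    scores = Xc @ components                # project the data onto them

    print((eigvals[order][:2] / eigvals.sum()).round(3))  # variance explained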


A.5 Calculus Reference

A.5.1 Derivatives

The derivative f'(x) = lim_{h -> 0} [f(x+h) - f(x)] / h measures the instantaneous rate of change. Key rules:

  • Power rule: d/dx [x^n] = n * x^{n-1}.
  • Product rule: d/dx [fg] = f'g + f*g'.
  • Chain rule: d/dx [f(g(x))] = f'(g(x)) * g'(x).
  • Exponential: d/dx [e^x] = e^x.
  • Logarithm: d/dx [ln(x)] = 1/x.
  • Sigmoid: d/dx [sigmoid(x)] = sigmoid(x) * (1 - sigmoid(x)).

Partial derivatives. For f(x_1, ..., x_n), the partial derivative df/dx_i treats all other variables as constants. The gradient is the vector of all partial derivatives: nabla f = (df/dx_1, ..., df/dx_n)^T.

A.5.2 Integrals

The definite integral from a to b of f(x) dx computes the signed area under the curve. Key results:

  • integral of x^n dx = x^{n+1}/(n+1) + C, for n != -1.
  • integral of e^x dx = e^x + C.
  • integral of 1/x dx = ln|x| + C.
  • The Gaussian integral: integral from -infinity to infinity of e^{-x^2} dx = sqrt(pi).

Integrals arise in computing expected values, cumulative distribution functions, and Bayesian posterior distributions.
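
When no closed form exists, such integrals are evaluated numerically. As a sanity check of the Gaussian integral above, scipy's adaptive quadrature handles the infinite limits directly:

    import numpy as np
    from scipy.integrate import quad

    value, abs_err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
    print(value, np.sqrt(np.pi))    # both ~ 1.7724538509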

A.5.3 Optimization

To find the maximum or minimum of f(x):

  1. Compute f'(x) and set it equal to zero to find critical points.
  2. Use the second derivative test: f''(x) > 0 at a critical point implies a local minimum; f''(x) < 0 implies a local maximum.

Multivariate optimization. Set nabla f = 0 and check the Hessian matrix H (the matrix of second partial derivatives). If H is positive definite at a critical point, it is a local minimum; if negative definite, a local maximum.

Constrained optimization uses Lagrange multipliers: to optimize f(x) subject to g(x) = 0, solve nabla f = lambda * nabla g along with the constraint.

Gradient descent. When analytical solutions are unavailable, iterative optimization is used: x_{t+1} = x_t - eta * nabla f(x_t), where eta is the learning rate. Variants include stochastic gradient descent (SGD), Adam, and RMSProp (Chapter 14).
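
A bare-bones implementation of this update rule fits in a dozen lines. The quadratic objective below is a toy choice with a known minimum at (3, -1), so convergence is easy to verify.

    import numpy as np

    def grad_f(x):
        # Gradient of f(x) = (x_0 - 3)^2 + 2 (x_1 + 1)^2, minimized at (3, -1).
        return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

    x = np.zeros(2)    # starting point
    eta = 0.1          # learning rate

    for _ in range(200):
        x = x - eta * grad_f(x)    # x_{t+1} = x_t - eta * grad f(x_t)

    print(x.round(6))  # ~ [3, -1]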


A.6 Proofs of Key Results

A.6.1 Derivation of the Kelly Criterion

Setup. A bettor faces a repeated wager with probability p of winning and probability q = 1 - p of losing. A winning bet pays b-to-1 (i.e., a bet of size f returns fb in profit). The bettor seeks the fraction f of bankroll to wager on each bet to maximize long-run growth.

Derivation. After one bet, the bankroll is multiplied by (1 + f*b) with probability p, or by (1 - f) with probability q. After N bets with W wins and L = N - W losses, the bankroll is:

B_N = B_0 * (1 + f*b)^W * (1 - f)^L.

The growth rate per bet is:

G(f) = (1/N) * ln(B_N / B_0) = (W/N) * ln(1 + f*b) + (L/N) * ln(1 - f).

By the law of large numbers, W/N -> p and L/N -> q as N -> infinity. Therefore the expected growth rate is:

G(f) = p * ln(1 + f*b) + q * ln(1 - f).

To maximize, take the derivative with respect to f and set it to zero:

dG/df = p*b / (1 + f*b) - q / (1 - f) = 0.

Solving:

p*b*(1 - f) = q*(1 + f*b)
p*b - p*b*f = q + q*b*f
p*b - q = f*b*(p + q)
p*b - q = f*b        (since p + q = 1)

Therefore:

f* = (p*b - q) / b = p - q/b.

This is the Kelly criterion. Note that f* > 0 only when p*b > q, i.e., when the bet has positive expected value. The Kelly fraction is the edge divided by the odds: f* = edge / b, where edge = p*b - q = p*(b+1) - 1.

Second derivative check: d^2G/df^2 = -pb^2/(1+fb)^2 - q/(1-f)^2 < 0 for all f in (0,1), confirming that f* is a global maximum.
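
The closed form is easy to verify numerically: maximize G(f) on a grid and confirm the peak lands at f* = (p*b - q)/b. The values of p and b below are illustrative.

    import numpy as np

    p, b = 0.55, 1.0    # 55% win probability at even odds (illustrative)
    q = 1 - p

    f_star = (p * b - q) / b
    print(f"closed-form f* = {f_star:.4f}")    # 0.1000

    # Numerical check: G(f) = p ln(1 + f b) + q ln(1 - f) on a fine grid.
    f_grid = np.linspace(0.0, 0.99, 10_000)
    G = p * np.log1p(f_grid * b) + q * np.log1p(-f_grid)
    print(f"grid-search f* = {f_grid[np.argmax(G)]:.4f}")   # ~ 0.1000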

A.6.2 Derivation of the Elo Rating Update

Setup. Two players (or teams) with ratings R_A and R_B compete. The Elo system models the expected score of player A as:

E_A = 1 / (1 + 10^{(R_B - R_A)/400}).

This is a logistic function with base 10 scaled so that a 400-point rating difference corresponds to a 10:1 odds ratio.

Derivation from logistic regression. Define the log-odds of A winning as:

ln(E_A / (1 - E_A)) = c * (R_A - R_B),

where c = ln(10)/400 is the scaling constant. This is equivalent to a logistic regression with a single feature (the rating difference) and a fixed coefficient.

The update rule after a game where A scores S_A (1 for win, 0.5 for draw, 0 for loss) is:

R_A(new) = R_A + K * (S_A - E_A),

where K is the update factor controlling the speed of adaptation.

Justification via gradient descent. Consider the log-loss (cross-entropy) for a single observation:

L = -[S_A * ln(E_A) + (1 - S_A) * ln(1 - E_A)].

Taking the derivative with respect to R_A:

dL/dR_A = -[S_A / E_A - (1 - S_A) / (1 - E_A)] * dE_A/dR_A.

Since E_A = sigmoid(c*(R_A - R_B)), we have dE_A/dR_A = c * E_A * (1 - E_A). Substituting:

dL/dR_A = -c * [S_A * (1 - E_A) - (1 - S_A) * E_A]
        = -c * [S_A - E_A].

A gradient descent step with step size eta gives:

R_A(new) = R_A - eta * dL/dR_A = R_A + eta * c * (S_A - E_A).

Setting K = eta * c recovers the standard Elo update. Thus the Elo update is a single step of stochastic gradient descent on the log-loss objective, with the K-factor controlling the effective learning rate.
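
The update is straightforward to implement. The sketch below applies the standard zero-sum convention (B's rating moves opposite to A's, which the single-rating derivation above does not cover); the K = 20 and the example ratings are illustrative.

    def expected_score(r_a, r_b):
        """E_A = 1 / (1 + 10^((R_B - R_A)/400))."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, s_a, k=20.0):
        """Updated ratings after A scores s_a (1 win, 0.5 draw, 0 loss)."""
        delta = k * (s_a - expected_score(r_a, r_b))
        return r_a + delta, r_b - delta    # zero-sum: B mirrors A's change

    # Example: a 1600-rated team upsets a 1700-rated team.
    r_a, r_b = elo_update(1600.0, 1700.0, s_a=1.0)
    print(round(r_a, 1), round(r_b, 1))    # winner gains ~12.8 points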

A.6.3 Proof That No Betting System Can Overcome Negative Expected Value

Theorem. If each individual bet in a sequence has negative expected value, then no staking strategy (system of varying bet sizes based on past outcomes) can produce positive expected value for the overall sequence.

Proof. Let B_0 be the initial bankroll and B_n the bankroll after n bets. Let X_i denote the profit per unit staked on bet i, so that a stake of s_i yields profit s_i * X_i, and assume E[X_i | history up to bet i-1] < 0. The stake s_i >= 0 may depend on the history of outcomes, but the conditional expected profit E[s_i * X_i | history] = s_i * E[X_i | history] is negative whenever s_i > 0. The total expected profit is:

E[B_n - B_0] = E[sum_{i=1}^{n} s_i * X_i] = sum_{i=1}^{n} E[E[s_i * X_i | history]] < 0,

by the tower property of conditional expectation, provided the bettor places at least one bet. No rearrangement of bet sizes, no doubling strategy (Martingale), and no stop-loss rule can change the sign of the overall expectation. This fundamental result, a consequence of the optional stopping theorem for supermartingales, underscores why edge identification is the irreducible core of profitable betting.
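
A Monte Carlo sketch makes the theorem tangible: a doubling (Martingale) staking plan on a negative-EV bet still loses on average. The win probability, payout, session length, and bankroll below are arbitrary illustrative choices.

    import random

    def martingale_session(p=0.48, b=1.0, n_bets=100, bankroll=1000.0):
        """Double the stake after each loss; return the session's profit."""
        start, stake = bankroll, 1.0
        for _ in range(n_bets):
            stake = min(stake, bankroll)   # cannot stake more than we hold
            if stake <= 0:
                break                      # busted
            if random.random() < p:
                bankroll += stake * b      # win: collect net payout...
                stake = 1.0                # ...and reset to the base stake
            else:
                bankroll -= stake          # loss: double the next stake
                stake *= 2.0
        return bankroll - start

    random.seed(0)
    profits = [martingale_session() for _ in range(20_000)]
    print(f"mean profit per session: {sum(profits) / len(profits):+.2f}")  # < 0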


This appendix is intended as a reference companion to the main text. For deeper treatment of any topic, consult the sources listed in Appendix H: Bibliography.