Appendix C: Math Reference for Machine Learning

This appendix collects the mathematical foundations used throughout the book. Every formula appears with its plain-English interpretation and, where helpful, its numpy implementation. This is a reference, not a textbook — look up what you need, skip what you don't.


Gradient Descent

Gradient descent finds the parameters theta that minimize a loss function L(theta) by iteratively stepping in the direction of steepest decrease.

Update Rule

theta_new = theta_old - alpha * gradient(L(theta_old))

Where alpha is the learning rate.

Plain English: Look at the slope of the loss function at your current position. Step downhill. Repeat until the ground is flat.

Variants

Batch gradient descent computes the gradient using the entire dataset:

theta = theta - alpha * (1/m) * sum(gradient_i for i in all_samples)

Stochastic gradient descent (SGD) uses one random sample per step:

theta = theta - alpha * gradient_i

Mini-batch gradient descent uses a small batch (typically 32-256 samples):

theta = theta - alpha * (1/batch_size) * sum(gradient_i for i in batch)

Numpy implementation (batch gradient descent for linear regression):

import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # MSE gradient: (2/m) * X^T (X @ theta - y)
        gradient = (2 / m) * X.T @ (X @ theta - y)
        theta = theta - lr * gradient
    return theta

Learning Rate Effects

  • Too large: loss oscillates or diverges
  • Too small: convergence is painfully slow, and optimization may settle in a poor local minimum
  • Common starting values: 0.001, 0.01, 0.1
  • Learning rate schedules reduce alpha over time: alpha_t = alpha_0 / (1 + decay * t)
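The decay schedule in the last bullet is a one-liner; a minimal sketch (decayed_lr is a name used here for illustration):

```python
def decayed_lr(alpha_0, decay, t):
    """Inverse-time decay: alpha_t = alpha_0 / (1 + decay * t)."""
    return alpha_0 / (1 + decay * t)

# The step size halves by t = 1/decay:
lr_start = decayed_lr(0.1, 0.01, 0)    # 0.1
lr_later = decayed_lr(0.1, 0.01, 100)  # 0.05
```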

Loss Function Catalog

Mean Squared Error (MSE) -- Regression

MSE = (1/m) * sum((y_i - y_hat_i)^2)

Plain English: Average the squared differences between predictions and actual values. Squaring penalizes large errors heavily.

Gradient: dL/d(y_hat) = (2/m) * (y_hat - y)

mse = np.mean((y_pred - y_true) ** 2)

Mean Absolute Error (MAE) -- Regression

MAE = (1/m) * sum(|y_i - y_hat_i|)

Plain English: Average the absolute differences. More robust to outliers than MSE because it doesn't square the errors.

mae = np.mean(np.abs(y_pred - y_true))

Huber Loss -- Regression

L(r) = 0.5 * r^2              if |r| <= delta
L(r) = delta * |r| - 0.5 * delta^2   if |r| > delta

Where r = y - y_hat.

Plain English: Acts like MSE for small errors and MAE for large errors. The parameter delta controls the transition point. Best of both worlds: smooth gradient near zero, robust to outliers.
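The piecewise definition translates directly to numpy; a sketch (huber_loss is an illustrative name, with delta defaulting to 1.0):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = y_true - y_pred
    quadratic = 0.5 * r ** 2
    linear = delta * np.abs(r) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))
```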

Cross-Entropy / Log-Loss -- Binary Classification

L = -(1/m) * sum(y_i * log(p_i) + (1 - y_i) * log(1 - p_i))

Where p_i is the predicted probability and y_i is the true label (0 or 1).

Plain English: Measures how far your predicted probabilities are from the true labels. A confident wrong prediction (predict 0.99 when the answer is 0) is punished severely.

Gradient (with respect to p): dL/dp = -(y/p) + (1-y)/(1-p)

epsilon = 1e-15  # Prevent log(0)
p = np.clip(y_pred_proba, epsilon, 1 - epsilon)
logloss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

Multi-Class Cross-Entropy -- Multi-Class Classification

L = -(1/m) * sum_i(sum_k(y_ik * log(p_ik)))

Where k indexes classes, y_ik is 1 if sample i belongs to class k, and p_ik is the predicted probability.
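A numpy sketch using the same clipping trick as the binary version (multiclass_cross_entropy is a name used here for illustration; y_onehot is an (m, k) one-hot label matrix):

```python
import numpy as np

def multiclass_cross_entropy(y_onehot, p, eps=1e-15):
    """y_onehot: (m, k) one-hot labels; p: (m, k) predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # Prevent log(0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))
```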

Hinge Loss -- SVM

L = (1/m) * sum(max(0, 1 - y_i * f(x_i)))

Where y_i is in {-1, +1} and f(x_i) is the raw model output (not a probability).

Plain English: Zero loss for correct predictions that are confident enough (margin > 1). Penalizes predictions that are correct but inside the margin, or incorrect. This is what produces the support vector property — only points near the boundary matter.
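A one-function numpy sketch (hinge_loss is an illustrative name):

```python
import numpy as np

def hinge_loss(y, scores):
    """y in {-1, +1}; scores are raw margins f(x), not probabilities."""
    return np.mean(np.maximum(0, 1 - y * scores))
```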


Regularization

L1 Regularization (Lasso)

L_total = L_data + lambda * sum(|theta_j|)

Plain English: Add the sum of absolute coefficient values to the loss. Large lambda drives coefficients to exactly zero, performing feature selection. The resulting model is sparse.

Effect on gradient: Adds lambda * sign(theta_j) to the gradient for each parameter. The sign function creates a discontinuity at zero, which is why coefficients hit exactly zero.

L2 Regularization (Ridge)

L_total = L_data + lambda * sum(theta_j^2)

Plain English: Add the sum of squared coefficient values to the loss. Shrinks all coefficients toward zero but never reaches exactly zero. Distributes weight across correlated features.

Effect on gradient: Adds 2 * lambda * theta_j to the gradient. Smooth, proportional shrinkage.

Elastic Net

L_total = L_data + lambda * (rho * sum(|theta_j|) + (1 - rho) * sum(theta_j^2))

Where rho (l1_ratio) controls the mix: rho=1 is pure L1, rho=0 is pure L2.

Plain English: Get the feature selection of Lasso and the coefficient stability of Ridge in one penalty. Useful when features are correlated — Lasso arbitrarily picks one correlated feature, while Elastic Net spreads the weight.
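The penalty itself is easy to compute by hand, which makes the rho knob concrete; a sketch (elastic_net_penalty is a name used here for illustration):

```python
import numpy as np

def elastic_net_penalty(theta, lam, rho):
    """rho=1 -> pure L1 (Lasso); rho=0 -> pure L2 (Ridge)."""
    l1 = np.sum(np.abs(theta))
    l2 = np.sum(theta ** 2)
    return lam * (rho * l1 + (1 - rho) * l2)
```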

Regularization Strength in scikit-learn

Different estimators parameterize regularization differently:

Estimator                   Parameter                        Interpretation
Ridge, Lasso, ElasticNet    alpha                            Higher alpha = stronger regularization
LogisticRegression, SVC     C                                Higher C = weaker regularization (C = 1/alpha)
XGBoost                     reg_alpha (L1), reg_lambda (L2)  Higher = stronger regularization

Probability Distributions

Normal (Gaussian) Distribution

f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

Parameters: mu (mean), sigma (standard deviation).

Plain English: The bell curve. Centered at the mean, spread controlled by the standard deviation. Sums and averages of many independent effects tend toward this shape (the Central Limit Theorem), which is why it appears so often in practice.

from scipy.stats import norm

# PDF, CDF, and sampling
x = np.linspace(-4, 4, 100)
pdf = norm.pdf(x, loc=0, scale=1)
cdf = norm.cdf(1.96)  # 0.975 -- 97.5th percentile
samples = norm.rvs(loc=0, scale=1, size=1000)

Bernoulli Distribution

P(X = 1) = p
P(X = 0) = 1 - p

Plain English: A single coin flip. The simplest distribution: one trial, two outcomes. The building block of binary classification.

Binomial Distribution

P(X = k) = C(n, k) * p^k * (1-p)^(n-k)

Parameters: n (number of trials), p (probability of success), k (number of successes).

Plain English: The number of heads in n coin flips. Used in A/B testing: if the true conversion rate is p, what is the probability of seeing k conversions in n visitors?

Poisson Distribution

P(X = k) = (lambda^k * e^(-lambda)) / k!

Parameter: lambda (average rate of events).

Plain English: Counts of rare events in a fixed interval. How many support tickets per day? How many equipment failures per month? Lambda is both the mean and the variance.

Exponential Distribution

f(x) = lambda * e^(-lambda * x),   x >= 0

Plain English: Time between events in a Poisson process. How long until the next equipment failure? Memoryless: the probability of failure in the next hour doesn't depend on how long the equipment has been running.
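The four discrete/continuous distributions above map onto scipy.stats one-liners; the scenarios in the comments are illustrative:

```python
from scipy.stats import bernoulli, binom, poisson, expon

# Bernoulli: P(X = 1) for a biased coin
p_heads = bernoulli.pmf(1, p=0.3)   # 0.3

# Binomial: P(exactly 5 conversions in 100 visitors at rate 0.03)
p_conv = binom.pmf(5, n=100, p=0.03)

# Poisson: P(more than 10 tickets today if the average is 6/day)
p_busy = 1 - poisson.cdf(10, mu=6)

# Exponential: P(next failure within 2 hours, mean time between failures 5 hours)
p_fail = expon.cdf(2, scale=5)      # scale = 1/lambda
```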


Bayes' Theorem

P(A|B) = P(B|A) * P(A) / P(B)

In ML terminology:

P(class|features) = P(features|class) * P(class) / P(features)
     posterior    =     likelihood    *   prior   /  evidence

Plain English: Update your beliefs with evidence. Start with a prior probability (how likely is this patient to be readmitted?), observe evidence (lab results, age, diagnosis), and compute the posterior probability (how likely is readmission given this evidence?).

Naive Bayes applies this by assuming features are independent:

P(class|x_1, x_2, ..., x_n) proportional to P(class) * product(P(x_i|class))

This assumption is almost always wrong. The classifier often works anyway because it only needs to rank classes correctly, not estimate probabilities accurately.
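The posterior computation can be sketched in a few lines of numpy (naive_bayes_posterior is a name used here for illustration; the likelihood values are assumed to be given):

```python
import numpy as np

def naive_bayes_posterior(prior, likelihoods):
    """prior: (k,) class priors; likelihoods: (k, n) with P(x_i | class)
    for each observed feature value. Returns the normalized posterior."""
    unnorm = prior * np.prod(likelihoods, axis=1)  # P(class) * product(P(x_i|class))
    return unnorm / unnorm.sum()                   # Divide by the evidence
```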


Linear Algebra Essentials

Vectors and Dot Products

a . b = sum(a_i * b_i) = |a| * |b| * cos(theta)

Plain English: The dot product measures how aligned two vectors are. Positive: same direction. Zero: perpendicular. Negative: opposite. SVMs use dot products to measure similarity; the kernel trick computes dot products in transformed spaces.

dot = np.dot(a, b)       # or a @ b
cosine_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

Matrix Multiplication

C = A @ B
C[i][j] = sum(A[i][k] * B[k][j] for all k)

Requirements: A is (m x n), B is (n x p), result C is (m x p). The inner dimensions must match.

In ML: Linear regression predictions are a matrix multiplication: y_hat = X @ theta, where X is (m x n) features and theta is (n x 1) parameters.

C = A @ B          # Matrix multiplication
y_hat = X @ theta  # Linear model prediction

Transpose

A^T: swap rows and columns. A[i][j] becomes A^T[j][i]

Key properties:
  • (A @ B)^T = B^T @ A^T
  • (A^T)^T = A

A_T = A.T

Inverse

A^(-1): the matrix such that A @ A^(-1) = I (identity matrix)

In ML: The closed-form solution for linear regression: theta = (X^T @ X)^(-1) @ X^T @ y. In practice, we don't compute the inverse directly (numerically unstable); we use gradient descent or np.linalg.solve().

theta = np.linalg.solve(X.T @ X, X.T @ y)  # Better than inv()

Eigenvalues and Eigenvectors

A @ v = lambda * v

Where v is the eigenvector and lambda is the eigenvalue.

Plain English: An eigenvector is a direction that the matrix only stretches (doesn't rotate). The eigenvalue is the stretching factor. In PCA, the eigenvectors of the covariance matrix are the principal components, and the eigenvalues indicate how much variance each component explains.

eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
# Sort by eigenvalue (descending)
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]

Information Theory

Entropy

H(X) = -sum(p_i * log2(p_i))

Plain English: Measures the uncertainty or "surprise" in a distribution. A fair coin has maximum entropy (1 bit). A coin that always lands heads has zero entropy. Decision trees minimize entropy (or equivalently, maximize information gain) at each split.

from scipy.stats import entropy as scipy_entropy
H = scipy_entropy(probabilities, base=2)

Gini Impurity

Gini = 1 - sum(p_i^2)

Plain English: The probability that a randomly chosen sample would be misclassified if labeled according to the distribution. scikit-learn's default for decision trees. Computationally simpler than entropy and produces similar splits in practice.

gini = 1 - np.sum(probabilities ** 2)

Information Gain

IG(parent, split) = H(parent) - weighted_avg(H(children))

Plain English: How much entropy the split reduces. A good split produces children with lower entropy (purer groups) than the parent.
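A sketch of the split computation (entropy and information_gain are names used here for illustration; labels are class arrays):

```python
import numpy as np

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """children: list of label arrays produced by the split."""
    m = len(parent)
    weighted = sum(len(c) / m * entropy(c) for c in children)
    return entropy(parent) - weighted
```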

Mutual Information

MI(X, Y) = sum_x(sum_y(p(x,y) * log(p(x,y) / (p(x) * p(y)))))

Plain English: How much knowing X tells you about Y (and vice versa). Zero means independence. Used in feature selection as a more general alternative to correlation — captures non-linear relationships.

from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y, random_state=42)

Evaluation Metrics: Formulas

Classification

Accuracy    = (TP + TN) / (TP + TN + FP + FN)
Precision   = TP / (TP + FP)
Recall      = TP / (TP + FN)
F1          = 2 * Precision * Recall / (Precision + Recall)
Specificity = TN / (TN + FP)

Confusion matrix layout:

                Predicted Positive    Predicted Negative
Actual Positive       TP                    FN
Actual Negative       FP                    TN
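The formulas above can be computed from raw binary labels in a few lines (classification_metrics is a name used here for illustration):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Binary labels in {0, 1}; returns the metrics above as a dict."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }
```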

Regression

MSE  = (1/m) * sum((y_i - y_hat_i)^2)
RMSE = sqrt(MSE)
MAE  = (1/m) * sum(|y_i - y_hat_i|)
MAPE = (1/m) * sum(|y_i - y_hat_i| / |y_i|) * 100
R^2  = 1 - (sum((y_i - y_hat_i)^2) / sum((y_i - y_bar)^2))

R-squared interpretation: The proportion of variance in y explained by the model. R-squared = 0.85 means the model explains 85% of the variance. Can be negative for very bad models (worse than predicting the mean).
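The regression formulas above, collected into one sketch (regression_metrics is an illustrative name; MAPE assumes no y_i is zero):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(err)),
        "mape": np.mean(np.abs(err) / np.abs(y_true)) * 100,
        "r2": 1 - ss_res / ss_tot,
    }
```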

Ranking

DCG@k  = sum_{i=1}^{k} (2^rel_i - 1) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k

Where IDCG@k is the DCG of the ideal (perfect) ranking.

Plain English: NDCG measures how good a ranked list is, with items near the top weighted more heavily. A recommendation engine that puts the best items first scores near 1.0.
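A numpy sketch of both quantities (dcg_at_k and ndcg_at_k are names used here for illustration; relevances are graded, e.g. 0-3):

```python
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1) for i = 1..k
    return np.sum((2 ** rel - 1) / discounts)

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)         # Best-first ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```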


Distance Metrics

Euclidean Distance

d(a, b) = sqrt(sum((a_i - b_i)^2))

The straight-line distance. Sensitive to scale — always standardize features first when using this metric.

Manhattan Distance

d(a, b) = sum(|a_i - b_i|)

Distance measured along grid lines. Less sensitive to outliers in single dimensions than Euclidean.

Cosine Distance

d(a, b) = 1 - (a . b) / (|a| * |b|)

Measures the angle between vectors, ignoring magnitude. Standard for text similarity (TF-IDF vectors) and recommendation systems.
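A quick numpy check of the three formulas so far (the vectors are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.linalg.norm(a - b)   # sqrt(3^2 + 4^2 + 0^2) = 5.0
manhattan = np.sum(np.abs(a - b))   # 3 + 4 + 0 = 7.0
cosine_d = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```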

Mahalanobis Distance

d(x, mu) = sqrt((x - mu)^T * S^(-1) * (x - mu))

Where S is the covariance matrix.

Plain English: Euclidean distance that accounts for the shape (covariance) of the data. A point 3 units away in a high-variance direction is less anomalous than a point 3 units away in a low-variance direction.

from scipy.spatial.distance import mahalanobis
d = mahalanobis(x, mu, np.linalg.inv(cov_matrix))

The Softmax Function

softmax(z_i) = exp(z_i) / sum(exp(z_j) for all j)

Plain English: Converts a vector of raw scores (logits) into a probability distribution. Each output is between 0 and 1, and all outputs sum to 1. Used in multi-class classification to turn model outputs into class probabilities.

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()

The Sigmoid Function

sigma(z) = 1 / (1 + exp(-z))

Plain English: Squashes any real number into the range (0, 1). The link function in logistic regression: maps the linear combination of features to a probability. Sigmoid(0) = 0.5; large positive inputs approach 1; large negative inputs approach 0.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Derivative: sigma'(z) = sigma(z) * (1 - sigma(z)). This is why logistic regression has a clean gradient for optimization.


Key Inequalities and Rules of Thumb

  • Bias-variance decomposition: Expected Error = Bias^2 + Variance + Irreducible Noise
  • Bonferroni correction: When running k tests, use significance level alpha/k per test
  • VIF threshold: VIF > 5 means moderate multicollinearity; VIF > 10 means severe
  • PSI thresholds: < 0.1 stable; 0.1-0.25 moderate drift; > 0.25 significant drift
  • Silhouette score: > 0.7 strong structure; 0.5-0.7 reasonable; 0.25-0.5 weak; < 0.25 no meaningful structure
  • Sample size for A/B tests: n per group = 16 * sigma^2 / delta^2 (for power=0.8, alpha=0.05), where delta is the minimum detectable effect
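The last rule of thumb can be sketched directly (ab_sample_size is a name used here for illustration; it rounds up since you can't enroll a fraction of a user):

```python
import math

def ab_sample_size(sigma, delta):
    """n per group from the rule of thumb above (power = 0.8, alpha = 0.05)."""
    return math.ceil(16 * sigma ** 2 / delta ** 2)

# Detecting a 2-point lift on a metric with standard deviation 10:
n_per_group = ab_sample_size(sigma=10, delta=2)  # 400
```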