
Chapter 1: Linear Algebra for Machine Learning

The Operations That Power Every Model


Learning Objectives

By the end of this chapter, you will be able to:

  1. Represent datasets, model parameters, and transformations as vectors and matrices
  2. Perform and interpret eigendecomposition and connect it to PCA and spectral methods
  3. Derive and implement SVD, explaining its role in dimensionality reduction and matrix factorization
  4. Apply matrix calculus (vector/matrix derivatives) to derive gradient expressions for ML models
  5. Implement key linear algebra operations in numpy and connect them to PyTorch tensor operations

1.1 — Why Linear Algebra Is the Language of Machine Learning

Every machine learning algorithm you have ever used is a sequence of matrix operations. When you call model.fit(X, y) in scikit-learn, the implementation multiplies matrices, solves linear systems, and computes decompositions. When you train a transformer with 175 billion parameters, each forward pass is a cascade of matrix multiplications — nothing more, nothing less. When you compute PCA, you are computing an eigendecomposition. When you build a recommender system, you are factoring a matrix. When you run backpropagation, you are computing a chain of Jacobian-vector products.

This is not a metaphor. It is literally what the code does.

The intermediate textbook gave you vectors, matrices, and dot products. You learned that matrix multiplication is not commutative, that the transpose swaps rows and columns, and that the inverse undoes a transformation. Those are correct and necessary facts, but they are the grammar of a language you have not yet learned to speak fluently.

This chapter teaches you to see the matrices.

We begin with a rapid review of vector spaces and subspaces — not because the definitions are new, but because the implications are new. The null space of a matrix tells you which inputs produce zero output; in machine learning, that tells you which perturbations to your features a model literally cannot detect. The rank of your data matrix tells you the intrinsic dimensionality of your data, regardless of how many columns it has.

From there, we build to the three pillars that every advanced practitioner must master:

  1. Eigendecomposition — the X-ray of a square matrix, revealing its fundamental modes of action
  2. Singular Value Decomposition — the fundamental theorem of data science, applicable to any matrix
  3. Matrix calculus — the language that backpropagation speaks, and the mathematical apparatus behind every gradient you have ever computed

Each concept receives three treatments: the intuition, the mathematics, and the implementation. By the end of this chapter, you will read linear algebra the way a musician reads sheet music — not translating symbol by symbol, but hearing the structure directly.


1.2 — Vector Spaces and Subspaces: The Stage

We begin with the stage on which all of machine learning is performed.

A vector space $V$ over the real numbers $\mathbb{R}$ is a set equipped with two operations — vector addition and scalar multiplication — satisfying the usual axioms (closure, associativity, commutativity of addition, existence of additive identity and inverses, distributivity, and compatibility of scalar multiplication). The canonical example is $\mathbb{R}^n$: the space of $n$-tuples of real numbers.

In machine learning, everything is a vector in some space:

  • A data point with $d$ features is a vector $\mathbf{x} \in \mathbb{R}^d$
  • A dataset of $n$ data points is a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, where each row is a data point
  • The weights of a neural network layer are a matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$
  • A color image of size $224 \times 224$ is a tensor in $\mathbb{R}^{3 \times 224 \times 224}$

Common Misconception: "Vectors are arrows in 2D or 3D space." Arrows are one representation. In machine learning, a "vector" is any element of a vector space — a 768-dimensional word embedding, a 50,257-dimensional probability distribution over tokens, a flattened image with 150,528 entries. Geometric intuition from $\mathbb{R}^2$ is useful but can mislead in high dimensions. In high-dimensional spaces, almost all pairs of random vectors are nearly orthogonal — a fact with no analogue in our three-dimensional experience.
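This near-orthogonality is easy to verify empirically — a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a pair of random vectors in increasing dimension and measure the angle
for d in [2, 10, 1000, 100_000]:
    u = rng.standard_normal(d)
    v = rng.standard_normal(d)
    cos_angle = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    print(f"d = {d:>7d}: cos(angle) = {cos_angle:+.4f}")

# As d grows, cos(angle) concentrates near 0: the typical magnitude
# shrinks like 1/sqrt(d), so random high-dimensional vectors are
# nearly orthogonal with overwhelming probability.
```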

Basis, Span, and Linear Independence

A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is linearly independent if no vector in the set can be written as a linear combination of the others. Equivalently, the only solution to $c_1 \mathbf{v}_1 + \cdots + c_k \mathbf{v}_k = \mathbf{0}$ is $c_1 = \cdots = c_k = 0$.

The span of these vectors is the set of all linear combinations: $\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \{c_1 \mathbf{v}_1 + \cdots + c_k \mathbf{v}_k : c_i \in \mathbb{R}\}$.

A basis for a vector space is a linearly independent set that spans the entire space. The number of vectors in any basis is the dimension of the space. The standard basis for $\mathbb{R}^n$ consists of the $n$ one-hot vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$.

Why this matters for ML: PCA finds a new basis for your data — one in which the first basis vector captures the maximum variance, the second captures the maximum remaining variance, and so on. Choosing a good basis is the entire game of representation learning.

import numpy as np

# The standard basis for R^3
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
e3 = np.array([0.0, 0.0, 1.0])

# Any vector is a linear combination of basis vectors
v = np.array([3.0, -1.0, 4.0])
assert np.allclose(v, 3*e1 + (-1)*e2 + 4*e3)

# Check linear independence via rank
vectors = np.stack([e1, e2, e3])
print(f"Rank: {np.linalg.matrix_rank(vectors)}")  # 3 — full rank, linearly independent

The Rank of a Matrix

The rank of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the dimension of its column space (equivalently, its row space). It tells you the number of truly independent pieces of information the matrix carries. A matrix with $n$ columns but rank $k < n$ has $n - k$ redundant dimensions.

def analyze_rank(X: np.ndarray) -> dict:
    """Analyze the rank and dimensionality of a data matrix.

    Args:
        X: Data matrix of shape (n_samples, n_features).

    Returns:
        Dictionary with rank analysis results.
    """
    n, d = X.shape
    rank = np.linalg.matrix_rank(X)
    return {
        "n_samples": n,
        "n_features": d,
        "rank": rank,
        "max_possible_rank": min(n, d),
        "redundant_dimensions": d - rank,
        "rank_deficient": rank < min(n, d),
    }

# Example: 100 samples in 10-D, but only 3 independent directions
rng = np.random.default_rng(42)
base = rng.standard_normal((100, 3))
mixing = rng.standard_normal((3, 10))
X = base @ mixing + 0.01 * rng.standard_normal((100, 10))  # Nearly rank-3

info = analyze_rank(X)
print(info)
# {'n_samples': 100, 'n_features': 10, 'rank': 10, ...}
# But the effective rank (via singular values) reveals the truth — see Section 1.4

Production Reality: In real datasets, exact rank deficiency is rare because noise gives every dimension some signal. The practical concept is effective rank — the number of singular values significantly larger than zero. We will make this precise in Section 1.4 when we develop SVD.
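To preview the idea, the singular values of the nearly rank-3 matrix constructed above separate cleanly into signal and noise (the 0.02 relative threshold here is an illustrative choice, not a standard):

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.standard_normal((100, 3))
mixing = rng.standard_normal((3, 10))
X = base @ mixing + 0.01 * rng.standard_normal((100, 10))  # Nearly rank-3

# Singular values: three large ones, then a sharp drop to noise level
sigma = np.linalg.svd(X, compute_uv=False)
print("Singular values:", sigma.round(3))

# Effective rank: singular values above a noise-aware relative threshold
threshold = 0.02 * sigma[0]
print("Effective rank:", int(np.sum(sigma > threshold)))  # 3
```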

The Four Fundamental Subspaces

For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, there are four fundamental subspaces (following Strang, 1993):

  1. Column space $C(\mathbf{A}) \subseteq \mathbb{R}^m$: the set of all possible outputs $\mathbf{Ax}$. If $\mathbf{b} \in C(\mathbf{A})$, the system $\mathbf{Ax} = \mathbf{b}$ has a solution.
  2. Null space $N(\mathbf{A}) \subseteq \mathbb{R}^n$: the set of all $\mathbf{x}$ such that $\mathbf{Ax} = \mathbf{0}$. These are the inputs the matrix annihilates.
  3. Row space $C(\mathbf{A}^\top) \subseteq \mathbb{R}^n$: the column space of $\mathbf{A}^\top$.
  4. Left null space $N(\mathbf{A}^\top) \subseteq \mathbb{R}^m$: the null space of $\mathbf{A}^\top$.

The key relationships:

  • $C(\mathbf{A})$ and $N(\mathbf{A}^\top)$ are orthogonal complements in $\mathbb{R}^m$
  • $C(\mathbf{A}^\top)$ and $N(\mathbf{A})$ are orthogonal complements in $\mathbb{R}^n$
  • $\dim(C(\mathbf{A})) = \dim(C(\mathbf{A}^\top)) = \text{rank}(\mathbf{A}) = r$
  • $\dim(N(\mathbf{A})) = n - r$ and $\dim(N(\mathbf{A}^\top)) = m - r$

           R^n                                    R^m
  ┌─────────────────────┐             ┌─────────────────────┐
  │                     │             │                     │
  │   Row Space         │    A        │   Column Space      │
  │   C(A^T)        ────┼────────────>│   C(A)              │
  │   dim = r           │             │   dim = r           │
  │                     │             │                     │
  │ ─ ─ ─ ─ ─ ─ ─ ─ ─  │             │ ─ ─ ─ ─ ─ ─ ─ ─ ─  │
  │                     │    A        │                     │
  │   Null Space        │────────────>│   Left Null Space   │
  │   N(A)              │   → 0      │   N(A^T)            │
  │   dim = n - r       │             │   dim = m - r       │
  │                     │             │                     │
  └─────────────────────┘             └─────────────────────┘

ML interpretation: Consider a linear model $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix:

  • The column space of $\mathbf{X}$ is the set of all predictions the model can make. If $\mathbf{y}$ is not in $C(\mathbf{X})$, the model cannot perfectly fit the data.
  • The null space of $\mathbf{X}$ contains weight vectors that produce zero predictions on the training data: $\mathbf{Xw} = \mathbf{0}$. If the null space is nontrivial, the model has infinitely many equally good solutions — this is exactly the setting where regularization becomes essential.
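A minimal sketch of a nontrivial null space, using a deliberately duplicated feature column (a constructed example for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data matrix whose last column duplicates the first: rank-deficient by design
X = rng.standard_normal((50, 3))
X = np.column_stack([X, X[:, 0]])  # shape (50, 4), rank 3

# w_null lies in the null space: X @ w_null = 0
w_null = np.array([1.0, 0.0, 0.0, -1.0])
print("||X @ w_null|| =", np.linalg.norm(X @ w_null))  # 0.0

# Any weight vector can be shifted along w_null without changing predictions —
# infinitely many "solutions" fit the training data equally well
w = np.array([0.5, -2.0, 1.0, 3.0])
assert np.allclose(X @ w, X @ (w + 7.3 * w_null))
```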

1.3 — Eigendecomposition: The X-Ray of a Matrix

The eigendecomposition reveals the fundamental modes of action of a square matrix. If a matrix is a machine that transforms vectors, eigendecomposition tells you what directions are preserved and how much they are scaled.

Definition and Computation

For a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, a scalar $\lambda$ is an eigenvalue and a nonzero vector $\mathbf{v}$ is the corresponding eigenvector if:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

The matrix acts on $\mathbf{v}$ by simply scaling it. The eigenvector's direction is preserved; only its magnitude changes (and possibly its sign, if $\lambda < 0$).

If $\mathbf{A}$ has $n$ linearly independent eigenvectors, we can write the eigendecomposition:

$$\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}$$

where $\mathbf{V} = [\mathbf{v}_1 | \cdots | \mathbf{v}_n]$ is the matrix of eigenvectors (as columns) and $\mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of eigenvalues.

def eigendecomposition_demo(A: np.ndarray) -> None:
    """Compute and verify the eigendecomposition of a square matrix.

    Args:
        A: Square matrix of shape (n, n).
    """
    eigenvalues, eigenvectors = np.linalg.eig(A)

    print("Eigenvalues:", eigenvalues)
    print("Eigenvectors (columns):\n", eigenvectors)

    # Verify: A @ v = lambda * v for each eigenpair
    for i in range(len(eigenvalues)):
        lhs = A @ eigenvectors[:, i]
        rhs = eigenvalues[i] * eigenvectors[:, i]
        assert np.allclose(lhs, rhs), f"Eigenpair {i} failed verification"

    # Verify: A = V @ Lambda @ V^{-1}
    Lambda = np.diag(eigenvalues)
    V = eigenvectors
    A_reconstructed = V @ Lambda @ np.linalg.inv(V)
    assert np.allclose(A, A_reconstructed), "Decomposition failed"
    print("Eigendecomposition verified: A = V Λ V^{-1}")

# 2x2 example
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
eigendecomposition_demo(A)

The Spectral Theorem for Symmetric Matrices

In machine learning, most matrices we decompose are symmetric (or more precisely, real symmetric): $\mathbf{A} = \mathbf{A}^\top$. Covariance matrices, kernel matrices, and Hessian matrices are all symmetric. The spectral theorem gives us extraordinarily strong guarantees for these matrices:

Mathematical Foundation: Spectral Theorem. If $\mathbf{A} \in \mathbb{R}^{n \times n}$ is symmetric, then:

  1. All eigenvalues of $\mathbf{A}$ are real.
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. $\mathbf{A}$ has a complete set of $n$ orthonormal eigenvectors.
  4. $\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top$, where $\mathbf{Q}$ is orthogonal ($\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$).

This means $\mathbf{Q}^{-1} = \mathbf{Q}^\top$ — the inverse is free. (In the notation of the general eigendecomposition, $\mathbf{V}^{-1} = \mathbf{V}^\top$ whenever the eigenvectors are orthonormal.)

The spectral decomposition $\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^\top$ can be written as a sum of rank-1 matrices:

$$\mathbf{A} = \sum_{i=1}^n \lambda_i \mathbf{q}_i \mathbf{q}_i^\top$$

Each term $\lambda_i \mathbf{q}_i \mathbf{q}_i^\top$ is a rank-1 matrix that projects onto the direction $\mathbf{q}_i$ and scales by $\lambda_i$. The eigendecomposition decomposes a matrix into its fundamental modes.

def spectral_decomposition(A_symmetric: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compute the spectral decomposition of a symmetric matrix.

    Args:
        A_symmetric: Symmetric matrix of shape (n, n).

    Returns:
        Tuple of (eigenvalues, eigenvectors) where eigenvectors are orthonormal columns.
    """
    # np.linalg.eigh is specialized for symmetric (Hermitian) matrices
    # It guarantees real eigenvalues and orthonormal eigenvectors
    eigenvalues, Q = np.linalg.eigh(A_symmetric)

    # eigh returns eigenvalues in ascending order; reverse for descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    Q = Q[:, idx]

    # Verify orthogonality: Q^T Q = I
    assert np.allclose(Q.T @ Q, np.eye(len(eigenvalues))), "Eigenvectors not orthonormal"

    # Verify reconstruction: A = Q Lambda Q^T
    A_reconstructed = Q @ np.diag(eigenvalues) @ Q.T
    assert np.allclose(A_symmetric, A_reconstructed), "Reconstruction failed"

    return eigenvalues, Q

# The covariance matrix is always symmetric
rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 5))
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)  # Sample covariance (center first)

eigenvalues, Q = spectral_decomposition(cov)
print("Eigenvalues of covariance matrix:", eigenvalues)
# These are the variances along the principal component directions

Implementation Note: Always use np.linalg.eigh (not np.linalg.eig) for symmetric matrices. The eigh variant exploits symmetry for faster computation, guarantees real eigenvalues, and returns orthonormal eigenvectors. Using eig on a symmetric matrix wastes computation and can introduce numerical artifacts.
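The rank-1 expansion $\mathbf{A} = \sum_i \lambda_i \mathbf{q}_i \mathbf{q}_i^\top$ can be checked directly — a short sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2  # Symmetrize so the spectral theorem applies

eigenvalues, Q = np.linalg.eigh(A)

# Rebuild A as a sum of rank-1 projections lambda_i * q_i q_i^T
A_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigenvalues, Q.T))
assert np.allclose(A, A_rebuilt)

# Each term in the sum is a rank-1 matrix
assert all(np.linalg.matrix_rank(np.outer(q, q)) == 1 for q in Q.T)
print("A reconstructed from", len(eigenvalues), "rank-1 modes")
```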

Positive Definiteness

A symmetric matrix $\mathbf{A}$ is positive definite if $\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0$ for all nonzero $\mathbf{x}$. It is positive semi-definite (PSD) if $\mathbf{x}^\top \mathbf{A} \mathbf{x} \geq 0$.

By the spectral theorem, $\mathbf{A}$ is positive definite if and only if all its eigenvalues are strictly positive. This concept appears everywhere:

  • Covariance matrices are PSD (and positive definite when no feature is a linear combination of others)
  • The Hessian of a strictly convex function is positive definite at the minimum
  • Kernel matrices must be PSD for kernel methods to be well-defined
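A practical way to test definiteness — a sketch using eigenvalues alongside Cholesky, which fails unless the matrix is positive definite:

```python
import numpy as np

def is_positive_definite(A: np.ndarray) -> bool:
    """Check positive definiteness via Cholesky (raises iff not PD)."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A_pd = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1, 3
A_psd = np.array([[1.0, 1.0], [1.0, 1.0]])    # eigenvalues 0, 2
A_indef = np.array([[0.0, 1.0], [1.0, 0.0]])  # eigenvalues -1, 1

for name, M in [("PD", A_pd), ("PSD", A_psd), ("indefinite", A_indef)]:
    eigs = np.linalg.eigvalsh(M)
    print(f"{name:>10s}: eigenvalues = {eigs}, Cholesky ok = {is_positive_definite(M)}")
```

Cholesky is the cheaper test in practice: it costs a third of an eigendecomposition and stops at the first non-positive pivot.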

Eigendecomposition and PCA

Here is the explicit connection that makes eigendecomposition indispensable:

Principal Component Analysis computes the eigenvectors of the covariance matrix $\mathbf{C} = \frac{1}{n-1}(\mathbf{X} - \bar{\mathbf{X}})^\top(\mathbf{X} - \bar{\mathbf{X}})$.

The eigenvector with the largest eigenvalue is the direction of maximum variance. The eigenvalue itself is the variance along that direction. Projecting data onto the top $k$ eigenvectors gives the best rank-$k$ approximation to the centered data (in the least-squares sense).

def pca_via_eigendecomposition(X: np.ndarray, n_components: int) -> dict:
    """Compute PCA using eigendecomposition of the covariance matrix.

    Args:
        X: Data matrix of shape (n_samples, n_features).
        n_components: Number of principal components to retain.

    Returns:
        Dictionary with PCA results.
    """
    # Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Covariance matrix
    n = X.shape[0]
    C = (X_centered.T @ X_centered) / (n - 1)

    # Eigendecomposition (C is symmetric → use eigh)
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Sort descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Select top k components
    components = eigenvectors[:, :n_components]  # (d, k)

    # Project data
    X_projected = X_centered @ components  # (n, k)

    # Explained variance
    total_variance = eigenvalues.sum()
    explained_variance_ratio = eigenvalues[:n_components] / total_variance

    return {
        "components": components,
        "projected": X_projected,
        "eigenvalues": eigenvalues[:n_components],
        "explained_variance_ratio": explained_variance_ratio,
        "cumulative_explained_variance": np.cumsum(explained_variance_ratio),
    }

# Demo: 5D data with 2 dominant directions
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5))
X += 0.1 * rng.standard_normal((500, 5))  # Add noise

result = pca_via_eigendecomposition(X, n_components=3)
print("Explained variance ratios:", result["explained_variance_ratio"])
# Expected: first two components capture ~99% of variance
print("Cumulative:", result["cumulative_explained_variance"])

We will see in the next section that SVD provides a more numerically stable and more general route to PCA — but the eigendecomposition derivation is the one that reveals why PCA works.


1.4 — Singular Value Decomposition: The Fundamental Theorem of Data Science

If you learn one decomposition from this chapter, it should be the SVD. It applies to any matrix (not just square matrices), it reveals the rank, the range, the null space, the best low-rank approximation, and the condition number — all in a single computation.

The Decomposition

Mathematical Foundation: Singular Value Decomposition. Every matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be factored as:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

where:

  • $\mathbf{U} \in \mathbb{R}^{m \times m}$ is orthogonal: its columns are the left singular vectors
  • $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is diagonal (in the generalized sense), with entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ and $\sigma_{r+1} = \cdots = 0$, where $r = \text{rank}(\mathbf{A})$
  • $\mathbf{V} \in \mathbb{R}^{n \times n}$ is orthogonal: its columns are the right singular vectors

Derivation from eigendecomposition. The SVD is intimately connected to the eigendecomposition of two symmetric matrices:

  • $\mathbf{A}^\top \mathbf{A} = \mathbf{V} \mathbf{\Sigma}^\top \mathbf{\Sigma} \mathbf{V}^\top$: the eigenvectors of $\mathbf{A}^\top \mathbf{A}$ are the right singular vectors, and the eigenvalues are $\sigma_i^2$.
  • $\mathbf{A} \mathbf{A}^\top = \mathbf{U} \mathbf{\Sigma} \mathbf{\Sigma}^\top \mathbf{U}^\top$: the eigenvectors of $\mathbf{A} \mathbf{A}^\top$ are the left singular vectors.

Both $\mathbf{A}^\top \mathbf{A}$ and $\mathbf{A} \mathbf{A}^\top$ are symmetric and positive semi-definite, so the spectral theorem applies. The singular values are the square roots of the (shared, nonzero) eigenvalues.

Proof sketch. Let $\mathbf{A}^\top \mathbf{A} = \mathbf{V} \mathbf{D} \mathbf{V}^\top$ be the eigendecomposition, with $\mathbf{D} = \text{diag}(d_1, \ldots, d_n)$ and $d_i \geq 0$ (since $\mathbf{A}^\top \mathbf{A}$ is PSD). Set $\sigma_i = \sqrt{d_i}$ for $i = 1, \ldots, r$ where $r$ is the number of positive eigenvalues. For each $i \leq r$, define $\mathbf{u}_i = \frac{1}{\sigma_i} \mathbf{A} \mathbf{v}_i$. Then $\mathbf{A} \mathbf{v}_i = \sigma_i \mathbf{u}_i$ and one can verify that $\{\mathbf{u}_i\}$ are orthonormal. Extend to an orthonormal basis for $\mathbb{R}^m$ to form $\mathbf{U}$.

The Compact (Thin) SVD

In practice, when $m \gg n$ (many more rows than columns), we use the thin SVD: $\mathbf{A} = \mathbf{U}_r \mathbf{\Sigma}_r \mathbf{V}_r^\top$ where $\mathbf{U}_r \in \mathbb{R}^{m \times r}$, $\mathbf{\Sigma}_r \in \mathbb{R}^{r \times r}$, and $\mathbf{V}_r \in \mathbb{R}^{n \times r}$.

def svd_analysis(A: np.ndarray) -> dict:
    """Compute SVD and analyze the matrix structure.

    Args:
        A: Matrix of shape (m, n).

    Returns:
        Dictionary with SVD components and analysis.
    """
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

    # Effective rank: number of singular values above a threshold
    tol = max(A.shape) * sigma[0] * np.finfo(A.dtype).eps
    effective_rank = np.sum(sigma > tol)

    # Condition number: ratio of largest to smallest nonzero singular value
    condition_number = sigma[0] / sigma[effective_rank - 1] if effective_rank > 0 else np.inf

    # Energy captured by top-k singular values
    total_energy = np.sum(sigma**2)
    cumulative_energy = np.cumsum(sigma**2) / total_energy

    return {
        "U": U,
        "sigma": sigma,
        "Vt": Vt,
        "rank": effective_rank,
        "condition_number": condition_number,
        "singular_values": sigma,
        "cumulative_energy": cumulative_energy,
    }

# Example: 100x50 matrix with effective rank 5
rng = np.random.default_rng(42)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 50))
A += 0.01 * rng.standard_normal((100, 50))  # Noise

result = svd_analysis(A)
print(f"Effective rank: {result['rank']}")
print(f"Top 5 singular values: {result['singular_values'][:5].round(2)}")
print(f"Energy in top 5: {result['cumulative_energy'][4]:.4f}")

The Eckart-Young Theorem: Best Low-Rank Approximation

This is arguably the most important theorem in applied data science.

Mathematical Foundation: Eckart-Young-Mirsky Theorem. The best rank-$k$ approximation to $\mathbf{A}$ in both the Frobenius norm and the spectral (operator) norm is given by the truncated SVD:

$$\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^\top$$

The approximation error is:

$$\|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}$$

This is why the singular value spectrum matters: it tells you how well a matrix can be compressed. If the singular values decay rapidly, a low-rank approximation captures most of the information. If they decay slowly, the data is intrinsically high-dimensional.

def low_rank_approximation(
    A: np.ndarray,
    k: int
) -> tuple[np.ndarray, float]:
    """Compute the best rank-k approximation via truncated SVD.

    Args:
        A: Matrix of shape (m, n).
        k: Target rank.

    Returns:
        Tuple of (A_k, relative_error) where A_k is the rank-k approximation.
    """
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

    # Truncate to rank k
    A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

    # Approximation error
    error = np.sqrt(np.sum(sigma[k:]**2))
    relative_error = error / np.sqrt(np.sum(sigma**2))

    return A_k, relative_error

# Demonstrate: how many singular values do we need?
rng = np.random.default_rng(42)
A = rng.standard_normal((200, 100))

print("Rank k | Relative Error | % Energy Captured")
print("-" * 50)
for k in [1, 5, 10, 20, 50, 90]:
    _, rel_err = low_rank_approximation(A, k)
    energy = 1 - rel_err**2
    print(f"  {k:>4d}  |    {rel_err:.4f}      |     {energy*100:.1f}%")

SVD and PCA: The Connection Made Explicit

PCA can be computed via SVD of the centered data matrix, bypassing the explicit construction of the covariance matrix. This is the numerically preferred approach.

Let $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$ be the centered data matrix ($n \times d$). Then:

$$\tilde{\mathbf{X}} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

The principal components are the columns of $\mathbf{V}$ (the right singular vectors). The projected data is $\tilde{\mathbf{X}} \mathbf{V} = \mathbf{U} \mathbf{\Sigma}$. The eigenvalues of the covariance matrix are $\lambda_i = \sigma_i^2 / (n - 1)$.
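The relationship $\lambda_i = \sigma_i^2 / (n - 1)$ can be verified numerically — a minimal sketch comparing both routes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix (descending order)
C = Xc.T @ Xc / (n - 1)
eig_cov = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: SVD of the centered data matrix (sigma already descending)
sigma = np.linalg.svd(Xc, compute_uv=False)
eig_svd = sigma**2 / (n - 1)

assert np.allclose(eig_cov, eig_svd)
print("Covariance eigenvalues match sigma_i^2/(n-1):", eig_cov.round(4))
```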

Common Misconception: "PCA requires computing the covariance matrix." Computing $\mathbf{X}^\top \mathbf{X}$ squares the condition number: if $\kappa(\mathbf{X}) = 10^4$, then $\kappa(\mathbf{X}^\top \mathbf{X}) = 10^8$. This means the eigendecomposition of the covariance matrix can lose up to 8 digits of precision in double precision arithmetic ($\sim$16 digits). The SVD of $\mathbf{X}$ directly is numerically superior and is what sklearn.decomposition.PCA actually computes.
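The squaring of the condition number is easy to demonstrate — a sketch that builds $\mathbf{X}$ with a prescribed condition number of $10^4$ from its SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct X = U diag(sigma) V^T with condition number sigma_max/sigma_min = 1e4
U, _ = np.linalg.qr(rng.standard_normal((100, 5)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # orthogonal
sigma = np.array([1e4, 50.0, 10.0, 2.0, 1.0])
X = U @ np.diag(sigma) @ V.T

kappa_X = np.linalg.cond(X)
kappa_XtX = np.linalg.cond(X.T @ X)
print(f"cond(X)     = {kappa_X:.3e}")   # ~1e4
print(f"cond(X^T X) = {kappa_XtX:.3e}") # ~1e8 — squared
```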

def pca_via_svd(X: np.ndarray, n_components: int) -> dict:
    """Compute PCA using SVD (numerically stable).

    This is how scikit-learn's PCA actually works internally.

    Args:
        X: Data matrix of shape (n_samples, n_features).
        n_components: Number of components to retain.

    Returns:
        Dictionary with PCA results.
    """
    n = X.shape[0]
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Thin SVD
    U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Principal components are rows of Vt (columns of V)
    components = Vt[:n_components, :]  # (k, d)

    # Projected data: X_centered @ V = U @ Sigma
    X_projected = U[:, :n_components] * sigma[:n_components]  # (n, k)

    # Eigenvalues of covariance matrix
    eigenvalues = sigma[:n_components]**2 / (n - 1)

    # Explained variance ratio
    total_variance = np.sum(sigma**2) / (n - 1)
    explained_variance_ratio = eigenvalues / total_variance

    return {
        "components": components,
        "projected": X_projected,
        "explained_variance_ratio": explained_variance_ratio,
        "singular_values": sigma[:n_components],
    }

SVD Reveals the Four Fundamental Subspaces

The SVD gives us all four fundamental subspaces simultaneously:

  • Column space $C(\mathbf{A})$: spanned by the first $r$ columns of $\mathbf{U}$ (corresponding to nonzero singular values)
  • Left null space $N(\mathbf{A}^\top)$: spanned by the remaining $m - r$ columns of $\mathbf{U}$
  • Row space $C(\mathbf{A}^\top)$: spanned by the first $r$ columns of $\mathbf{V}$ (first $r$ rows of $\mathbf{V}^\top$)
  • Null space $N(\mathbf{A})$: spanned by the remaining $n - r$ columns of $\mathbf{V}$

This is one of the deepest results in numerical linear algebra: a single computation reveals the complete geometric structure of a matrix.
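A sketch recovering two of the four subspaces from a single SVD and verifying their defining properties:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 6x8 matrix of rank 3, built as a product of thin factors
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))
U, sigma, Vt = np.linalg.svd(A)  # full_matrices=True by default

r = int(np.sum(sigma > 1e-10 * sigma[0]))
print("rank:", r)  # 3

# Null space N(A): last n - r rows of V^T. A annihilates every basis vector.
null_basis = Vt[r:, :]  # shape (n - r, n) = (5, 8)
assert np.allclose(A @ null_basis.T, 0, atol=1e-10)

# Column space C(A): first r columns of U. Projecting onto their span
# leaves every column of A unchanged.
proj = U[:, :r] @ U[:, :r].T
assert np.allclose(proj @ A, A)
```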

Norms, Trace, and Determinant

Three scalar summaries of a matrix appear constantly in ML:

The Frobenius norm $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\text{tr}(\mathbf{A}^\top \mathbf{A})} = \sqrt{\sum_i \sigma_i^2}$ measures the total "energy" of a matrix. It equals the $\ell_2$ norm of the vector of singular values.

The trace $\text{tr}(\mathbf{A}) = \sum_i a_{ii} = \sum_i \lambda_i$ is the sum of diagonal elements, which equals the sum of eigenvalues. It is invariant under cyclic permutations: $\text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA})$. This identity appears constantly in deriving matrix calculus expressions.

The determinant $\det(\mathbf{A}) = \prod_i \lambda_i$ is the product of eigenvalues. It measures the volume scaling factor of the linear transformation. In ML, it appears in the normalization constant of multivariate Gaussian distributions: $(2\pi)^{-n/2} |\det(\mathbf{\Sigma})|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$.

# The three scalar summaries
A = rng.standard_normal((5, 5))
A = A + A.T  # Make symmetric for real eigenvalues

eigenvalues = np.linalg.eigvalsh(A)
sigma = np.linalg.svd(A, compute_uv=False)

print(f"Frobenius norm: {np.linalg.norm(A, 'fro'):.4f}")
print(f"  = sqrt(sum sigma_i^2): {np.sqrt(np.sum(sigma**2)):.4f}")

print(f"Trace: {np.trace(A):.4f}")
print(f"  = sum eigenvalues: {np.sum(eigenvalues):.4f}")

print(f"Determinant: {np.linalg.det(A):.4f}")
print(f"  = prod eigenvalues: {np.prod(eigenvalues):.4f}")

1.5 — Matrix Factorization for Collaborative Filtering

The connection to machine learning is immediate and powerful. Let us apply SVD to the Content Platform Recommender — the progressive project that will thread through this entire book.

The Recommendation Problem as Matrix Completion

StreamRec has 5 million users and 200,000 content items. Each user interacts with a tiny fraction of items (watches, rates, or skips). We can represent this as a matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ where $R_{ij}$ is user $i$'s rating of item $j$. Most entries are missing — the matrix is extremely sparse.

Research Insight: The foundational paper for matrix factorization in recommender systems is Koren, Bell, and Volinsky (2009), "Matrix Factorization Techniques for Recommender Systems," published in IEEE Computer. Their approach, which won the Netflix Prize, models the rating matrix as the product of two low-rank factor matrices: $\mathbf{R} \approx \mathbf{P} \mathbf{Q}^\top$, where $\mathbf{P} \in \mathbb{R}^{m \times k}$ encodes user preferences and $\mathbf{Q} \in \mathbb{R}^{n \times k}$ encodes item characteristics in a shared latent space of dimension $k$.

The intuition: if a user's tastes and an item's characteristics can each be described by $k$ latent factors (e.g., genre preference, production quality, length preference), then the rating is the dot product of these factor vectors. SVD provides the mathematically optimal factorization.

def svd_recommendation(
    R: np.ndarray,
    k: int,
    mask: np.ndarray
) -> tuple[np.ndarray, dict]:
    """Simple SVD-based recommendation via low-rank approximation.

    This is the baseline approach. Production systems use iterative methods
    (ALS, SGD) that handle missing values directly — see Chapter 2's
    progressive project milestone.

    Args:
        R: Rating matrix of shape (n_users, n_items). Missing entries are 0.
        k: Number of latent factors.
        mask: Binary matrix indicating observed entries.

    Returns:
        Tuple of (predicted_ratings, analysis_dict).
    """
    # Fill missing with row means (simple imputation)
    row_means = np.where(mask.sum(axis=1, keepdims=True) > 0,
                         (R * mask).sum(axis=1, keepdims=True) /
                         np.maximum(mask.sum(axis=1, keepdims=True), 1),
                         0.0)
    R_filled = np.where(mask, R, row_means)

    # SVD
    U, sigma, Vt = np.linalg.svd(R_filled, full_matrices=False)

    # Truncate to rank k
    R_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

    # Analyze singular value spectrum
    total_energy = np.sum(sigma**2)
    cumulative_energy = np.cumsum(sigma**2) / total_energy

    analysis = {
        "singular_values": sigma,
        "cumulative_energy": cumulative_energy,
        "energy_at_k": cumulative_energy[k - 1],
        "rmse_observed": np.sqrt(np.mean((R_hat[mask] - R[mask])**2)),
        "sparsity": 1 - mask.sum() / mask.size,
    }

    return R_hat, analysis

Production Reality: The naive SVD approach above has a critical flaw: it imputes missing values before decomposition, which contaminates the factorization. Production recommender systems use alternating least squares (ALS) or stochastic gradient descent (SGD) to minimize a loss function defined only over observed entries: $\min_{\mathbf{P}, \mathbf{Q}} \sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{p}_i^\top \mathbf{q}_j)^2 + \lambda(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2)$. We implement this properly in Chapter 2's progressive project.
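A minimal sketch of evaluating that observed-entries objective (the factor matrices P and Q here are random placeholders, not trained — the training loop itself is deferred to Chapter 2):

```python
import numpy as np

def mf_loss(R: np.ndarray, mask: np.ndarray, P: np.ndarray,
            Q: np.ndarray, lam: float) -> float:
    """Regularized squared error over observed entries only."""
    obs = mask.astype(bool)
    residual = (R - P @ Q.T)[obs]  # skip unobserved entries entirely
    return float(np.sum(residual**2) + lam * (np.sum(P**2) + np.sum(Q**2)))

rng = np.random.default_rng(0)
m, n, k = 20, 15, 4
R = rng.integers(1, 6, size=(m, n)).astype(float)  # ratings 1-5
mask = rng.random((m, n)) < 0.2                    # ~20% observed, 80% missing

P = 0.1 * rng.standard_normal((m, k))  # user factors (random init)
Q = 0.1 * rng.standard_normal((n, k))  # item factors (random init)
print(f"loss = {mf_loss(R, mask, P, Q, lam=0.1):.2f}")
```

Because the sum runs only over observed entries, missing values never contaminate the objective — the flaw of the imputation-based SVD baseline above.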


1.6 — Orthogonal Matrices and Geometric Transformations

An orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ satisfies $\mathbf{Q}^\top \mathbf{Q} = \mathbf{Q} \mathbf{Q}^\top = \mathbf{I}$. Equivalently, its columns (and rows) form an orthonormal basis for $\mathbb{R}^n$.

Key properties:

  • $\mathbf{Q}^{-1} = \mathbf{Q}^\top$ (the inverse is free to compute)
  • $\|\mathbf{Qx}\|_2 = \|\mathbf{x}\|_2$ (orthogonal transformations preserve lengths)
  • $\det(\mathbf{Q}) = \pm 1$ ($+1$ for rotations, $-1$ for reflections)
  • All eigenvalues have magnitude 1

Orthogonal matrices appear in SVD ($\mathbf{U}$ and $\mathbf{V}$), in the spectral theorem ($\mathbf{Q}$), in QR decomposition, and in Householder reflections. They are the "nice" transformations: they rotate and reflect but do not stretch or compress.

In deep learning, orthogonal initialization of weight matrices helps prevent gradient explosion and vanishing at initialization. The intuition: if $\mathbf{W}$ is orthogonal, then $\|\mathbf{Wx}\| = \|\mathbf{x}\|$, so signals propagate through layers without amplification or attenuation.
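
This claim is easy to check directly. The sketch below uses PyTorch's built-in orthogonal initializer, `torch.nn.init.orthogonal_`, on a square linear layer; the layer size is arbitrary.

```python
import torch

# Initialize a square linear layer's weight matrix to be orthogonal
layer = torch.nn.Linear(256, 256, bias=False)
torch.nn.init.orthogonal_(layer.weight)

# Norm preservation: ||Wx|| should equal ||x|| up to float error
x = torch.randn(256)
with torch.no_grad():
    y = layer(x)
print(f"||x||  = {x.norm():.4f}")
print(f"||Wx|| = {y.norm():.4f}")
```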

# Generate a random orthogonal matrix via QR decomposition
def random_orthogonal(n: int, rng: np.random.Generator) -> np.ndarray:
    """Generate a uniformly random orthogonal matrix (Haar measure).

    Args:
        n: Matrix dimension.
        rng: NumPy random generator.

    Returns:
        Orthogonal matrix of shape (n, n).
    """
    H = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(H)
    # Ensure uniqueness: multiply columns by sign of diagonal of R
    Q = Q @ np.diag(np.sign(np.diag(R)))
    return Q

Q = random_orthogonal(5, rng)
print("Q^T Q ≈ I:", np.allclose(Q.T @ Q, np.eye(5)))
print("det(Q):", np.linalg.det(Q).round(4))

# Orthogonal transformations preserve norms
x = rng.standard_normal(5)
print(f"||x|| = {np.linalg.norm(x):.4f}")
print(f"||Qx|| = {np.linalg.norm(Q @ x):.4f}")

1.7 — Matrix Calculus: The Language Backpropagation Speaks

Every time you call loss.backward() in PyTorch, the framework computes derivatives of a scalar loss with respect to matrices and vectors of parameters. Understanding what it computes — and how — requires matrix calculus.

Derivatives of Scalar Functions with Respect to Vectors

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a scalar-valued function of a vector $\mathbf{x}$. The gradient is the vector of partial derivatives:

$$\nabla_\mathbf{x} f = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n$$

Essential identities:

| Function $f(\mathbf{x})$ | Gradient $\nabla_\mathbf{x} f$ |
| --- | --- |
| $\mathbf{a}^\top \mathbf{x}$ | $\mathbf{a}$ |
| $\mathbf{x}^\top \mathbf{A} \mathbf{x}$ | $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ |
| $\mathbf{x}^\top \mathbf{x} = \|\mathbf{x}\|_2^2$ | $2\mathbf{x}$ |
| $\|\mathbf{Ax} - \mathbf{b}\|_2^2$ | $2\mathbf{A}^\top(\mathbf{Ax} - \mathbf{b})$ |
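
These identities are worth verifying at least once. The following sketch spot-checks two of them against central differences; the helper `numerical_gradient` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
a = rng.standard_normal(4)
x = rng.standard_normal(4)

# x^T A x  ->  (A + A^T) x
assert np.allclose(numerical_gradient(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4)
# a^T x  ->  a
assert np.allclose(numerical_gradient(lambda v: a @ v, x), a, atol=1e-4)
print("Gradient identities verified numerically")
```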

Mathematical Foundation: Deriving the OLS gradient. The ordinary least squares loss is $L(\mathbf{w}) = \|\mathbf{Xw} - \mathbf{y}\|_2^2 = (\mathbf{Xw} - \mathbf{y})^\top (\mathbf{Xw} - \mathbf{y})$.

Expanding: $L = \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} - 2\mathbf{y}^\top \mathbf{X} \mathbf{w} + \mathbf{y}^\top \mathbf{y}$

Taking the gradient with respect to $\mathbf{w}$:

$$\nabla_\mathbf{w} L = 2\mathbf{X}^\top \mathbf{X} \mathbf{w} - 2\mathbf{X}^\top \mathbf{y}$$

Setting to zero: $\mathbf{X}^\top \mathbf{X} \mathbf{w}^* = \mathbf{X}^\top \mathbf{y}$ — the normal equations.

The solution $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$ is the pseudoinverse projection. The condition number of $\mathbf{X}^\top \mathbf{X}$ determines numerical stability — and this is why regularization (adding $\lambda \mathbf{I}$) helps.
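
A quick illustration of the conditioning point: with a nearly collinear feature, the condition number of $\mathbf{X}^\top \mathbf{X}$ is enormous, and adding $\lambda \mathbf{I}$ shrinks it. The value $\lambda = 1$ here is illustrative, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Make the last column nearly collinear with the fourth
X = rng.standard_normal((100, 5))
X[:, 4] = X[:, 3] + 1e-4 * rng.standard_normal(100)

XtX = X.T @ X
lam = 1.0
print(f"cond(X^T X)           = {np.linalg.cond(XtX):.2e}")
print(f"cond(X^T X + lam*I)   = {np.linalg.cond(XtX + lam * np.eye(5)):.2e}")
```

Since the condition number of $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$ is $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$, even a small $\lambda$ caps how ill-conditioned the regularized system can be.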

def ols_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute the gradient of the OLS loss.

    L(w) = ||Xw - y||^2
    ∇L = 2 X^T (Xw - y)

    Args:
        X: Data matrix (n, d).
        y: Target vector (n,).
        w: Weight vector (d,).

    Returns:
        Gradient vector (d,).
    """
    residual = X @ w - y
    return 2 * X.T @ residual

# Verify: gradient is zero at the OLS solution
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 2.0, 3.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Closed-form solution
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient at the solution should be ~zero
grad_at_optimum = ols_gradient(X, y, w_star)
print(f"||gradient at w*||: {np.linalg.norm(grad_at_optimum):.2e}")
# Should be ~1e-13 (numerical precision)

The Jacobian Matrix

When the function is vector-valued, $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the derivative is the Jacobian matrix:

$$\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

Row $i$ of the Jacobian is the gradient of the $i$-th output with respect to all inputs. The Jacobian is the matrix that best approximates $\mathbf{f}$ locally: $\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J} \boldsymbol{\delta}$.

In neural networks, each layer computes a function $\mathbf{f}_l: \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$. Backpropagation computes the product of Jacobians:

$$\frac{\partial L}{\partial \mathbf{x}_0} = \frac{\partial L}{\partial \mathbf{x}_L} \cdot \mathbf{J}_L \cdot \mathbf{J}_{L-1} \cdots \mathbf{J}_1$$

If any Jacobian has singular values consistently greater than 1, gradients explode. If consistently less than 1, they vanish. This is why understanding singular values matters for deep learning.
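
This compounding effect is easy to simulate. The sketch below pushes a unit gradient through a stack of random linear Jacobians whose typical singular value is controlled by a scale parameter; the depth, width, and scales are illustrative constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50

def propagated_norm(scale):
    """Push a unit gradient through `depth` random linear Jacobians."""
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Random Jacobian whose typical singular value is ~`scale`
        J = scale * rng.standard_normal((n, n)) / np.sqrt(n)
        g = J @ g
    return np.linalg.norm(g)

exploded = propagated_norm(1.2)
vanished = propagated_norm(0.8)
print(f"scale 1.2 after {depth} layers: {exploded:.3e}")  # grows roughly like 1.2^50
print(f"scale 0.8 after {depth} layers: {vanished:.3e}")  # shrinks roughly like 0.8^50
```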

from typing import Callable

def numerical_jacobian(
    f: Callable[[np.ndarray], np.ndarray],
    x: np.ndarray,
    eps: float = 1e-7
) -> np.ndarray:
    """Compute the Jacobian matrix via finite differences.

    Args:
        f: Function mapping R^n -> R^m.
        x: Point at which to evaluate the Jacobian.
        eps: Finite difference step size.

    Returns:
        Jacobian matrix of shape (m, n).
    """
    f0 = f(x)
    m = len(f0)
    n = len(x)
    J = np.zeros((m, n))

    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += eps
        x_minus = x.copy()
        x_minus[j] -= eps
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * eps)

    return J

# Example: Jacobian of a linear transformation f(x) = Ax is A
A = rng.standard_normal((3, 4))
f_linear = lambda x: A @ x
x = rng.standard_normal(4)

J = numerical_jacobian(f_linear, x)
print("Jacobian ≈ A:", np.allclose(J, A, atol=1e-5))

The Hessian Matrix

The Hessian is the matrix of second-order partial derivatives of a scalar function:

$$\mathbf{H} = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

The Hessian is symmetric (by Clairaut's theorem, assuming sufficient smoothness). It encodes the curvature of the loss landscape:

  • Positive definite Hessian → local minimum (the loss curves upward in every direction)
  • Negative definite Hessian → local maximum
  • Indefinite Hessian (both positive and negative eigenvalues) → saddle point
  • Eigenvalues of the Hessian determine the conditioning of the optimization problem: the ratio $\lambda_{\max} / \lambda_{\min}$ is the condition number, and it controls how fast gradient descent converges
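
For a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top \mathbf{H} \mathbf{x}$ the Hessian is constant, so the classification rule can be demonstrated directly. This is a small illustrative sketch; the helper name is hypothetical.

```python
import numpy as np

def classify_stationary_point(H: np.ndarray) -> str:
    """Classify a stationary point from the (symmetric) Hessian's eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2): a saddle at the origin
print(classify_stationary_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_stationary_point(np.diag([2.0, 6.0])))   # local minimum

# The eigenvalue ratio is the condition number of the quadratic
eigvals = np.linalg.eigvalsh(np.diag([100.0, 1.0]))
print(f"condition number: {eigvals[-1] / eigvals[0]:.0f}")
```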

Research Insight: Sagun et al. (2017) and Ghorbani et al. (2019) showed that the Hessian of deep network loss functions has a characteristic "bulk + outliers" spectrum: a large cluster of eigenvalues near zero (flat directions) and a few large positive eigenvalues (sharp directions). This structure explains why SGD with moderate learning rates generalizes well — it avoids the sharp minima associated with the large eigenvalues.

Derivatives with Respect to Matrices

For a scalar function $f: \mathbb{R}^{m \times n} \to \mathbb{R}$, the derivative with respect to a matrix $\mathbf{W}$ is a matrix of the same shape:

$$\frac{\partial f}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial f}{\partial W_{11}} & \cdots & \frac{\partial f}{\partial W_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial W_{m1}} & \cdots & \frac{\partial f}{\partial W_{mn}} \end{bmatrix}$$

Essential identities for ML:

| Function | Derivative w.r.t. $\mathbf{W}$ |
| --- | --- |
| $\text{tr}(\mathbf{AW})$ | $\mathbf{A}^\top$ |
| $\text{tr}(\mathbf{W}^\top \mathbf{A})$ | $\mathbf{A}$ |
| $\text{tr}(\mathbf{W}^\top \mathbf{A} \mathbf{W})$ | $(\mathbf{A} + \mathbf{A}^\top)\mathbf{W}$ |
| $\|\mathbf{XW} - \mathbf{Y}\|_F^2$ | $2\mathbf{X}^\top(\mathbf{XW} - \mathbf{Y})$ |
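
As with the vector identities, these can be spot-checked by finite differences. The helper below (`numerical_matrix_gradient`, a hypothetical name) perturbs each entry of $\mathbf{W}$ in turn.

```python
import numpy as np

def numerical_matrix_gradient(f, W, eps=1e-6):
    """Central-difference gradient of a scalar function w.r.t. a matrix."""
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        G[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 3))

# d/dW tr(AW) = A^T
G = numerical_matrix_gradient(lambda M: np.trace(A @ M), W)
assert np.allclose(G, A.T, atol=1e-5)

# d/dW ||XW - Y||_F^2 = 2 X^T (XW - Y)
X = rng.standard_normal((10, 4))
Y = rng.standard_normal((10, 3))
G2 = numerical_matrix_gradient(lambda M: np.sum((X @ M - Y) ** 2), W)
assert np.allclose(G2, 2 * X.T @ (X @ W - Y), atol=1e-4)
print("Matrix derivative identities verified numerically")
```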

Mathematical Foundation: Deriving the gradient for a neural network layer. Consider a single linear layer: $\mathbf{Z} = \mathbf{XW} + \mathbf{b}$ followed by activation $\mathbf{A} = \phi(\mathbf{Z})$, where $\mathbf{X} \in \mathbb{R}^{n \times d}$, $\mathbf{W} \in \mathbb{R}^{d \times h}$, $\mathbf{b} \in \mathbb{R}^{1 \times h}$.

Given the upstream gradient $\frac{\partial L}{\partial \mathbf{A}} \in \mathbb{R}^{n \times h}$ (from the next layer's backpropagation), we need $\frac{\partial L}{\partial \mathbf{W}}$ and $\frac{\partial L}{\partial \mathbf{X}}$:

$$\frac{\partial L}{\partial \mathbf{Z}} = \frac{\partial L}{\partial \mathbf{A}} \odot \phi'(\mathbf{Z})$$

$$\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Z}}$$

$$\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \mathbf{Z}} \mathbf{W}^\top$$

This is backpropagation for one layer: multiply the upstream gradient by the transpose of the forward weight matrix.

def linear_layer_forward_backward(
    X: np.ndarray,
    W: np.ndarray,
    b: np.ndarray,
    dL_dA: np.ndarray,
) -> dict:
    """Forward and backward pass for a linear layer with ReLU.

    Forward: Z = XW + b, A = ReLU(Z)
    Backward: Compute gradients w.r.t. W, b, and X.

    Args:
        X: Input (n, d).
        W: Weights (d, h).
        b: Bias (1, h).
        dL_dA: Upstream gradient (n, h).

    Returns:
        Dictionary with forward outputs and gradients.
    """
    # Forward
    Z = X @ W + b
    A = np.maximum(Z, 0)  # ReLU

    # Backward
    dL_dZ = dL_dA * (Z > 0).astype(float)  # ReLU derivative
    dL_dW = X.T @ dL_dZ                      # (d, h)
    dL_db = dL_dZ.sum(axis=0, keepdims=True)  # (1, h)
    dL_dX = dL_dZ @ W.T                      # (n, d)

    return {
        "Z": Z, "A": A,
        "dL_dW": dL_dW, "dL_db": dL_db, "dL_dX": dL_dX,
    }
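
The backward formulas are easiest to trust after a numerical check. This standalone sketch re-implements the forward pass inline and compares the analytic $\partial L / \partial \mathbf{W}$ against central differences for a scalar proxy loss; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal((1, 2))
dL_dA = rng.standard_normal((4, 2))  # stand-in upstream gradient

def loss(W_):
    """Proxy loss sum(dL_dA * ReLU(XW + b)); its W-gradient matches backprop."""
    return np.sum(dL_dA * np.maximum(X @ W_ + b, 0))

# Analytic gradient from the formulas in the text
Z = X @ W + b
dL_dZ = dL_dA * (Z > 0)
dL_dW = X.T @ dL_dZ

# Central-difference check, entry by entry
eps = 1e-6
G = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    G[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)

print("max abs diff:", np.abs(G - dL_dW).max())  # near machine precision
```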

1.8 — Kronecker Products and Tensors

The Kronecker Product

The Kronecker product $\mathbf{A} \otimes \mathbf{B}$ of matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$ is the $mp \times nq$ block matrix:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$

Key properties:

  • $(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{AC}) \otimes (\mathbf{BD})$ (mixed-product property)
  • $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$ (when both inverses exist)
  • Eigenvalues of $\mathbf{A} \otimes \mathbf{B}$ are all products $\lambda_i(\mathbf{A}) \cdot \lambda_j(\mathbf{B})$

The Kronecker product arises in:

  • Vectorizing matrix equations: $\text{vec}(\mathbf{AXB}) = (\mathbf{B}^\top \otimes \mathbf{A}) \text{vec}(\mathbf{X})$
  • Multi-task learning (shared representations across tasks)
  • Second-order optimization via Kronecker-factored approximations to Fisher information matrices (K-FAC)

# Kronecker product
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

K = np.kron(A, B)
print("A ⊗ B:\n", K)
print(f"Shape: {A.shape} ⊗ {B.shape} = {K.shape}")
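
The vectorization identity from the list above can be checked in a few lines. Note that $\text{vec}$ stacks columns, which corresponds to `order="F"` in numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
Xm = rng.standard_normal((4, 2))
B = rng.standard_normal((2, 5))

# vec(AXB) = (B^T kron A) vec(X), with column-stacking (Fortran-order) vec
vec = lambda M: M.flatten(order="F")
lhs = vec(A @ Xm @ B)
rhs = np.kron(B.T, A) @ vec(Xm)
print("vec(AXB) == (B^T ⊗ A) vec(X):", np.allclose(lhs, rhs))
```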

Advanced Sidebar: The K-FAC optimizer (Martens and Grosse, 2015) approximates the Fisher information matrix of a neural network as a Kronecker product of two smaller matrices — one for the input activations and one for the output gradients. This makes second-order optimization tractable for large networks: instead of inverting an $n \times n$ matrix (where $n$ is the total number of parameters), you invert two much smaller matrices. The Kronecker approximation works because the Fisher for a linear layer has an exact Kronecker structure when the input and gradient are independent.

Tensors

A tensor is a multi-dimensional array. In the mathematical sense, a tensor of order $k$ is an element of the tensor product of $k$ vector spaces. In the computational sense (numpy and PyTorch), it is an array with $k$ indices.

| Order | Name | ML Example |
| --- | --- | --- |
| 0 | Scalar | Loss value |
| 1 | Vector | Feature vector, bias |
| 2 | Matrix | Weight matrix, covariance matrix, data batch |
| 3 | 3-tensor | Color image batch (batch × channels × pixels), word embeddings in a sequence |
| 4 | 4-tensor | CNN input (batch × channels × height × width) |

The Climate Deep Learning anchor example represents global temperature data as a 3-tensor $\mathcal{T} \in \mathbb{R}^{L \times G \times T}$ (latitude × longitude × time). Tensor decomposition methods generalize matrix decomposition to extract spatial and temporal patterns — we develop this in Case Study 2.

import torch

# Tensor operations in PyTorch mirror numpy
# but add autograd for automatic differentiation

# A 3-tensor: batch of 32 sequences, each 50 tokens, each 768-dimensional
embeddings = torch.randn(32, 50, 768)
print(f"Shape: {embeddings.shape}")
print(f"Total elements: {embeddings.numel():,}")  # 1,228,800

# Matrix multiplication along specific axes: einsum
W = torch.randn(768, 256)
# Project each embedding: (32, 50, 768) @ (768, 256) -> (32, 50, 256)
projected = torch.einsum("bsh,hd->bsd", embeddings, W)
print(f"Projected shape: {projected.shape}")

# Equivalent using @ (matmul broadcasts over batch dimensions)
projected_alt = embeddings @ W
assert torch.allclose(projected, projected_alt, atol=1e-5)

1.9 — From numpy to PyTorch: The Autograd Bridge

Everything we have done in numpy, PyTorch can do with automatic differentiation. The transition is nearly seamless.

import torch

# numpy -> PyTorch mapping
# np.array        -> torch.tensor
# np.linalg.svd   -> torch.linalg.svd
# np.linalg.eigh  -> torch.linalg.eigh
# A @ B           -> A @ B (identical syntax)
# np.linalg.norm  -> torch.linalg.norm

def svd_comparison():
    """Demonstrate numpy-to-PyTorch equivalence for SVD."""
    rng = np.random.default_rng(42)
    A_np = rng.standard_normal((50, 30))
    A_pt = torch.from_numpy(A_np)

    # numpy SVD
    U_np, s_np, Vt_np = np.linalg.svd(A_np, full_matrices=False)

    # PyTorch SVD
    U_pt, s_pt, Vt_pt = torch.linalg.svd(A_pt, full_matrices=False)

    # Values should match (up to sign conventions)
    assert np.allclose(s_np, s_pt.numpy(), atol=1e-10)
    print("Singular values match between numpy and PyTorch")

svd_comparison()

The critical difference: PyTorch tensors can track gradients.

def autograd_demo():
    """Demonstrate PyTorch autograd computing gradients automatically."""
    # Create tensors with gradient tracking
    X = torch.randn(100, 5)
    y = torch.randn(100)
    W = torch.randn(5, requires_grad=True)

    # Forward pass: compute loss
    predictions = X @ W
    loss = ((predictions - y) ** 2).mean()

    # Backward pass: compute gradients
    loss.backward()

    # W.grad now contains dL/dW
    autograd_gradient = W.grad.clone()

    # Verify against our manual formula: dL/dW = (2/n) * X^T @ (Xw - y)
    with torch.no_grad():
        manual_gradient = (2.0 / len(y)) * X.T @ (X @ W - y)

    print(f"Max difference: {(autograd_gradient - manual_gradient).abs().max():.2e}")
    # Should be ~1e-6 or smaller

autograd_demo()

Implementation Note: When converting between numpy and PyTorch, watch for two common pitfalls. First, torch.from_numpy() shares memory with the original array — modifying one modifies the other. Use .clone() if you need independence. Second, PyTorch uses row-major (C-contiguous) storage by default, same as numpy, but some operations (e.g., transpose) create non-contiguous views. Call .contiguous() before passing to CUDA kernels that require contiguous memory.
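
Both pitfalls are easy to demonstrate in a few lines (a small illustrative sketch):

```python
import numpy as np
import torch

# Pitfall 1: from_numpy shares memory — writes are visible on both sides
a = np.zeros(3)
t = torch.from_numpy(a)
a[0] = 42.0
print(t[0].item())  # 42.0 — the tensor sees the change

# .clone() breaks the link
t_independent = torch.from_numpy(a).clone()
a[1] = 7.0
print(t_independent[1].item())  # 0.0 — the clone is unaffected

# Pitfall 2: transpose creates a non-contiguous view
M = torch.randn(3, 4)
print(M.T.is_contiguous())               # False
print(M.T.contiguous().is_contiguous())  # True
```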

Autograd and the Jacobian

PyTorch's autograd computes exactly the Jacobian-vector products we derived by hand. For a scalar loss $L$ and parameters $\mathbf{W}$, loss.backward() computes $\frac{\partial L}{\partial \mathbf{W}}$ — the same gradient we derived in Section 1.7.

For non-scalar outputs, torch.autograd.functional.jacobian computes the full Jacobian:

def jacobian_with_autograd():
    """Compute the full Jacobian using PyTorch autograd."""
    def f(x: torch.Tensor) -> torch.Tensor:
        """A simple nonlinear function R^3 -> R^2."""
        return torch.stack([
            x[0]**2 + x[1]*x[2],
            torch.sin(x[0]) + x[2]**3
        ])

    x = torch.tensor([1.0, 2.0, 3.0])

    # Full Jacobian via autograd
    J = torch.autograd.functional.jacobian(f, x)
    print("Jacobian (autograd):\n", J)

    # Manual Jacobian for verification
    # df1/dx = [2*x0, x2, x1] = [2, 3, 2]
    # df2/dx = [cos(x0), 0, 3*x2^2] = [cos(1), 0, 27]
    J_manual = torch.tensor([
        [2*x[0], x[2], x[1]],
        [torch.cos(x[0]), 0.0, 3*x[2]**2]
    ])

    print("Jacobian (manual):\n", J_manual)
    print(f"Match: {torch.allclose(J, J_manual, atol=1e-5)}")

jacobian_with_autograd()

This is the bridge: the matrix calculus you derive by hand in this chapter is exactly what PyTorch's autograd computes automatically. Understanding the math means you can debug, verify, and optimize what the framework does.


1.10 — Progressive Project: M0 — The StreamRec Matrix

This is the first milestone of the progressive project that threads through the entire book. We define the recommendation problem mathematically and analyze the structure of the data.

Problem Statement

StreamRec is a content streaming platform with approximately 5 million users and 200,000 items (articles, videos, podcasts). Each user interacts with a small subset of items. Our goal: predict which items a user would rate highly, given the ratings we have observed.

Mathematically, we have a sparse rating matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ and want to find a low-rank approximation $\mathbf{R} \approx \mathbf{P}\mathbf{Q}^\top$ where $\mathbf{P} \in \mathbb{R}^{m \times k}$ and $\mathbf{Q} \in \mathbb{R}^{n \times k}$.

For this milestone, we work with a representative sample.

def generate_streamrec_data(
    n_users: int = 5000,
    n_items: int = 2000,
    n_latent: int = 10,
    sparsity: float = 0.98,
    noise_std: float = 0.5,
    seed: int = 42,
) -> dict:
    """Generate synthetic StreamRec interaction data.

    The data is generated from a low-rank model with added noise,
    simulating the structure of real recommendation datasets.

    Args:
        n_users: Number of users.
        n_items: Number of items.
        n_latent: True number of latent factors.
        sparsity: Fraction of missing entries.
        noise_std: Standard deviation of observation noise.
        seed: Random seed.

    Returns:
        Dictionary with rating matrix, mask, and ground truth.
    """
    rng = np.random.default_rng(seed)

    # True latent factors
    P_true = rng.standard_normal((n_users, n_latent))
    Q_true = rng.standard_normal((n_items, n_latent))

    # True ratings (low-rank structure)
    R_true = P_true @ Q_true.T

    # Scale to 1-5 rating range
    R_true = 3.0 + 1.5 * (R_true - R_true.mean()) / R_true.std()
    R_true = np.clip(R_true, 1, 5)

    # Observation mask (which entries are observed)
    mask = rng.random((n_users, n_items)) > sparsity

    # Observed ratings with noise
    R_observed = np.where(mask, R_true + noise_std * rng.standard_normal(R_true.shape), 0.0)
    R_observed = np.clip(R_observed, 1, 5) * mask

    return {
        "R_observed": R_observed,
        "R_true": R_true,
        "mask": mask,
        "P_true": P_true,
        "Q_true": Q_true,
        "n_latent_true": n_latent,
    }

# Generate data
data = generate_streamrec_data()
R = data["R_observed"]
mask = data["mask"]

print(f"Rating matrix shape: {R.shape}")
print(f"Observed entries: {mask.sum():,} / {mask.size:,}")
print(f"Sparsity: {1 - mask.sum()/mask.size:.2%}")
print(f"Rating range: [{R[mask].min():.1f}, {R[mask].max():.1f}]")
print(f"Mean rating: {R[mask].mean():.2f}")

Analyzing the Singular Value Spectrum

The singular value spectrum reveals the intrinsic dimensionality — how many latent factors are needed to explain the observed ratings.

def analyze_streamrec_svd(R: np.ndarray, mask: np.ndarray) -> dict:
    """Analyze the SVD structure of the StreamRec rating matrix.

    Args:
        R: Observed rating matrix (n_users, n_items).
        mask: Binary observation mask.

    Returns:
        Analysis dictionary with singular values and metrics.
    """
    # Fill missing with global mean (simple baseline)
    global_mean = R[mask].mean()
    R_filled = np.where(mask, R, global_mean)

    # Center
    R_centered = R_filled - R_filled.mean(axis=0, keepdims=True)

    # Compute SVD (thin)
    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)

    # Analyze spectrum
    total_energy = np.sum(sigma**2)
    cumulative_energy = np.cumsum(sigma**2) / total_energy

    # Find effective rank at various thresholds
    thresholds = [0.50, 0.80, 0.90, 0.95, 0.99]
    ranks_at_threshold = {}
    for t in thresholds:
        k = np.searchsorted(cumulative_energy, t) + 1
        ranks_at_threshold[t] = k

    # Spectral gap analysis
    gaps = sigma[:-1] - sigma[1:]
    largest_gap_idx = np.argmax(gaps[:50])  # Look in top 50

    return {
        "singular_values": sigma,
        "cumulative_energy": cumulative_energy,
        "ranks_at_threshold": ranks_at_threshold,
        "largest_gap_position": largest_gap_idx + 1,
        "condition_number": sigma[0] / sigma[-1] if sigma[-1] > 0 else np.inf,
        "top_10_sv": sigma[:10],
    }

analysis = analyze_streamrec_svd(R, mask)

print("Singular Value Analysis")
print("=" * 50)
print(f"\nTop 10 singular values: {analysis['top_10_sv'].round(1)}")
print(f"\nRanks needed for energy thresholds:")
for t, k in analysis['ranks_at_threshold'].items():
    print(f"  {t*100:.0f}%: rank {k}")
print(f"\nLargest spectral gap at position: {analysis['largest_gap_position']}")
print(f"(Should be near the true latent dimension of 10)")

Reconstruction Quality vs. Rank

def evaluate_reconstruction(
    R: np.ndarray,
    mask: np.ndarray,
    R_true: np.ndarray,
    ranks: list[int],
) -> None:
    """Evaluate reconstruction quality at different ranks.

    Args:
        R: Observed rating matrix.
        mask: Observation mask.
        R_true: True underlying ratings.
        ranks: List of ranks to evaluate.
    """
    global_mean = R[mask].mean()
    R_filled = np.where(mask, R, global_mean)
    R_centered = R_filled - R_filled.mean(axis=0, keepdims=True)

    U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)
    col_means = R_filled.mean(axis=0, keepdims=True)

    print(f"{'Rank':>6} | {'RMSE (observed)':>16} | {'RMSE (held-out)':>16} | {'Energy %':>10}")
    print("-" * 60)

    held_out_mask = ~mask  # Unobserved entries; R_true is fully known for this synthetic data

    for k in ranks:
        # Reconstruct
        R_hat = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :] + col_means
        R_hat = np.clip(R_hat, 1, 5)

        # RMSE on observed
        rmse_obs = np.sqrt(np.mean((R_hat[mask] - R[mask])**2))

        # RMSE on held-out (generalization)
        rmse_held = np.sqrt(np.mean((R_hat[held_out_mask] - R_true[held_out_mask])**2))

        energy = np.sum(sigma[:k]**2) / np.sum(sigma**2) * 100

        print(f"{k:>6} | {rmse_obs:>16.4f} | {rmse_held:>16.4f} | {energy:>9.1f}%")

ranks_to_test = [1, 2, 5, 10, 20, 50, 100, 200]
evaluate_reconstruction(R, mask, data["R_true"], ranks_to_test)
# Look for the rank where held-out RMSE starts increasing — that is overfitting

M0 Summary

At this milestone, you should observe:

  1. The rating matrix is extremely sparse ($\sim$98% missing entries).
  2. The singular value spectrum reveals low-rank structure — the first 10-20 singular values capture the majority of the signal, matching our generative model.
  3. There is a clear overfitting pattern — training RMSE decreases monotonically with rank, but held-out RMSE reaches a minimum around the true latent dimension and increases thereafter.
  4. Simple SVD on the imputed matrix is a reasonable baseline but has fundamental limitations: the imputation step biases the decomposition. In Chapter 2, we will implement proper matrix factorization via SGD, optimizing only over observed entries.

1.11 — Closing Reflection

This chapter covered the linear algebra that powers every model in this book. The concepts are not new — eigendecomposition dates to the 19th century, and SVD was described by Beltrami and Jordan in 1873-1874. What is new is the application: these mathematical structures are the computational substrate of modern machine learning.

Three themes emerged that will recur throughout:

Understanding Why. We did not merely call np.linalg.svd — we derived the decomposition, connected it to eigendecomposition, and implemented it step by step. When a production model produces unexpected results, this depth of understanding is what enables diagnosis. You cannot debug what you do not understand.

Fundamentals > Frontier. The linear algebra in this chapter has not changed in over a century. Transformers, diffusion models, and every architecture that will be invented next year are built on these same operations — matrix multiplications, decompositions, and gradient computations. If you master the vocabulary of this chapter, every new paper becomes readable.

The numpy-PyTorch bridge. Every operation we performed in numpy has an exact analogue in PyTorch, with the addition of automatic differentiation. The gradient expressions we derived by hand in Section 1.7 are precisely what loss.backward() computes. Understanding both sides — the mathematics and the autograd computation — is what separates a practitioner who uses frameworks from one who understands them.

In the next chapter, we build on matrix calculus to derive the backpropagation algorithm, develop the family of gradient descent optimizers, and analyze the loss landscape of the matrix factorization model we began here. The mathematics of this chapter is the foundation; optimization theory is the engine.


Chapter 1 of Advanced Data Science: Deep Learning, Causal Inference, and Production Systems at Scale