Chapter 2: Linear Algebra for AI

"The introduction of numbers as coordinates is an act of violence." --- Hermann Weyl, Philosophy of Mathematics and Natural Science (1949)

Linear algebra is the backbone of modern artificial intelligence. Every image a neural network classifies is a matrix, every word embedding is a vector, every layer of a transformer applies a linear transformation, and every principal component analysis relies on eigendecomposition. If calculus (which we explored in Chapter 1) tells us how models learn, linear algebra tells us what they learn with---the very representation of data and the operations that reshape it.

This chapter builds your fluency in the linear algebra that AI engineers use daily. We begin with vectors and vector spaces---the containers for data---then move to matrices as the engines of transformation. From there, we will climb to eigendecomposition and singular value decomposition (SVD), two factorizations that unlock dimensionality reduction, recommendation systems, and the stability analysis of training algorithms. We close with a preview of matrix calculus, which bridges this chapter to the optimization techniques we will explore in Chapter 3.

Throughout, every concept follows a consistent pattern: intuition first, then the formal definition, then a worked numerical example, and finally a NumPy implementation you can run immediately. By the end, you will not merely understand linear algebra---you will wield it.


2.1 Vectors and Vector Spaces

2.1.1 What Is a Vector?

Intuitively, a vector is an ordered list of numbers. In AI, vectors are the fundamental unit of data representation. A grayscale pixel row is a vector. A word embedding is a vector. The weights of a single neuron form a vector. Anywhere you find a list of numbers carrying meaning, you find a vector.

Formally, a vector $\mathbf{v}$ in $\mathbb{R}^n$ is an element of $n$-dimensional real space:

$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} $$

where:

  • $\mathbf{v}$ is the vector (bold lowercase by convention),
  • $v_i$ is the $i$-th component (italic scalar),
  • $n$ is the dimensionality of the space.

In AI engineering, you will encounter vectors with dimensions ranging from 2 or 3 (for visualization) to hundreds of thousands (for language model vocabularies).

Geometric Interpretation. In two or three dimensions, a vector can be visualized as an arrow from the origin to a point. The arrow has both a direction (where it points) and a magnitude (how long it is). While we lose the ability to draw pictures beyond three dimensions, the algebraic properties remain the same, and developing geometric intuition in low dimensions will serve you well when reasoning about high-dimensional spaces. When a word embedding model places "king" and "queen" near each other in 300-dimensional space, the geometric intuition of "closeness" is exactly the same as two arrows pointing in nearly the same direction in 2D---just scaled up.

Historical Note. The concept of a vector space was formalized by Giuseppe Peano in 1888, building on the earlier work of Hermann Grassmann, but the objects themselves had been used by physicists and engineers for decades before that. Today, vectors are so fundamental to computation that every major programming language and numerical library is built around efficient vector operations.

Worked Example. Consider a feature vector representing a house: square footage, number of bedrooms, and age in years.

$$ \mathbf{x} = \begin{bmatrix} 1500 \\ 3 \\ 20 \end{bmatrix} $$

This vector lives in $\mathbb{R}^3$. Each component carries a distinct meaning, and together they encode the house as a single mathematical object.

import numpy as np

# Feature vector: [square_footage, bedrooms, age_years]
x = np.array([1500, 3, 20])
print(f"Vector x: {x}")
print(f"Dimension: {x.shape[0]}")
print(f"Second component (bedrooms): {x[1]}")

2.1.2 Vector Operations

Vector Addition. Adding two vectors combines their components element-wise. Geometrically, this corresponds to placing one arrow at the tip of the other---the "parallelogram rule." If you walk three blocks east and then two blocks north, the resulting displacement is the vector sum of those two movements. This same intuition extends to combining feature vectors in machine learning: adding a bias vector to the output of a linear layer shifts every data point by the same amount in feature space.

$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix} $$

where:

  • $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ are vectors of the same dimension,
  • addition is performed component-wise.

Scalar Multiplication. Multiplying a vector by a scalar $c$ scales every component:

$$ c\mathbf{v} = \begin{bmatrix} cv_1 \\ cv_2 \\ \vdots \\ cv_n \end{bmatrix} $$

where:

  • $c \in \mathbb{R}$ is a scalar (italic, not bold),
  • the operation stretches ($|c| > 1$), shrinks ($|c| < 1$), or reverses ($c < 0$) the vector.

Worked Example.

$$ \mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} $$

$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}, \quad 3\mathbf{u} = \begin{bmatrix} 3 \\ 6 \\ 9 \end{bmatrix} $$

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(f"u + v = {u + v}")        # [5 7 9]
print(f"3 * u = {3 * u}")        # [3 6 9]
print(f"u - v = {u - v}")        # [-3 -3 -3]
print(f"0.5 * v = {0.5 * v}")    # [2.0 2.5 3.0]

2.1.3 The Dot Product and Norms

The dot product (inner product) is perhaps the single most important operation in AI. It measures how aligned two vectors are and appears in every forward pass of every neural network. When an attention mechanism computes "how relevant is this key to this query?", it uses a dot product. When a recommendation system computes "how well does this user profile match this item profile?", it uses a dot product. Understanding the dot product deeply is non-negotiable for AI engineering.

The dot product of $\mathbf{u}$ and $\mathbf{v}$ is defined as:

$$ \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n $$

where:

  • the result is a scalar, not a vector,
  • $\mathbf{u} \cdot \mathbf{v} > 0$ means the vectors point in roughly the same direction,
  • $\mathbf{u} \cdot \mathbf{v} = 0$ means the vectors are orthogonal (perpendicular),
  • $\mathbf{u} \cdot \mathbf{v} < 0$ means they point in roughly opposite directions.

The geometric interpretation connects the dot product to the angle $\theta$ between two vectors:

$$ \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta $$

where:

  • $\|\mathbf{u}\|$ is the Euclidean norm (length) of $\mathbf{u}$,
  • $\theta$ is the angle between the two vectors.

Norms. The norm of a vector measures its "size." The most common norms in AI are:

  • $L^1$ (Manhattan): $\|\mathbf{v}\|_1 = \sum_i |v_i|$ --- Lasso regularization, sparse models.
  • $L^2$ (Euclidean): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$ --- Ridge regularization, distance metrics.
  • $L^\infty$ (Max): $\|\mathbf{v}\|_\infty = \max_i |v_i|$ --- adversarial robustness bounds.

Worked Example. Compute the dot product and norms:

$$ \mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 4 \\ -5 \\ 6 \end{bmatrix} $$

$$ \mathbf{u} \cdot \mathbf{v} = (1)(4) + (2)(-5) + (3)(6) = 4 - 10 + 18 = 12 $$

$$ \|\mathbf{u}\|_2 = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.742 $$

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, -5, 6])

# Dot product (three equivalent ways)
dot1 = np.dot(u, v)
dot2 = u @ v
dot3 = np.sum(u * v)
print(f"Dot product: {dot1}")  # 12

# Norms
l2_norm = np.linalg.norm(u)          # Euclidean (default)
l1_norm = np.linalg.norm(u, ord=1)   # Manhattan
linf_norm = np.linalg.norm(u, ord=np.inf)  # Max norm
print(f"L2 norm: {l2_norm:.4f}")     # 3.7417
print(f"L1 norm: {l1_norm:.4f}")     # 6.0000
print(f"L-inf norm: {linf_norm:.4f}")  # 3.0000

# Cosine similarity
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"Cosine similarity: {cos_sim:.4f}")  # 0.3653

Cosine Similarity in Practice. Cosine similarity deserves special attention because of its outsized role in modern AI. By normalizing vectors to unit length, cosine similarity measures only the angle between them, ignoring magnitude. This is exactly what you want when comparing word embeddings (a frequent word and a rare word may have very different magnitudes but encode the same concept), document representations, or image features.

$$ \cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} $$

where:

  • the result lies in $[-1, 1]$,
  • $\cos = 1$ means identical direction, $\cos = 0$ means orthogonal, $\cos = -1$ means opposite directions.

Worked Example: Similarity Search. Suppose we have three word embeddings and want to find which pair is most similar:

$$ \mathbf{w}_{\text{cat}} = \begin{bmatrix} 0.7 \\ 0.5 \\ 0.1 \end{bmatrix}, \quad \mathbf{w}_{\text{dog}} = \begin{bmatrix} 0.6 \\ 0.6 \\ 0.1 \end{bmatrix}, \quad \mathbf{w}_{\text{car}} = \begin{bmatrix} 0.1 \\ 0.1 \\ 0.9 \end{bmatrix} $$

import numpy as np

w_cat = np.array([0.7, 0.5, 0.1])
w_dog = np.array([0.6, 0.6, 0.1])
w_car = np.array([0.1, 0.1, 0.9])

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cat-dog similarity: {cosine_sim(w_cat, w_dog):.4f}")  # ~0.9830
print(f"cat-car similarity: {cosine_sim(w_cat, w_car):.4f}")  # ~0.2794
print(f"dog-car similarity: {cosine_sim(w_dog, w_car):.4f}")  # ~0.2556
# As expected, cat and dog are most similar

This pattern---represent objects as vectors, compute similarity via dot products---is the backbone of retrieval-augmented generation (RAG) systems, semantic search engines, and the attention mechanism in transformers, which we will study in Chapter 19.

2.1.4 Vector Spaces, Span, and Linear Independence

A vector space is a set of vectors closed under addition and scalar multiplication. The intuition is that any combination of vectors in the space produces another vector still in the space.

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ is:

$$ \mathbf{w} = c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k $$

where:

  • $c_1, c_2, \ldots, c_k$ are scalars (the coefficients),
  • $\mathbf{w}$ is the resulting vector.

The span of a set of vectors is the set of all possible linear combinations. A set of vectors is linearly independent if no vector in the set can be written as a linear combination of the others. Linearly independent vectors that span an entire space form a basis for that space.

Why does this matter for AI? The rank of a data matrix tells you the true dimensionality of your data. If your 100-dimensional feature vectors actually lie in a 10-dimensional subspace, you can reduce dimensionality without losing information---a fact we will exploit with PCA and SVD later in this chapter.

Worked Example. Check whether three vectors are linearly independent:

$$ \mathbf{v}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \mathbf{v}_3 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} $$

We see that $\mathbf{v}_3 = \mathbf{v}_1 + \mathbf{v}_2$, so these vectors are linearly dependent. They span a 2D plane within $\mathbb{R}^3$, not all of $\mathbb{R}^3$.

import numpy as np

v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([1, 1, 0])

# Stack as columns of a matrix and compute rank
A = np.column_stack([v1, v2, v3])
rank = np.linalg.matrix_rank(A)
print(f"Matrix:\n{A}")
print(f"Rank: {rank}")  # 2, not 3 -> linearly dependent

# Verify: v3 = v1 + v2
print(f"v1 + v2 = {v1 + v2}")       # [1 1 0] = v3
print(f"Linearly independent: {rank == 3}")  # False

2.1.5 Orthogonality and Projections

Two vectors are orthogonal if their dot product is zero. Orthogonality is the generalization of perpendicularity to higher dimensions, and it is fundamental to AI: orthogonal features carry independent information, orthogonal weight matrices preserve norms during forward passes, and the Gram-Schmidt process lets us build orthonormal bases.

The projection of $\mathbf{u}$ onto $\mathbf{v}$ answers the question: "How much of $\mathbf{u}$ lies in the direction of $\mathbf{v}$?"

$$ \text{proj}_{\mathbf{v}} \mathbf{u} = \frac{\mathbf{u} \cdot \mathbf{v}}{\mathbf{v} \cdot \mathbf{v}} \mathbf{v} $$

where:

  • $\frac{\mathbf{u} \cdot \mathbf{v}}{\mathbf{v} \cdot \mathbf{v}}$ is the scalar projection coefficient,
  • the result is a vector in the direction of $\mathbf{v}$.

Worked Example. Project $\mathbf{u} = [3, 4]^T$ onto $\mathbf{v} = [1, 0]^T$ (the $x$-axis):

$$ \text{proj}_{\mathbf{v}} \mathbf{u} = \frac{3 \cdot 1 + 4 \cdot 0}{1 \cdot 1 + 0 \cdot 0} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix} $$

import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

# Projection of u onto v
proj = (np.dot(u, v) / np.dot(v, v)) * v
print(f"Projection: {proj}")  # [3. 0.]

# The residual is orthogonal to v
residual = u - proj
print(f"Residual: {residual}")  # [0. 4.]
print(f"Residual dot v: {np.dot(residual, v)}")  # 0.0 (orthogonal)

Projection onto a Subspace. When projecting onto a subspace spanned by multiple vectors (organized as columns of a matrix $\mathbf{A}$), the projection formula generalizes to:

$$ \text{proj}_{\mathbf{A}} \mathbf{b} = \mathbf{A}(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{b} $$

where:

  • $\mathbf{A}$ is the matrix whose columns span the target subspace,
  • $(\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ is the pseudoinverse (for full-column-rank $\mathbf{A}$).

This is precisely the formula behind ordinary least squares regression. When you fit a linear model $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$, you are projecting the target vector $\mathbf{y}$ onto the column space of the feature matrix $\mathbf{X}$. The residuals $\mathbf{y} - \hat{\mathbf{y}}$ are orthogonal to the column space---a direct consequence of the projection theorem that we will use in Chapter 6 when we derive linear regression.
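To make this concrete, here is a minimal sketch with a small made-up design matrix: the projection formula and NumPy's least-squares solver produce the same fitted values, and the residual is orthogonal to the column space.

import numpy as np

# Toy design matrix: its two columns span a 2D subspace of R^3
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Projection of b onto the column space of A: A (A^T A)^{-1} A^T b
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_hat = P @ b
print(f"Projected b: {b_hat}")

# Same fitted values as a least-squares fit y_hat = A w
w, *_ = np.linalg.lstsq(A, b, rcond=None)
print(f"Least-squares fit: {A @ w}")

# The residual is orthogonal to every column of A
residual = b - b_hat
print(f"A^T @ residual: {A.T @ residual}")  # ~[0. 0.]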

The Gram-Schmidt Process. Given a set of linearly independent vectors, the Gram-Schmidt process produces an orthonormal basis by iteratively projecting out components along previously computed basis vectors. This is the algorithm behind QR decomposition, which we will encounter when discussing numerical stability.

import numpy as np

def gram_schmidt(V):
    """Orthogonalize columns of V using the Gram-Schmidt process."""
    n, k = V.shape
    Q = np.zeros_like(V, dtype=float)
    for j in range(k):
        q = V[:, j].astype(float)
        for i in range(j):
            q -= np.dot(Q[:, i], V[:, j]) * Q[:, i]
        Q[:, j] = q / np.linalg.norm(q)
    return Q

V = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q = gram_schmidt(V)
print(f"Orthonormal basis:\n{Q}")
print(f"Q^T @ Q (should be identity):\n{Q.T @ Q}")

2.2 Matrices and Matrix Operations

2.2.1 Matrices as Data and Transformations

A matrix serves two roles in AI. First, it is a container for data: each row might be a data sample, each column a feature. Second, it is a function: multiplying a vector by a matrix transforms that vector into a new one. Every linear layer in a neural network is a matrix multiplication.

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns:

$$ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} $$

where:

  • $\mathbf{A}$ is bold uppercase (matrix convention),
  • $a_{ij}$ is the element in row $i$ and column $j$ (italic scalar),
  • $m \times n$ is the shape of the matrix.
import numpy as np

# A 3x4 matrix: 3 samples, 4 features
A = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])
print(f"Shape: {A.shape}")        # (3, 4)
print(f"Element A[1,2]: {A[1, 2]}")  # 7 (row 1, col 2, zero-indexed)
print(f"Row 0: {A[0, :]}")        # [1 2 3 4]
print(f"Column 2: {A[:, 2]}")     # [3 7 11]

2.2.2 Matrix Arithmetic

Matrix Addition. Like vector addition, matrix addition is element-wise and requires matching dimensions:

$$ (\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij} $$

Matrix Multiplication. The product $\mathbf{C} = \mathbf{A}\mathbf{B}$ is defined when the number of columns of $\mathbf{A}$ equals the number of rows of $\mathbf{B}$. If $\mathbf{A} \in \mathbb{R}^{m \times p}$ and $\mathbf{B} \in \mathbb{R}^{p \times n}$, then $\mathbf{C} \in \mathbb{R}^{m \times n}$:

$$ c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj} $$

where:

  • each element $c_{ij}$ is the dot product of row $i$ of $\mathbf{A}$ and column $j$ of $\mathbf{B}$,
  • the inner dimensions ($p$) must match,
  • matrix multiplication is not commutative: $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$ in general.

Worked Example.

$$ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} $$

$$ \mathbf{A}\mathbf{B} = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} $$

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication (three equivalent ways)
C1 = A @ B
C2 = np.matmul(A, B)
C3 = np.dot(A, B)
print(f"A @ B =\n{C1}")
# [[19 22]
#  [43 50]]

# Verify non-commutativity
print(f"B @ A =\n{B @ A}")
# [[23 34]
#  [31 46]]
print(f"AB == BA? {np.array_equal(A @ B, B @ A)}")  # False

Matrix Multiplication as a Neural Network Layer. Understanding matrix multiplication as the core operation of a neural network layer is essential. In a fully connected (dense) layer, the forward pass computes $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, where $\mathbf{W} \in \mathbb{R}^{m \times n}$ is the weight matrix, $\mathbf{x} \in \mathbb{R}^n$ is the input, and $\mathbf{b} \in \mathbb{R}^m$ is the bias. Each element of $\mathbf{y}$ is a dot product of a row of $\mathbf{W}$ with $\mathbf{x}$---geometrically, measuring how aligned the input is with each "detector" stored in the weight matrix.

Worked Example: Batch Matrix Multiplication. In practice, we process multiple data points simultaneously. If $\mathbf{X} \in \mathbb{R}^{b \times n}$ is a batch of $b$ input vectors (each row is one sample), the forward pass for the batch is:

$$ \mathbf{Y} = \mathbf{X}\mathbf{W}^T + \mathbf{1}\mathbf{b}^T $$

where:

  • $\mathbf{Y} \in \mathbb{R}^{b \times m}$ contains the outputs for all samples,
  • the matrix multiply handles all samples in parallel.
import numpy as np

# Simulate a dense layer with batch processing
batch_size, input_dim, output_dim = 32, 784, 128
X = np.random.randn(batch_size, input_dim)   # Batch of inputs
W = np.random.randn(output_dim, input_dim)   # Weight matrix
b = np.random.randn(output_dim)              # Bias vector

# Forward pass for the entire batch
Y = X @ W.T + b  # Broadcasting adds b to each row
print(f"Input batch shape: {X.shape}")    # (32, 784)
print(f"Weight matrix shape: {W.shape}")  # (128, 784)
print(f"Output batch shape: {Y.shape}")   # (32, 128)

The Hadamard (Element-wise) Product. While matrix multiplication is the workhorse of neural networks, the element-wise product $\mathbf{A} \odot \mathbf{B}$ (where $(A \odot B)_{ij} = a_{ij} b_{ij}$) also plays important roles. It appears in gating mechanisms (LSTMs, GRUs), attention masking, and dropout. Be careful not to confuse A * B (element-wise in NumPy) with A @ B (matrix multiplication).
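A quick contrast between the two products on small matrices, plus a toy masking example:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(f"Hadamard product A * B:\n{A * B}")   # [[ 5 12], [21 32]]
print(f"Matrix product A @ B:\n{A @ B}")     # [[19 22], [43 50]]

# A toy gating/masking use: a binary mask zeroes out selected entries
mask = np.array([[1, 0], [0, 1]])
print(f"Masked A:\n{A * mask}")              # [[1 0], [0 4]]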

2.2.3 The Transpose

The transpose of $\mathbf{A}$, denoted $\mathbf{A}^T$, flips the matrix over its diagonal, swapping rows and columns:

$$ (\mathbf{A}^T)_{ij} = a_{ji} $$

Key properties:

  • $(\mathbf{A}^T)^T = \mathbf{A}$
  • $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$ (note the reversal of order)
  • A symmetric matrix satisfies $\mathbf{A} = \mathbf{A}^T$

Symmetric matrices arise naturally in AI: covariance matrices, kernel matrices, and Hessians are all symmetric.

import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
print(f"A shape: {A.shape}")         # (2, 3)
print(f"A^T shape: {A.T.shape}")     # (3, 2)
print(f"A^T =\n{A.T}")

# Symmetric matrix: covariance matrix is always symmetric
data = np.random.randn(100, 3)
cov = np.cov(data, rowvar=False)
print(f"Covariance matrix symmetric: {np.allclose(cov, cov.T)}")  # True

2.2.4 Special Matrices

Several special matrices appear repeatedly in AI:

  • Identity ($\mathbf{I}$): ones on the diagonal, zeros elsewhere --- initialization, skip connections.
  • Diagonal: non-zero entries only on the diagonal --- scaling, variance matrices.
  • Orthogonal: $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$ --- preserves norms; rotation matrices.
  • Positive definite: $\mathbf{x}^T\mathbf{A}\mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ --- covariance matrices, convex optimization.

import numpy as np

# Identity matrix
I = np.eye(3)
print(f"Identity:\n{I}")

# Diagonal matrix
D = np.diag([2, 3, 5])
print(f"Diagonal:\n{D}")

# Verify: A @ I = A for any A
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"A @ I == A: {np.allclose(A @ I, A)}")  # True

2.2.5 Sparse Matrices

In many AI applications, matrices are overwhelmingly filled with zeros. A sparse matrix stores only the non-zero elements, dramatically reducing both memory and computation. Sparse matrices appear throughout AI engineering:

  • Natural language processing: A bag-of-words representation of a document over a vocabulary of 100,000 words might contain only 200 non-zero entries. Storing it as a dense matrix wastes 99.8% of the memory.
  • Graph neural networks: An adjacency matrix for a social network with millions of nodes but only thousands of connections per node is extremely sparse.
  • Recommender systems: A user-item interaction matrix is typically more than 99% empty.
  • One-hot encodings: Each vector has exactly one non-zero entry.

SciPy provides several sparse matrix formats, each optimized for different access patterns:

  • CSR (Compressed Sparse Row): best for row slicing and matrix-vector products.
  • CSC (Compressed Sparse Column): best for column slicing and solving linear systems.
  • COO (Coordinate): best for incremental construction.
  • LIL (List of Lists): best for row-by-row construction.

import numpy as np
from scipy import sparse

# Create a sparse matrix (99% zeros)
np.random.seed(42)
dense = np.random.rand(1000, 1000)
dense[dense < 0.99] = 0  # Keep only ~1% of entries

# Convert to CSR format
sparse_matrix = sparse.csr_matrix(dense)

print(f"Dense memory: {dense.nbytes / 1024:.1f} KB")
print(f"Sparse memory: {(sparse_matrix.data.nbytes + sparse_matrix.indices.nbytes + sparse_matrix.indptr.nbytes) / 1024:.1f} KB")
print(f"Non-zero entries: {sparse_matrix.nnz}")
print(f"Sparsity: {1 - sparse_matrix.nnz / (1000 * 1000):.4f}")

# Sparse matrix-vector multiplication is much faster
x = np.random.randn(1000)
result_sparse = sparse_matrix @ x  # Uses only non-zero entries
result_dense = dense @ x
print(f"Results match: {np.allclose(result_sparse, result_dense)}")

Practical tip: When building ML pipelines, always ask yourself whether your matrices are sparse. Using dense operations on sparse data is one of the most common performance pitfalls in AI engineering. Scikit-learn, PyTorch, and TensorFlow all support sparse tensor operations.

2.2.6 The Inverse and Solving Linear Systems

The inverse of a square matrix $\mathbf{A}$, denoted $\mathbf{A}^{-1}$, satisfies:

$$ \mathbf{A} \mathbf{A}^{-1} = \mathbf{A}^{-1} \mathbf{A} = \mathbf{I} $$

where:

  • $\mathbf{I}$ is the identity matrix,
  • $\mathbf{A}$ must be square and non-singular (its determinant is non-zero).

In AI, we rarely compute inverses explicitly because it is numerically unstable and computationally expensive ($O(n^3)$). Instead, we solve the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ directly.

Worked Example. Solve the system:

$$ \begin{bmatrix} 2 & 1 \\ 5 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 4 \\ 7 \end{bmatrix} $$

The analytical solution is $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$:

$$ \mathbf{A}^{-1} = \frac{1}{2 \cdot 3 - 1 \cdot 5} \begin{bmatrix} 3 & -1 \\ -5 & 2 \end{bmatrix} = \begin{bmatrix} 3 & -1 \\ -5 & 2 \end{bmatrix} $$

$$ \mathbf{x} = \begin{bmatrix} 3 & -1 \\ -5 & 2 \end{bmatrix} \begin{bmatrix} 4 \\ 7 \end{bmatrix} = \begin{bmatrix} 5 \\ -6 \end{bmatrix} $$

import numpy as np

A = np.array([[2, 1], [5, 3]], dtype=float)
b = np.array([4, 7], dtype=float)

# Method 1: Compute inverse (avoid in practice)
A_inv = np.linalg.inv(A)
x_inv = A_inv @ b
print(f"Solution via inverse: {x_inv}")  # [ 5. -6.]

# Method 2: Solve directly (preferred, more stable)
x_solve = np.linalg.solve(A, b)
print(f"Solution via solve: {x_solve}")  # [ 5. -6.]

# Verify
print(f"A @ x = {A @ x_solve}")  # [4. 7.] = b
print(f"Determinant: {np.linalg.det(A):.1f}")  # 1.0

2.2.7 The Determinant and Trace

The determinant of a square matrix is a scalar that encodes whether the matrix is invertible and how it scales volumes. For a $2 \times 2$ matrix:

$$ \det(\mathbf{A}) = \det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc $$

where:

  • $\det(\mathbf{A}) = 0$ means the matrix is singular (non-invertible),
  • $|\det(\mathbf{A})|$ measures the volume scaling factor of the transformation.

The trace is the sum of diagonal elements:

$$ \text{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii} $$

The trace equals the sum of eigenvalues, a fact we will use in Section 2.4.

import numpy as np

A = np.array([[2, 1], [5, 3]])
print(f"Determinant: {np.linalg.det(A):.1f}")  # 1.0
print(f"Trace: {np.trace(A)}")  # 5

# Singular matrix (det = 0)
B = np.array([[1, 2], [2, 4]])
print(f"Det of singular matrix: {np.linalg.det(B):.1f}")  # 0.0

2.3 Linear Transformations

2.3.1 Matrices as Functions

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ defines a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$ via $T(\mathbf{x}) = \mathbf{A}\mathbf{x}$. The linearity property means:

$$ T(c_1 \mathbf{x}_1 + c_2 \mathbf{x}_2) = c_1 T(\mathbf{x}_1) + c_2 T(\mathbf{x}_2) $$

where:

  • the transformation preserves addition and scalar multiplication,
  • every linear transformation can be represented as a matrix, and every matrix defines one.

This is why neural networks insert non-linear activation functions between layers. Without them, stacking multiple linear layers would collapse into a single linear transformation $\mathbf{A}_k \cdots \mathbf{A}_2 \mathbf{A}_1 = \mathbf{A}_{\text{combined}}$, gaining no representational power.
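A tiny numerical illustration of this collapse, with random matrices standing in for layer weights:

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)   # First "layer" (no activation)
W2 = np.random.randn(2, 4)   # Second "layer" (no activation)
x = np.random.randn(3)

# Two stacked linear layers...
y_stacked = W2 @ (W1 @ x)
# ...collapse into a single linear layer
W_combined = W2 @ W1
y_single = W_combined @ x
print(f"Outputs match: {np.allclose(y_stacked, y_single)}")  # True

# Inserting a non-linearity (ReLU) between the layers breaks the collapse
y_relu = W2 @ np.maximum(0, W1 @ x)
print(f"Matches with ReLU in between: {np.allclose(y_relu, y_single)}")  # False (in general)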

2.3.2 Geometric Interpretation

Different matrices produce different geometric transformations:

  • Diagonal (scaling): $\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$ stretches $x$ by 2 and $y$ by 3.
  • Rotation (by angle $\theta$): $\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$.
  • Shear (skewing): $\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$.
  • Projection (dimension reduction): $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$ drops $z$.

Worked Example. Rotate the vector $\mathbf{v} = [1, 0]^T$ by $90$ degrees:

$$ \mathbf{R} = \begin{bmatrix} \cos 90° & -\sin 90° \\ \sin 90° & \cos 90° \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} $$

$$ \mathbf{R}\mathbf{v} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} $$

The vector $[1, 0]^T$ pointing along the $x$-axis is rotated to $[0, 1]^T$ pointing along the $y$-axis, exactly as expected for a $90$-degree counterclockwise rotation.

import numpy as np

theta = np.radians(90)
R = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)]
])

v = np.array([1, 0])
rotated = R @ v
print(f"Rotated vector: {rotated}")  # [0. 1.] (approximately)

# Apply multiple transformations: scale then rotate
S = np.array([[2, 0], [0, 3]])  # Scale x by 2, y by 3
combined = R @ S  # First scale, then rotate
result = combined @ v
print(f"Scale then rotate: {result}")  # [0. 2.]

2.3.3 The Column Space and Null Space

The column space (range) of $\mathbf{A}$ is the set of all vectors that can be expressed as $\mathbf{A}\mathbf{x}$ for some $\mathbf{x}$. It tells you what outputs the transformation can produce.

The null space (kernel) of $\mathbf{A}$ is the set of all vectors $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$. These are the inputs that the transformation annihilates---they are "invisible" to the matrix.

The rank-nullity theorem connects them:

$$ \text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n $$

where:

  • $\text{rank}(\mathbf{A})$ is the dimension of the column space,
  • $\text{nullity}(\mathbf{A})$ is the dimension of the null space,
  • $n$ is the number of columns.

In AI, a large null space means your model has many "directions" in parameter space that produce the same output---a concept related to overparameterization.

import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

rank = np.linalg.matrix_rank(A)
print(f"Rank: {rank}")  # 2

# Null space
ns = null_space(A)
print(f"Null space dimension: {ns.shape[1]}")  # 1
print(f"Null space basis:\n{ns}")

# Verify: A @ null_vector should be zero
if ns.shape[1] > 0:
    print(f"A @ null_vector = {A @ ns[:, 0]}")  # ~[0, 0, 0]

2.4 Eigendecomposition

2.4.1 Eigenvalues and Eigenvectors: The Concept

Eigendecomposition reveals the intrinsic structure of a linear transformation. Most vectors change both magnitude and direction when multiplied by a matrix, but eigenvectors are special: they only get scaled.

An eigenvector $\mathbf{v}$ of matrix $\mathbf{A}$ satisfies:

$$ \mathbf{A}\mathbf{v} = \lambda \mathbf{v} $$

where:

  • $\mathbf{v} \neq \mathbf{0}$ is an eigenvector (the direction that is preserved),
  • $\lambda$ is the corresponding eigenvalue (the scaling factor),
  • if $\lambda > 1$, the eigenvector is stretched; if $0 < \lambda < 1$, it is compressed; if $\lambda < 0$, it is flipped.

The intuition: eigenvectors are the "natural axes" of a transformation. Along these axes, the matrix behaves like simple scaling. This simplification is the key to understanding everything from PCA to the stability of dynamical systems.

2.4.2 Computing Eigendecomposition

To find eigenvalues, we solve the characteristic equation:

$$ \det(\mathbf{A} - \lambda \mathbf{I}) = 0 $$

where:

  • $\mathbf{I}$ is the identity matrix,
  • the determinant yields a polynomial of degree $n$ in $\lambda$,
  • the roots of this polynomial are the eigenvalues.

For a $2 \times 2$ matrix $\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$:

$$ \lambda^2 - (a + d)\lambda + (ad - bc) = 0 $$

$$ \lambda = \frac{(a + d) \pm \sqrt{(a + d)^2 - 4(ad - bc)}}{2} $$

Worked Example. Find the eigenvalues and eigenvectors of:

$$ \mathbf{A} = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} $$

Characteristic equation: $\lambda^2 - 7\lambda + 10 = 0$, giving $\lambda_1 = 5$, $\lambda_2 = 2$.

For $\lambda_1 = 5$: $(\mathbf{A} - 5\mathbf{I})\mathbf{v} = \mathbf{0}$ yields $\mathbf{v}_1 = [2, 1]^T$ (up to scaling).

For $\lambda_2 = 2$: $(\mathbf{A} - 2\mathbf{I})\mathbf{v} = \mathbf{0}$ yields $\mathbf{v}_2 = [-1, 1]^T$ (up to scaling).

import numpy as np

A = np.array([[4, 2], [1, 3]], dtype=float)

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")   # [5. 2.]
print(f"Eigenvectors (columns):\n{eigenvectors}")

# Verify: A @ v = lambda * v for each eigenpair
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    v = eigenvectors[:, i]
    Av = A @ v
    lam_v = lam * v
    print(f"lambda={lam:.1f}: A@v = {Av}, lambda*v = {lam_v}, "
          f"match: {np.allclose(Av, lam_v)}")

2.4.3 The Eigendecomposition (Diagonalization)

If $\mathbf{A}$ has $n$ linearly independent eigenvectors, it can be decomposed as:

$$ \mathbf{A} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{-1} $$

where:

  • $\mathbf{V}$ is the matrix whose columns are eigenvectors,
  • $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues,
  • this factorization separates the "directions" ($\mathbf{V}$) from the "magnitudes" ($\boldsymbol{\Lambda}$).

For symmetric matrices (which include covariance matrices, kernel matrices, and Hessians), the eigendecomposition has a particularly nice form:

$$ \mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T $$

where:

  • $\mathbf{Q}$ is an orthogonal matrix ($\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$),
  • all eigenvalues are real,
  • eigenvectors are mutually orthogonal.

This is called the spectral theorem, and it guarantees that symmetric matrices have a clean, well-behaved decomposition.

import numpy as np

A = np.array([[4, 2], [1, 3]], dtype=float)
eigenvalues, V = np.linalg.eig(A)
Lambda = np.diag(eigenvalues)

# Reconstruct A from eigendecomposition
A_reconstructed = V @ Lambda @ np.linalg.inv(V)
print(f"Original A:\n{A}")
print(f"Reconstructed A:\n{A_reconstructed}")
print(f"Match: {np.allclose(A, A_reconstructed)}")  # True

# Symmetric matrix example
S = np.array([[5, 2], [2, 3]], dtype=float)
eigenvalues_s, Q = np.linalg.eigh(S)  # Use eigh for symmetric matrices
print(f"\nSymmetric eigenvalues: {eigenvalues_s}")
print(f"Q^T @ Q = I: {np.allclose(Q.T @ Q, np.eye(2))}")  # True

2.4.4 Eigenvalues in AI

Eigenvalues appear throughout AI engineering:

  1. PCA: Eigenvalues of the covariance matrix tell us the variance explained by each principal component. We will build PCA from scratch in Case Study 2.

  2. Spectral clustering: Eigenvalues of the graph Laplacian reveal cluster structure.

  3. Training stability: The eigenvalues of the Hessian matrix determine whether a critical point is a minimum (all positive), maximum (all negative), or saddle point (mixed signs). We will explore this connection in Chapter 3.

  4. Markov chains: The dominant eigenvector of a transition matrix gives the stationary distribution---a key concept behind PageRank.

  5. Condition number: The ratio of the largest to smallest eigenvalue, $\kappa(\mathbf{A}) = |\lambda_{\max}| / |\lambda_{\min}|$, measures how sensitive a system is to numerical perturbations. Large condition numbers cause training instabilities.

import numpy as np

# Condition number example
A = np.array([[1, 1], [1, 1.0001]])
print(f"Condition number: {np.linalg.cond(A):.0f}")  # Very large!

# Well-conditioned matrix
B = np.array([[2, 0], [0, 3]])
print(f"Condition number: {np.linalg.cond(B):.1f}")  # 1.5

# Eigenvalues of a covariance matrix for PCA preview
np.random.seed(42)
data = np.random.randn(1000, 5)
cov_matrix = np.cov(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov_matrix)
variance_explained = eigenvalues / eigenvalues.sum()
print(f"Variance explained: {variance_explained}")

2.5 Singular Value Decomposition (SVD)

2.5.1 Motivation: Beyond Square Matrices

Eigendecomposition requires a square matrix, but real-world data matrices are almost always rectangular: rows are samples, columns are features, and there is no reason they should be equal. The Singular Value Decomposition generalizes eigendecomposition to any matrix, making it one of the most versatile tools in all of applied mathematics.

SVD answers the question: "What are the fundamental building blocks of this matrix?" It decomposes any matrix into a rotation, a scaling, and another rotation.

2.5.2 The SVD Factorization

Any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$ \mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T $$

where:

  • $\mathbf{U} \in \mathbb{R}^{m \times m}$ is an orthogonal matrix of left singular vectors (column space basis),
  • $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ (arranged in decreasing order),
  • $\mathbf{V} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix of right singular vectors (row space basis),
  • $r = \text{rank}(\mathbf{A})$ is the number of non-zero singular values.

The singular values are always non-negative, unlike eigenvalues which can be negative or complex. They are related to eigenvalues by: the singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$.

Worked Example. Compute the SVD of:

$$ \mathbf{A} = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{bmatrix} $$

import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)

U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

print(f"U (left singular vectors):\n{U}")
print(f"Singular values: {sigma}")
print(f"V^T (right singular vectors):\n{Vt}")

# Reconstruct A
Sigma_full = np.zeros_like(A, dtype=float)
np.fill_diagonal(Sigma_full, sigma)
A_reconstructed = U @ Sigma_full @ Vt
print(f"\nReconstructed A:\n{A_reconstructed}")
print(f"Match: {np.allclose(A, A_reconstructed)}")  # True

2.5.3 The Truncated SVD and Low-Rank Approximation

The power of SVD lies in the Eckart-Young theorem: the best rank-$k$ approximation of $\mathbf{A}$ (in both Frobenius and spectral norms) is obtained by keeping only the $k$ largest singular values:

$$ \mathbf{A}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T $$

where:

  • $\mathbf{U}_k$ contains only the first $k$ columns of $\mathbf{U}$,
  • $\boldsymbol{\Sigma}_k$ contains only the $k$ largest singular values,
  • $\mathbf{V}_k^T$ contains only the first $k$ rows of $\mathbf{V}^T$,
  • the approximation error is $\|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$.

This is the mathematical foundation of dimensionality reduction. When most singular values are small, you can discard them with minimal information loss, compressing the matrix dramatically. We will apply this to image compression in Case Study 1.

import numpy as np

# Create a low-rank matrix (rank 2) with noise
np.random.seed(42)
true_A = np.outer(np.random.randn(50), np.random.randn(30))
true_A += np.outer(np.random.randn(50), np.random.randn(30))
noisy_A = true_A + 0.1 * np.random.randn(50, 30)

# Full SVD
U, sigma, Vt = np.linalg.svd(noisy_A, full_matrices=False)

# Truncated SVD (keep top k=2 components)
k = 2
A_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Compute reconstruction error
error = np.linalg.norm(noisy_A - A_approx, 'fro')
total = np.linalg.norm(noisy_A, 'fro')
print(f"Reconstruction error: {error:.4f}")
print(f"Relative error: {error/total:.4f}")
print(f"Top 5 singular values: {sigma[:5].round(2)}")
print(f"Compression ratio: {k * (50 + 30) / (50 * 30):.2%}")

2.5.4 SVD Applications in AI

SVD is ubiquitous in AI and data science:

  1. Image compression: Keep top-$k$ singular values to compress images (Case Study 1).

  2. Latent Semantic Analysis (LSA): SVD on the term-document matrix reveals latent topics in text corpora.

  3. Recommender systems: Decompose user-item rating matrices to predict missing ratings.

  4. Pseudoinverse: The Moore-Penrose pseudoinverse $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{U}^T$ provides least-squares solutions to overdetermined systems.

  5. Whitening: Transform data to have identity covariance, used as a preprocessing step.

  6. Numerical stability: SVD-based methods are more numerically stable than eigendecomposition for non-symmetric matrices.

import numpy as np

# Pseudoinverse via SVD
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
b = np.array([1, 2, 3], dtype=float)

# Overdetermined system (more equations than unknowns)
# Least-squares solution
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
x_pinv = np.linalg.pinv(A) @ b
print(f"Least-squares solution: {x_lstsq}")
print(f"Pseudoinverse solution: {x_pinv}")
print(f"Match: {np.allclose(x_lstsq, x_pinv)}")  # True

# Verify using SVD manually
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
sigma_inv = np.diag(1.0 / sigma)
x_svd = Vt.T @ sigma_inv @ U.T @ b
print(f"SVD manual solution: {x_svd}")

2.5.5 SVD for Recommender Systems

One of the most celebrated applications of SVD in AI is collaborative filtering for recommender systems. The idea is elegantly simple: if we arrange user ratings in a matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ (with $m$ users and $n$ items), then a low-rank approximation of $\mathbf{R}$ reveals latent factors that explain user preferences.

The intuition is that users do not make independent decisions about each item. Instead, a small number of underlying factors---perhaps "likes action movies," "prefers short content," "enjoys comedy"---drive their ratings. SVD discovers these factors automatically.

$$ \mathbf{R} \approx \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T $$

where:

  • each row of $\mathbf{U}_k \boldsymbol{\Sigma}_k^{1/2}$ gives a user's coordinates in the latent factor space,
  • each row of $\mathbf{V}_k \boldsymbol{\Sigma}_k^{1/2}$ gives an item's coordinates in the same space,
  • predicting a missing rating reduces to a dot product in the latent space.

Worked Example: Movie Recommendations.

import numpy as np

# Ratings matrix: 5 users x 4 movies (0 = unrated)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

# Fill missing values with row means for SVD
row_means = np.true_divide(R.sum(axis=1), (R != 0).sum(axis=1))
R_filled = R.copy()
for i in range(R.shape[0]):
    R_filled[i, R[i] == 0] = row_means[i]

# Center the ratings
R_centered = R_filled - row_means[:, np.newaxis]

# Compute SVD
U, sigma, Vt = np.linalg.svd(R_centered, full_matrices=False)

# Keep top k=2 latent factors
k = 2
R_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :] + row_means[:, np.newaxis]

print("Original ratings (0 = missing):")
print(R)
print(f"\nPredicted ratings (k={k} factors):")
print(R_approx.round(1))
print(f"\nPredicted rating for User 0, Movie 2: {R_approx[0, 2]:.1f}")

In production systems like the Netflix Prize winner, this basic idea is extended with regularization, implicit feedback, and temporal dynamics. We will revisit recommender systems as a full case study in Chapter 8.

2.5.6 SVD and Principal Component Analysis (PCA)

PCA is the most widely used dimensionality reduction technique in machine learning, and it is a direct application of SVD. Given a data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features, centered so each column has zero mean), PCA finds the directions of maximum variance in the data.

The connection to SVD is immediate. If we compute the SVD of the centered data matrix:

$$ \mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T $$

then:

  • the columns of $\mathbf{V}$ are the principal components (the directions of maximum variance),
  • the singular values in $\boldsymbol{\Sigma}$ determine the variance along each direction: variance = $\sigma_i^2 / (n-1)$,
  • the projected data (coordinates in the principal component space) is $\mathbf{Z} = \mathbf{U} \boldsymbol{\Sigma}$ (or equivalently $\mathbf{X}\mathbf{V}$).

The fraction of total variance explained by the top $k$ components is:

$$ \text{explained variance ratio} = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2} $$

Worked Example: PCA from Scratch.

import numpy as np

# Generate correlated 3D data
np.random.seed(42)
n_samples = 200
t = np.random.randn(n_samples)
X = np.column_stack([
    2 * t + 0.5 * np.random.randn(n_samples),
    t + 0.3 * np.random.randn(n_samples),
    0.5 * np.random.randn(n_samples)  # Nearly independent noise
])

# Step 1: Center the data
X_centered = X - X.mean(axis=0)

# Step 2: Compute SVD
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: Explained variance ratio
explained_var = sigma ** 2 / (n_samples - 1)
explained_ratio = explained_var / explained_var.sum()
print(f"Singular values: {sigma.round(2)}")
print(f"Explained variance ratio: {explained_ratio.round(4)}")
print(f"Cumulative explained variance: {np.cumsum(explained_ratio).round(4)}")

# Step 4: Project to 2D
k = 2
Z = X_centered @ Vt[:k, :].T  # Equivalent to U[:, :k] @ np.diag(sigma[:k])
print(f"\nOriginal shape: {X.shape}")
print(f"Projected shape: {Z.shape}")
print(f"Variance retained: {explained_ratio[:k].sum():.2%}")

This example illustrates a general principle: real-world data often has much lower effective dimensionality than its nominal dimensionality. PCA (via SVD) discovers and exploits this structure. We will use PCA extensively as a preprocessing step and visualization tool in later chapters.

2.5.7 Numerical Stability and SVD

Numerical stability is a critical concern when working with matrix decompositions in practice. The condition number of a matrix---the ratio of its largest to smallest singular value, $\kappa(\mathbf{A}) = \sigma_{\max} / \sigma_{\min}$---determines how sensitive computations are to floating-point rounding errors.

When the condition number is large (the matrix is ill-conditioned), small errors in the input can produce large errors in the output. This directly affects AI systems:

  • Ill-conditioned feature matrices cause unstable linear regression coefficients.
  • Ill-conditioned Hessians (as we will see in Chapter 3) make second-order optimization methods unreliable.
  • Ill-conditioned weight matrices in deep networks can cause gradients to explode or vanish.

SVD provides a principled way to handle ill-conditioning: truncate small singular values rather than inverting them (which would amplify noise). This is exactly what the pseudoinverse does---and why np.linalg.lstsq and np.linalg.pinv are preferred over np.linalg.inv when solving least-squares problems.

import numpy as np

# Ill-conditioned matrix
A = np.array([[1.0, 1.0], [1.0, 1.0001]])
print(f"Condition number: {np.linalg.cond(A):.0f}")  # ~40,000

# Small change in b causes large change in solution
b1 = np.array([2.0, 2.0])
b2 = np.array([2.0, 2.0001])  # Tiny perturbation
x1 = np.linalg.solve(A, b1)
x2 = np.linalg.solve(A, b2)
print(f"Solution 1: {x1}")
print(f"Solution 2: {x2}")
print(f"Change in b: {np.linalg.norm(b2 - b1):.4f}")
print(f"Change in x: {np.linalg.norm(x2 - x1):.4f}")  # Much larger!

# Well-conditioned matrix for comparison
B = np.array([[2.0, 0.0], [0.0, 3.0]])
print(f"\nWell-conditioned matrix condition number: {np.linalg.cond(B):.1f}")  # 1.5

Practical tip: Always check the condition number of matrices before solving linear systems or computing inverses. If $\kappa(\mathbf{A}) > 10^{10}$, the results are likely unreliable in double precision (which has about 16 digits of precision). Use regularization or SVD-based methods instead.


2.6 Matrix Calculus Preview

2.6.1 Why Matrix Calculus?

As we saw in Chapter 1, calculus drives the optimization of machine learning models. But neural networks operate on vectors and matrices, not individual scalars. Matrix calculus extends derivatives to these higher-dimensional objects, enabling us to compute gradients efficiently through entire networks. We will explore this fully in Chapter 3, but this preview establishes the essential vocabulary.

2.6.2 Gradients

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ with respect to a vector $\mathbf{x}$ is the vector of partial derivatives:

$$ \nabla_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

where:

  • each component $\frac{\partial f}{\partial x_i}$ is the rate of change of $f$ along the $i$-th direction,
  • the gradient points in the direction of steepest ascent,
  • gradient descent moves in the direction $-\nabla_{\mathbf{x}} f$ to minimize $f$.

Worked Example. For $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$ where $\mathbf{A}$ is symmetric:

$$ \nabla_{\mathbf{x}} f = 2\mathbf{A}\mathbf{x} $$

This quadratic form appears in ridge regression, where the regularization term is $\lambda \|\mathbf{w}\|^2 = \lambda \mathbf{w}^T \mathbf{w}$.

2.6.3 The Jacobian Matrix

When a function maps vectors to vectors, $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the derivative is a matrix called the Jacobian:

$$ \mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} $$

where:

  • $\mathbf{J} \in \mathbb{R}^{m \times n}$,
  • row $i$ is the gradient of $f_i$,
  • the Jacobian is the linear approximation of $\mathbf{f}$ near a point.
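For a linear map $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$, the Jacobian is simply $\mathbf{A}$ itself. Here is a small sketch that confirms this numerically with central differences (the numerical_jacobian helper is illustrative, not a library routine):

import numpy as np

# For a linear map f(x) = A x, the Jacobian is A itself
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

def f(x):
    return A @ x

def numerical_jacobian(f, x, epsilon=1e-7):
    """Approximate the Jacobian of f at x using central differences."""
    m = len(f(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += epsilon
        x_minus[j] -= epsilon
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return J

x = np.array([1.0, -1.0, 2.0])
J = numerical_jacobian(f, x)
print(f"Numerical Jacobian:\n{J.round(6)}")
print(f"Equals A: {np.allclose(J, A)}")  # True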

2.6.4 Key Matrix Calculus Identities

The following identities appear constantly in AI. You do not need to memorize the proofs, but you should recognize and apply them:

  • $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$: $\nabla_{\mathbf{x}} f = \mathbf{a}$ (linear function).
  • $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$: $\nabla_{\mathbf{x}} f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$, which reduces to $2\mathbf{A}\mathbf{x}$ when $\mathbf{A}$ is symmetric.
  • $f(\mathbf{x}) = \|\mathbf{x}\|_2^2$: $\nabla_{\mathbf{x}} f = 2\mathbf{x}$ (special case of the quadratic with $\mathbf{A} = \mathbf{I}$).
  • $f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$: $\nabla_{\mathbf{x}} f = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ (the least-squares objective).

import numpy as np

def numerical_gradient(f, x, epsilon=1e-7):
    """Compute the gradient of f at x using central differences.

    Args:
        f: Scalar-valued function of a vector.
        x: Point at which to evaluate the gradient.
        epsilon: Step size for finite differences.

    Returns:
        Approximate gradient as a NumPy array.
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += epsilon
        x_minus[i] -= epsilon
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad

# Example: f(x) = x^T A x
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])

f = lambda x: x @ A @ x
analytical_grad = 2 * A @ x  # Because A is symmetric
numerical_grad = numerical_gradient(f, x)

print(f"Analytical gradient: {analytical_grad}")  # [ 8. 14.]
print(f"Numerical gradient: {numerical_grad}")
print(f"Match: {np.allclose(analytical_grad, numerical_grad)}")  # True

2.6.5 Connecting to Backpropagation

The chain rule, which we saw in Chapter 1 for scalar functions, extends to the matrix setting. If $\mathbf{y} = \mathbf{f}(\mathbf{x})$ and $L = g(\mathbf{y})$, then:

$$ \frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}_{\mathbf{f}}^T \frac{\partial L}{\partial \mathbf{y}} $$

where:

  • $\mathbf{J}_{\mathbf{f}}$ is the Jacobian of $\mathbf{f}$,
  • $\frac{\partial L}{\partial \mathbf{y}}$ is the upstream gradient.

This is precisely what backpropagation computes, layer by layer. For a linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$:

$$ \frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^T, \qquad \frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{y}} $$

We will derive and implement these formulas fully in Chapter 3 when we build backpropagation from scratch.

2.6.6 The Hessian Matrix

When we need second-order information about a scalar function---for instance, to determine whether a critical point is a minimum, maximum, or saddle point---we use the Hessian matrix:

$$ \mathbf{H} = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} $$

where:

  • $\mathbf{H} \in \mathbb{R}^{n \times n}$ is symmetric (assuming continuous second derivatives),
  • the eigenvalues of $\mathbf{H}$ determine the local curvature: all positive means a local minimum, all negative means a local maximum, mixed signs means a saddle point.

The Hessian connects matrix calculus back to eigendecomposition (Section 2.4). In deep learning, the loss landscape of a neural network is characterized by the Hessian of the loss function with respect to the parameters. Research has shown that the Hessian at converged solutions tends to have many near-zero eigenvalues (flat directions) and a few large eigenvalues, which partially explains why gradient descent works well despite the high-dimensional non-convex landscape. We will explore this further in Chapter 3.

Worked Example: Hessian of a Quadratic.

For $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$ where $\mathbf{A}$ is symmetric:

$$ \nabla f = \mathbf{A}\mathbf{x} - \mathbf{b}, \qquad \mathbf{H} = \mathbf{A} $$

Setting the gradient to zero gives $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$. This is a minimum if and only if $\mathbf{A}$ is positive definite (all eigenvalues positive).

import numpy as np

# Quadratic function: f(x) = 0.5 * x^T A x - b^T x
A = np.array([[4.0, 1.0], [1.0, 3.0]])  # Positive definite
b = np.array([1.0, 2.0])

# Optimal point
x_star = np.linalg.solve(A, b)
print(f"Optimal point: {x_star}")

# Verify: Hessian eigenvalues are all positive (=> minimum)
eigenvalues = np.linalg.eigvalsh(A)
print(f"Hessian eigenvalues: {eigenvalues}")
print(f"All positive (minimum): {np.all(eigenvalues > 0)}")

2.6.7 Matrix Calculus for a Full Neural Network Layer

Let us put all the pieces together by computing the gradients for a complete neural network layer. Consider a single layer with ReLU activation:

$$ \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \text{ReLU}(\mathbf{z}) $$

Given the upstream gradient $\frac{\partial L}{\partial \mathbf{a}}$, we need to compute three quantities during backpropagation:

  1. Gradient with respect to the pre-activation: $\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{a}} \odot \mathbf{1}[\mathbf{z} > 0]$ (element-wise multiplication with the ReLU derivative)

  2. Gradient with respect to the weights: $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{z}} \mathbf{x}^T$ (outer product)

  3. Gradient with respect to the input: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T \frac{\partial L}{\partial \mathbf{z}}$ (passes the gradient backward)

Notice how the transpose appears naturally: during the forward pass we multiply by $\mathbf{W}$, and during the backward pass we multiply by $\mathbf{W}^T$. This is the matrix chain rule in action.

import numpy as np

def dense_layer_forward(x, W, b):
    """Forward pass of a dense layer with ReLU."""
    z = W @ x + b
    a = np.maximum(0, z)  # ReLU
    return z, a

def dense_layer_backward(dL_da, z, x, W):
    """Backward pass: compute gradients."""
    # Gradient through ReLU
    dL_dz = dL_da * (z > 0).astype(float)
    # Gradient w.r.t. weights
    dL_dW = np.outer(dL_dz, x)
    # Gradient w.r.t. bias
    dL_db = dL_dz
    # Gradient w.r.t. input (to pass backward)
    dL_dx = W.T @ dL_dz
    return dL_dW, dL_db, dL_dx

# Example
np.random.seed(42)
x = np.random.randn(4)       # 4-dim input
W = np.random.randn(3, 4)    # 3x4 weight matrix
b = np.zeros(3)              # 3-dim bias
dL_da = np.random.randn(3)   # Upstream gradient

z, a = dense_layer_forward(x, W, b)
dL_dW, dL_db, dL_dx = dense_layer_backward(dL_da, z, x, W)

print(f"Input shape: {x.shape}")
print(f"Weight gradient shape: {dL_dW.shape}")  # (3, 4), same as W
print(f"Bias gradient shape: {dL_db.shape}")     # (3,), same as b
print(f"Input gradient shape: {dL_dx.shape}")    # (4,), same as x

2.7 Putting It All Together: A NumPy Linear Algebra Toolkit

This section consolidates the key NumPy operations you have learned into a practical reference. Every function here maps directly to a concept covered earlier in this chapter.

2.7.1 Creating Vectors and Matrices

import numpy as np

# Vectors
v = np.array([1, 2, 3])               # From a list
v_zeros = np.zeros(5)                  # Zero vector
v_ones = np.ones(5)                    # Ones vector
v_range = np.arange(0, 10, 2)         # [0, 2, 4, 6, 8]
v_linspace = np.linspace(0, 1, 5)     # 5 evenly spaced values in [0, 1]
v_random = np.random.randn(5)         # Standard normal random vector

# Matrices
M = np.array([[1, 2], [3, 4]])        # From nested lists
M_zeros = np.zeros((3, 4))            # 3x4 zero matrix
M_eye = np.eye(3)                     # 3x3 identity
M_diag = np.diag([1, 2, 3])           # Diagonal matrix
M_random = np.random.randn(3, 4)      # Random 3x4 matrix

2.7.2 Core Operations Reference

import numpy as np

A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[5, 6], [7, 8]], dtype=float)
v = np.array([1, 2], dtype=float)

# --- Arithmetic ---
print("A + B =\n", A + B)             # Element-wise addition
print("A * B =\n", A * B)             # Element-wise multiplication (Hadamard)
print("A @ B =\n", A @ B)             # Matrix multiplication
print("A @ v =\n", A @ v)             # Matrix-vector multiplication

# --- Decompositions ---
eigenvalues, eigenvectors = np.linalg.eig(A)
U, sigma, Vt = np.linalg.svd(A)

# --- Properties ---
print("det(A) =", np.linalg.det(A))
print("trace(A) =", np.trace(A))
print("rank(A) =", np.linalg.matrix_rank(A))
print("cond(A) =", np.linalg.cond(A))
print("norm(A) =", np.linalg.norm(A))           # Frobenius norm
print("norm(v) =", np.linalg.norm(v))           # L2 norm

# --- Solving systems ---
b = np.array([5, 11], dtype=float)
x = np.linalg.solve(A, b)
print("Solution to Ax = b:", x)

2.7.3 Broadcasting and Vectorization

NumPy's broadcasting allows operations between arrays of different shapes, eliminating slow Python loops. Understanding broadcasting is essential for writing efficient AI code.

import numpy as np

# Broadcasting: subtract the mean from each column
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]], dtype=float)
col_means = data.mean(axis=0)  # [4, 5, 6]
centered = data - col_means    # Broadcasting subtracts from each row
print(f"Centered data:\n{centered}")

# Pairwise distance matrix using broadcasting
X = np.random.randn(5, 3)  # 5 points in 3D
# Expand dimensions for broadcasting
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]  # (5, 5, 3)
dist_matrix = np.linalg.norm(diff, axis=2)         # (5, 5)
print(f"Distance matrix shape: {dist_matrix.shape}")
print(f"Distance matrix is symmetric: {np.allclose(dist_matrix, dist_matrix.T)}")

2.7.4 Performance Considerations

Linear algebra operations in NumPy call optimized BLAS and LAPACK libraries written in C and Fortran. To maximize performance:

  1. Vectorize: Replace Python loops with NumPy operations.
  2. Avoid copies: Use views and in-place operations when possible.
  3. Choose the right decomposition: Use np.linalg.eigh for symmetric matrices (typically about twice as fast as the general eig) and np.linalg.solve instead of inv when solving linear systems.
  4. Mind memory layout: Row-major (C order) vs. column-major (Fortran order) affects cache performance.
  5. Use appropriate data types: float32 uses half the memory and bandwidth of float64, and is the default for GPU-based deep learning. Only use float64 when you need the extra precision (e.g., numerical gradient checking).
  6. Preallocate arrays: Avoid creating temporary arrays in tight loops by preallocating output buffers.

The benchmark below makes the case for point 1 concrete: the same matrix-vector product is computed with an explicit Python loop and with a single NumPy call.

import numpy as np
import time

n = 1000

# Slow: Python loop for matrix-vector multiplication
A = np.random.randn(n, n)
x = np.random.randn(n)

start = time.time()
result_loop = np.zeros(n)
for i in range(n):
    for j in range(n):
        result_loop[i] += A[i, j] * x[j]
loop_time = time.time() - start

# Fast: NumPy matrix-vector multiplication
start = time.time()
result_numpy = A @ x
numpy_time = time.time() - start

print(f"Loop time: {loop_time:.4f}s")
print(f"NumPy time: {numpy_time:.6f}s")
print(f"Speedup: {loop_time / numpy_time:.0f}x")
print(f"Results match: {np.allclose(result_loop, result_numpy)}")

Memory Layout and Contiguity. NumPy arrays can be stored in C order (row-major, the default) or Fortran order (column-major). When BLAS routines expect one layout and receive the other, they must internally copy the data, destroying the performance advantage. For most AI work, C order is fine, but be aware that some operations---especially column-wise operations on large matrices---may benefit from Fortran order.

import numpy as np

# Check memory layout
A = np.random.randn(1000, 1000)
print(f"C-contiguous: {A.flags['C_CONTIGUOUS']}")  # True
print(f"F-contiguous: {A.flags['F_CONTIGUOUS']}")  # False

# Force Fortran order
A_f = np.asfortranarray(A)
print(f"F-contiguous: {A_f.flags['F_CONTIGUOUS']}")  # True

2.7.5 Common Pitfalls and Debugging Tips

When working with linear algebra in AI code, several common mistakes can cost you hours of debugging:

  1. Shape mismatches: Always print shapes before matrix operations. The most common bug in neural network code is a shape mismatch that either crashes or silently broadcasts incorrectly.

  2. Confusing element-wise and matrix multiplication: In NumPy, * is element-wise and @ is matrix multiplication. In MATLAB, the bare * means matrix multiplication and .* is element-wise, so the same symbol means the opposite thing. This catches many people transitioning between the two.

  3. Forgetting to center data before PCA: SVD on uncentered data does not give you principal components---it gives you something else entirely.

  4. Numerical issues with log(0) or division by zero: When computing log-likelihoods or normalizing vectors, always add a small epsilon: np.log(x + 1e-12), x / (np.linalg.norm(x) + 1e-12).

  5. Modifying arrays in-place unintentionally: Basic NumPy slicing returns views, not copies (fancy and boolean indexing return copies). If you modify a view, the original array changes too. Use .copy() when you need an independent array.

import numpy as np

# Pitfall: views vs copies
A = np.array([[1, 2], [3, 4]])
row = A[0]      # This is a VIEW, not a copy
row[0] = 999    # This modifies A!
print(f"A after modifying view:\n{A}")  # A[0,0] is now 999

# Safe: make a copy
A = np.array([[1, 2], [3, 4]])
row = A[0].copy()  # Independent copy
row[0] = 999       # A is unchanged
print(f"A after modifying copy:\n{A}")  # A[0,0] is still 1

2.8 Summary

In this chapter, you have built a comprehensive linear algebra toolkit for AI engineering. We began with vectors as the atomic units of data, progressed through matrices as the engines of transformation, and culminated with eigendecomposition and SVD---the two factorizations that power dimensionality reduction, data compression, and stability analysis. Along the way, we covered sparse matrices for efficient storage, numerical stability considerations, and the matrix calculus that bridges linear algebra to the optimization techniques of deep learning.

Key Ideas to Carry Forward

  • Vectors represent data; the dot product measures similarity and underpins attention mechanisms, cosine similarity search, and every linear layer in a neural network.
  • Matrices are linear transformations; multiplication is function composition. Stacking linear transformations without nonlinearities collapses to a single transformation---which is why activation functions are essential.
  • Sparse matrices save memory and computation when most entries are zero, as in NLP bag-of-words models, graph adjacency matrices, and recommender system interaction matrices.
  • Eigendecomposition reveals the natural axes and scaling factors of a transformation. It powers PCA, spectral clustering, stability analysis, and the study of Hessians in optimization.
  • SVD generalizes eigendecomposition to rectangular matrices and provides optimal low-rank approximations. It is the mathematical foundation of PCA, recommender systems, latent semantic analysis, and pseudoinverse computation.
  • Matrix calculus extends derivatives to the vector and matrix settings, enabling gradient-based optimization of neural networks. The Jacobian, Hessian, and chain rule in matrix form are the building blocks of backpropagation.
  • Numerical stability is not a theoretical curiosity---it determines whether your training runs succeed or fail. Condition numbers, the choice between solve and inv, and working in log-space are practical skills every AI engineer needs.

Key Formulas Reference

| Concept | Formula | Section |
|---|---|---|
| Dot product | $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$ | 2.1.3 |
| Cosine similarity | $\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert \, \lVert\mathbf{v}\rVert}$ | 2.1.3 |
| Matrix multiply | $c_{ij} = \sum_k a_{ik} b_{kj}$ | 2.2.2 |
| Eigendecomposition | $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ | 2.4.3 |
| SVD | $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ | 2.5.2 |
| Gradient of $\mathbf{x}^T\mathbf{A}\mathbf{x}$ | $2\mathbf{A}\mathbf{x}$ (for symmetric $\mathbf{A}$) | 2.6.2 |
| Backprop through linear layer | $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}}\mathbf{x}^T$ | 2.6.5 |
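
When in doubt about an identity such as the quadratic-form gradient, a quick numerical check settles it. The sketch below compares the formula from the table against a central finite-difference approximation; the helper numerical_grad is an illustrative utility, not something defined earlier in the chapter.

import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric
x = np.array([0.5, -1.0])

f = lambda v: v @ A @ v                  # quadratic form x^T A x
analytic = 2 * A @ x                     # gradient formula from the table
numeric = numerical_grad(f, x)

print(f"Analytic gradient:  {analytic}")
print(f"Numerical gradient: {numeric}")
print(f"Match: {np.allclose(analytic, numeric, atol=1e-5)}")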

Looking Ahead

In Chapter 3, we will explore optimization and calculus in depth, building directly on the gradient and Jacobian concepts introduced here. The matrix calculus identities from Section 2.6 will become the building blocks of backpropagation. In Chapter 4, we will see how covariance matrices (symmetric, positive semi-definite) arise naturally from probability distributions, connecting the linear algebra of this chapter to the probabilistic foundations of machine learning.

Linear algebra is not a prerequisite to be checked off---it is a living toolkit you will use every day as an AI engineer. The NumPy fluency you have developed here will serve you from data preprocessing to model debugging, from the first line of a project to the last.