Chapter 2 Exercises: Linear Algebra for AI
These exercises are organized into five parts of increasing difficulty. Part A covers fundamental operations, Part B addresses matrix properties and decompositions, Part C involves applications to AI problems, Part D presents proof-based and theoretical questions, and Part E offers open-ended implementation challenges.
Part A: Vectors and Basic Operations
Exercise 2.1 (Vector Arithmetic) Given the vectors $\mathbf{u} = [3, -1, 4]^T$ and $\mathbf{v} = [2, 5, -2]^T$, compute by hand and verify with NumPy:
(a) $\mathbf{u} + \mathbf{v}$ (b) $3\mathbf{u} - 2\mathbf{v}$ (c) $\mathbf{u} \cdot \mathbf{v}$ (d) $\|\mathbf{u}\|_1$, $\|\mathbf{u}\|_2$, and $\|\mathbf{u}\|_\infty$
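A minimal NumPy sketch for the verification step; the array names `u` and `v` simply mirror the vectors above.

```python
import numpy as np

u = np.array([3, -1, 4])
v = np.array([2, 5, -2])

print(u + v)                      # (a) vector sum
print(3 * u - 2 * v)              # (b) linear combination
print(np.dot(u, v))               # (c) dot product
print(np.linalg.norm(u, 1),       # (d) L1 norm
      np.linalg.norm(u, 2),       #     L2 norm
      np.linalg.norm(u, np.inf))  #     L-infinity norm
```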
Exercise 2.2 (Cosine Similarity) You have three word embeddings (simplified to $\mathbb{R}^4$):
- "king" $= [1.0, 0.8, 0.2, 0.5]$
- "queen" $= [0.9, 0.9, 0.3, 0.4]$
- "car" $= [0.1, 0.2, 0.9, 0.8]$
(a) Compute the cosine similarity between "king" and "queen." (b) Compute the cosine similarity between "king" and "car." (c) Which pair is more similar? Explain why cosine similarity is preferred over Euclidean distance for word embeddings.
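One possible check for parts (a) and (b); `cosine` is a small helper defined here, not a library function.

```python
import numpy as np

def cosine(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([1.0, 0.8, 0.2, 0.5])
queen = np.array([0.9, 0.9, 0.3, 0.4])
car   = np.array([0.1, 0.2, 0.9, 0.8])

print(cosine(king, queen))  # expect a value close to 1
print(cosine(king, car))    # expect a noticeably smaller value
```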
Exercise 2.3 (Projection) Project the vector $\mathbf{u} = [4, 3, 1]^T$ onto $\mathbf{v} = [1, 1, 0]^T$.
(a) Compute $\text{proj}_{\mathbf{v}} \mathbf{u}$ analytically. (b) Compute the residual $\mathbf{u} - \text{proj}_{\mathbf{v}} \mathbf{u}$. (c) Verify that the residual is orthogonal to $\mathbf{v}$.
Exercise 2.4 (Linear Independence) Determine whether the following sets of vectors are linearly independent. Justify your answer using the rank of the matrix formed by the vectors.
(a) $\{[1, 0, 2]^T, [0, 1, -1]^T, [2, 1, 3]^T\}$ (b) $\{[1, 2, 3]^T, [4, 5, 6]^T, [7, 8, 9]^T\}$ (c) $\{[1, 0]^T, [0, 1]^T, [1, 1]^T\}$ (These vectors live in $\mathbb{R}^2$: can any set of three vectors in $\mathbb{R}^2$ be linearly independent?)
Exercise 2.5 (Orthonormal Basis) Apply the Gram-Schmidt process to the vectors $\mathbf{v}_1 = [1, 1, 0]^T$ and $\mathbf{v}_2 = [1, 0, 1]^T$ to produce an orthonormal set $\{\mathbf{e}_1, \mathbf{e}_2\}$. Verify orthonormality by checking that $\mathbf{e}_1 \cdot \mathbf{e}_2 = 0$ and $\|\mathbf{e}_1\| = \|\mathbf{e}_2\| = 1$.
Exercise 2.6 (Norm Comparison) For the vector $\mathbf{x} = [1, -2, 3, -4, 5]^T$:
(a) Compute $\|\mathbf{x}\|_1$, $\|\mathbf{x}\|_2$, and $\|\mathbf{x}\|_\infty$. (b) Verify the inequality $\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1$. (c) Explain why the $L^1$ norm encourages sparsity in model weights, while the $L^2$ norm encourages small but non-zero weights.
Part B: Matrix Operations and Decompositions
Exercise 2.7 (Matrix Multiplication) Given:
$$ \mathbf{A} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} 5 & 7 \\ 6 & 8 \end{bmatrix} $$
(a) Compute $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ by hand. (b) Show that $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$. (c) Compute $(\mathbf{A}\mathbf{B})^T$ and verify that it equals $\mathbf{B}^T \mathbf{A}^T$.
Exercise 2.8 (Inverse and Determinant) For the matrix:
$$ \mathbf{A} = \begin{bmatrix} 3 & 1 \\ 5 & 2 \end{bmatrix} $$
(a) Compute $\det(\mathbf{A})$ by hand.
(b) Compute $\mathbf{A}^{-1}$ using the $2 \times 2$ inverse formula.
(c) Verify that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$.
(d) Solve the system $\mathbf{A}\mathbf{x} = [7, 12]^T$ using both the inverse and np.linalg.solve.
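A sketch for part (d); both routes should produce the same $\mathbf{x}$, but np.linalg.solve (which factorizes rather than inverts) is the numerically preferred one.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [5.0, 2.0]])
b = np.array([7.0, 12.0])

x_inv   = np.linalg.inv(A) @ b   # via the explicit inverse
x_solve = np.linalg.solve(A, b)  # via LU factorization (preferred)

print(x_inv, x_solve, np.allclose(x_inv, x_solve))
```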
Exercise 2.9 (Rank and Null Space) For the matrix:
$$ \mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} $$
(a) Compute the rank of $\mathbf{A}$. (b) Find a basis for the null space of $\mathbf{A}$. (c) Verify the rank-nullity theorem: $\text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n$. (d) Explain geometrically what this matrix does to vectors in $\mathbb{R}^3$.
Exercise 2.10 (Eigendecomposition) For the matrix:
$$ \mathbf{A} = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} $$
(a) Find the eigenvalues by solving $\det(\mathbf{A} - \lambda \mathbf{I}) = 0$. (b) Find the eigenvectors for each eigenvalue. (c) Write the eigendecomposition $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$. (d) Verify by reconstructing $\mathbf{A}$ from the decomposition.
Exercise 2.11 (Symmetric Eigendecomposition) For the symmetric matrix:
$$ \mathbf{S} = \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} $$
(a) Compute the eigenvalues and eigenvectors. (b) Verify that the eigenvectors are orthogonal. (c) Is $\mathbf{S}$ positive definite, positive semidefinite, or indefinite? Justify using the eigenvalues. (d) Compute $\mathbf{S}^{10}$ using the eigendecomposition (hint: $\mathbf{S}^n = \mathbf{V}\boldsymbol{\Lambda}^n\mathbf{V}^{-1}$, and since the eigenvectors of a symmetric matrix can be chosen orthonormal, $\mathbf{V}^{-1} = \mathbf{V}^T$).
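A sketch of part (d) using np.linalg.eigh, which returns orthonormal eigenvectors for symmetric matrices; the sanity check against np.linalg.matrix_power is optional.

```python
import numpy as np

S = np.array([[4.0, 2.0],
              [2.0, 1.0]])

# eigh is specialized for symmetric matrices: eigenvalues ascending, V orthonormal
eigvals, V = np.linalg.eigh(S)

# S^10 = V diag(lambda^10) V^T  (V^{-1} = V^T because V is orthogonal)
S10 = V @ np.diag(eigvals**10) @ V.T

print(np.allclose(S10, np.linalg.matrix_power(S, 10)))  # sanity check
```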
Exercise 2.12 (SVD by Hand) For the matrix:
$$ \mathbf{A} = \begin{bmatrix} 4 & 0 \\ 3 & -5 \end{bmatrix} $$
(a) Compute $\mathbf{A}^T\mathbf{A}$ and find its eigenvalues.
(b) The singular values of $\mathbf{A}$ are the square roots of these eigenvalues. What are they?
(c) Verify using np.linalg.svd.
Exercise 2.13 (Truncated SVD) Create a $100 \times 50$ random matrix in NumPy. Compute its SVD and:
(a) Plot the singular values. How quickly do they decay? (b) Compute the rank-10 approximation. (c) Report the relative Frobenius norm error $\|\mathbf{A} - \mathbf{A}_{10}\|_F / \|\mathbf{A}\|_F$. (d) What is the compression ratio (number of parameters in the rank-10 representation vs. the original)?
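A possible skeleton for parts (b)-(d); the random seed and the use of a standard normal matrix are illustrative choices, and np.linalg.norm on a matrix defaults to the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-10 approximation

rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)  # relative Frobenius error
n_orig   = 100 * 50
n_rank10 = k * (100 + 50 + 1)                      # U_k, V_k, and k singular values
print(rel_err, n_orig / n_rank10)                  # error and compression ratio
```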
Exercise 2.14 (Matrix Powers via Eigendecomposition) Given:
$$ \mathbf{A} = \begin{bmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{bmatrix} $$
This is the transition matrix of a two-state Markov chain written in the column convention: each column sums to 1, and a state distribution evolves as $\mathbf{p}_{k+1} = \mathbf{A}\mathbf{p}_k$.
(a) Compute $\mathbf{A}^{10}$, $\mathbf{A}^{50}$, and $\mathbf{A}^{100}$ using eigendecomposition. (b) What does $\mathbf{A}^n$ converge to as $n \to \infty$? (c) Find the stationary distribution (the eigenvector corresponding to $\lambda = 1$, normalized to sum to 1).
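A sketch for parts (a) and (c); the helper `power_via_eig` is illustrative, and the stationary distribution is the $\lambda = 1$ eigenvector rescaled to sum to 1.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])

eigvals, V = np.linalg.eig(A)

def power_via_eig(n):
    # A^n = V diag(lambda^n) V^{-1}
    return V @ np.diag(eigvals**n) @ np.linalg.inv(V)

print(power_via_eig(100).real)

# stationary distribution: eigenvector for the eigenvalue closest to 1
idx = np.argmin(np.abs(eigvals - 1.0))
pi = V[:, idx].real
pi = pi / pi.sum()
print(pi)   # each column of A^n should approach this vector as n grows
```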
Part C: AI Applications
Exercise 2.15 (Feature Normalization) Given the data matrix where rows are samples and columns are features:
$$ \mathbf{X} = \begin{bmatrix} 100 & 3 & 20 \\ 200 & 5 & 15 \\ 150 & 4 & 25 \\ 300 & 2 & 10 \end{bmatrix} $$
(a) Compute the mean and standard deviation of each feature (column). (b) Standardize $\mathbf{X}$ (zero mean, unit variance per column). (c) Compute the covariance matrix of the standardized data. What does it tell you about feature correlations?
Exercise 2.16 (Least Squares Regression) Given training data points $(1, 2)$, $(2, 3.5)$, $(3, 5.5)$, $(4, 7)$, $(5, 8.5)$:
(a) Set up the design matrix $\mathbf{X}$ (with a column of ones for the bias) and target vector $\mathbf{y}$.
(b) Solve the normal equations $\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ to find the best-fit line.
(c) Solve using np.linalg.lstsq and verify the results match.
(d) Compute the predicted values and the mean squared error.
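A sketch of the pipeline in parts (a)-(d); `rcond=None` just silences a deprecation warning in np.linalg.lstsq.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3.5, 5.5, 7, 8.5])

X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y    # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w_normal
mse = np.mean((y - y_hat) ** 2)
print(w_normal, w_lstsq, mse)
```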
Exercise 2.17 (PCA from Scratch) Generate 500 samples from a 2D Gaussian with mean $[3, 5]$ and covariance $\begin{bmatrix} 2 & 1.5 \\ 1.5 & 3 \end{bmatrix}$.
(a) Center the data (subtract the mean). (b) Compute the covariance matrix. (c) Find the eigenvectors and eigenvalues. (d) Project the data onto the first principal component. (e) What fraction of the total variance is captured by the first component?
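A minimal sketch of the full procedure, assuming a fixed seed for reproducibility; np.linalg.eigh returns eigenvalues in ascending order, so the last column is the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([3.0, 5.0])
cov = np.array([[2.0, 1.5],
                [1.5, 3.0]])
X = rng.multivariate_normal(mean, cov, size=500)

Xc = X - X.mean(axis=0)                  # (a) center the data
C = np.cov(Xc, rowvar=False)             # (b) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # (c) eigenvalues ascending

pc1 = eigvecs[:, -1]                     # direction of largest variance
scores = Xc @ pc1                        # (d) 1-D projection onto PC1
explained = eigvals[-1] / eigvals.sum()  # (e) fraction of variance captured
print(explained)
```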
Exercise 2.18 (Cosine Similarity Search) Create a matrix of 10 random "document embeddings" of dimension 50. Given a query vector, implement a function that returns the $k$ most similar documents using cosine similarity. Your implementation must use only NumPy (no loops over documents).
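One vectorized approach as a starting point; the function name `top_k_cosine` and the shapes (10 documents of dimension 50) are illustrative.

```python
import numpy as np

def top_k_cosine(docs, query, k=3):
    # normalize every row of docs and the query, then one matrix-vector product
    doc_norms = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = doc_norms @ q_norm                 # cosine similarity to every document
    top = np.argsort(sims)[::-1][:k]          # indices of the k largest similarities
    return top, sims[top]

rng = np.random.default_rng(0)
docs = rng.standard_normal((10, 50))
query = rng.standard_normal(50)
print(top_k_cosine(docs, query, k=3))
```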
Exercise 2.19 (Image as a Matrix) Load or create a $64 \times 64$ grayscale "image" matrix (you may use random values or a structured pattern). Compute the SVD and:
(a) Reconstruct the image using ranks $k = 1, 5, 10, 20, 32, 64$. (b) For each rank, compute the compression ratio and the relative error. (c) At what rank does the image become visually "good enough"? (Define a threshold of, say, 5% relative error.)
Exercise 2.20 (Batch Matrix Operations) In a neural network, a batch of $B = 32$ input vectors, each of dimension $d_{in} = 128$, is multiplied by a weight matrix $\mathbf{W} \in \mathbb{R}^{128 \times 64}$ and has a bias $\mathbf{b} \in \mathbb{R}^{64}$ added.
(a) Write NumPy code to compute $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$ for a batch (using broadcasting for the bias). (b) Verify that $\mathbf{Y}$ has shape $(32, 64)$. (c) Time this operation for batch sizes $B \in \{1, 32, 128, 512, 2048\}$ and plot the results. What do you observe about throughput?
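A sketch for parts (a) and (b); the random initialization is illustrative, and the timing in part (c) could use, for example, time.perf_counter.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 128, 64

X = rng.standard_normal((B, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

Y = X @ W + b          # broadcasting adds b to every row of X @ W
print(Y.shape)         # (32, 64)
```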
Exercise 2.21 (Softmax as a Vector Operation) The softmax function converts a vector of real numbers to a probability distribution:
$$ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} $$
(a) Implement softmax in NumPy. (b) Show that subtracting $\max(\mathbf{z})$ from $\mathbf{z}$ before exponentiation does not change the result but improves numerical stability. Demonstrate with $\mathbf{z} = [1000, 1001, 1002]$. (c) Compute the Jacobian matrix of softmax analytically. Verify with numerical differentiation.
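A minimal sketch for parts (a) and (b); the max-subtraction trick is the standard stabilization, and the ratios are unchanged because the same factor $e^{-\max(\mathbf{z})}$ appears in numerator and denominator.

```python
import numpy as np

def softmax(z):
    # subtracting the max shifts the exponents without changing the ratios:
    # e^{z_i - m} / sum_j e^{z_j - m} = e^{z_i} / sum_j e^{z_j}
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax([1000, 1001, 1002]))   # stable; a naive exp(1000) would overflow
```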
Part D: Theoretical Questions
Exercise 2.22 (Properties of Eigenvalues) Prove or disprove each statement:
(a) If $\lambda$ is an eigenvalue of $\mathbf{A}$, then $\lambda^2$ is an eigenvalue of $\mathbf{A}^2$. (b) If $\lambda$ is an eigenvalue of $\mathbf{A}$, then $\lambda + c$ is an eigenvalue of $\mathbf{A} + c\mathbf{I}$. (c) The eigenvalues of $\mathbf{A}^T$ are the same as those of $\mathbf{A}$. (d) The trace of $\mathbf{A}$ equals the sum of its eigenvalues.
Exercise 2.23 (SVD and Eigendecomposition Relationship) Let $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ be the SVD of $\mathbf{A}$.
(a) Show that the columns of $\mathbf{V}$ are eigenvectors of $\mathbf{A}^T\mathbf{A}$. (b) Show that the columns of $\mathbf{U}$ are eigenvectors of $\mathbf{A}\mathbf{A}^T$. (c) What is the relationship between the singular values of $\mathbf{A}$ and the eigenvalues of $\mathbf{A}^T\mathbf{A}$?
Exercise 2.24 (Orthogonal Matrix Properties) Let $\mathbf{Q}$ be an orthogonal matrix ($\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$).
(a) Show that $\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$ for any vector $\mathbf{x}$ (norm preservation). (b) Show that $|\det(\mathbf{Q})| = 1$. (c) Explain why norm-preserving transformations are desirable in neural networks. (Hint: think about gradient flow.)
Exercise 2.25 (Positive Definiteness) For each matrix, determine whether it is positive definite, positive semidefinite, negative definite, negative semidefinite, or indefinite:
(a) $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$
(b) $\begin{bmatrix} 1 & 3 \\ 3 & 1 \end{bmatrix}$
(c) $\begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}$
(d) For a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (data matrix), show that $\mathbf{X}^T\mathbf{X}$ is always positive semidefinite.
Part E: Implementation Challenges
Exercise 2.26 (Power Iteration) Implement the power iteration algorithm to find the dominant eigenvalue and eigenvector of a matrix:
- Start with a random vector $\mathbf{b}_0$.
- Iterate: $\mathbf{b}_{k+1} = \frac{\mathbf{A}\mathbf{b}_k}{\|\mathbf{A}\mathbf{b}_k\|}$.
- The dominant eigenvalue is approximated by $\lambda \approx \mathbf{b}_k^T \mathbf{A} \mathbf{b}_k$.
Test on a $5 \times 5$ random symmetric matrix and compare with np.linalg.eigh.
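A possible implementation of the iteration above; the helper name `power_iteration` and the fixed iteration count are illustrative choices.

```python
import numpy as np

def power_iteration(A, num_iters=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    b = rng.standard_normal(A.shape[0])
    b /= np.linalg.norm(b)
    for _ in range(num_iters):
        Ab = A @ b
        b = Ab / np.linalg.norm(Ab)      # renormalize each step
    return b @ A @ b, b                  # Rayleigh quotient, eigenvector estimate

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                        # random symmetric test matrix

lam, v = power_iteration(A, rng=rng)
eigvals, _ = np.linalg.eigh(A)
print(lam, eigvals[np.argmax(np.abs(eigvals))])  # should agree (v only up to sign)
```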
Exercise 2.27 (Gram-Schmidt QR Decomposition)
Implement the Gram-Schmidt process to compute the QR decomposition of a matrix $\mathbf{A} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is orthogonal and $\mathbf{R}$ is upper triangular. Compare your result with np.linalg.qr.
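A sketch of classical Gram-Schmidt QR, assuming full column rank; note that np.linalg.qr may return $\mathbf{Q}$ and $\mathbf{R}$ with different column signs, so comparing $\mathbf{Q}\mathbf{R}$ against $\mathbf{A}$ is the safer check.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: A (m x n, full column rank) -> Q (m x n), R (n x n)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # projection coefficient onto q_i
            v -= R[i, j] * Q[:, i]        # remove the component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))
```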
Exercise 2.28 (Low-Rank Matrix Completion) Create a $20 \times 15$ matrix of rank 3. Randomly mask 50% of the entries. Implement a simple alternating least squares algorithm to recover the missing entries:
- Initialize random factor matrices $\mathbf{U} \in \mathbb{R}^{20 \times 3}$ and $\mathbf{V} \in \mathbb{R}^{15 \times 3}$.
- Fix $\mathbf{V}$ and solve for $\mathbf{U}$ using only observed entries.
- Fix $\mathbf{U}$ and solve for $\mathbf{V}$ using only observed entries.
- Repeat until convergence.
Report the reconstruction error on the masked entries.
Exercise 2.29 (Numerical Stability) The Hilbert matrix $H_{ij} = \frac{1}{i + j - 1}$ is notoriously ill-conditioned.
(a) Construct the $5 \times 5$, $10 \times 10$, and $15 \times 15$ Hilbert matrices. (b) Compute their condition numbers. (c) Solve $\mathbf{H}\mathbf{x} = \mathbf{b}$ for a known $\mathbf{x}$ and observe how the solution error grows with matrix size. (d) Explain why ill-conditioned matrices are problematic for numerical computations in AI.
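A sketch for parts (a)-(c); the `hilbert` helper is defined here rather than imported, and the true solution $\mathbf{x}$ of all ones is an arbitrary illustrative choice.

```python
import numpy as np

def hilbert(n):
    # H[i, j] = 1 / (i + j - 1) with 1-based indices
    i, j = np.indices((n, n)) + 1
    return 1.0 / (i + j - 1)

for n in (5, 10, 15):
    H = hilbert(n)
    x_true = np.ones(n)
    b = H @ x_true
    x_hat = np.linalg.solve(H, b)
    err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(n, np.linalg.cond(H), err)   # condition number and relative solution error
```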
Exercise 2.30 (Efficient Pairwise Distances) Implement a function that computes the pairwise Euclidean distance matrix between two sets of points $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y} \in \mathbb{R}^{m \times d}$ using the identity:
$$ \|\mathbf{x}_i - \mathbf{y}_j\|^2 = \|\mathbf{x}_i\|^2 + \|\mathbf{y}_j\|^2 - 2\mathbf{x}_i^T \mathbf{y}_j $$
Your implementation should use only matrix operations (no loops). Compare performance with a naive double-loop implementation for $n = m = 1000$, $d = 128$.
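One vectorized sketch built directly on the identity above; the clipping guards against tiny negative values caused by floating-point round-off, and the function name `pairwise_dist` is illustrative.

```python
import numpy as np

def pairwise_dist(X, Y):
    # ||x_i - y_j||^2 = ||x_i||^2 + ||y_j||^2 - 2 x_i . y_j
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2 * X @ Y.T
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))
Y = rng.standard_normal((1000, 128))
D = pairwise_dist(X, Y)
print(D.shape)   # (1000, 1000)
```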
Exercise 2.31 (Randomized SVD) Implement the randomized SVD algorithm:
- Generate a random Gaussian matrix $\boldsymbol{\Omega} \in \mathbb{R}^{n \times k}$.
- Form $\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}$.
- Compute the QR decomposition $\mathbf{Y} = \mathbf{Q}\mathbf{R}$.
- Form $\mathbf{B} = \mathbf{Q}^T \mathbf{A}$.
- Compute the SVD of the much smaller matrix $\mathbf{B} = \tilde{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^T$ and recover the approximate left singular vectors of $\mathbf{A}$ as $\mathbf{U} \approx \mathbf{Q}\tilde{\mathbf{U}}$.
Test on a $1000 \times 500$ matrix and compare accuracy and speed with np.linalg.svd.
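A minimal sketch of the steps above without oversampling or power iterations, which are common refinements; the function name `randomized_svd` and $k = 50$ are illustrative.

```python
import numpy as np

def randomized_svd(A, k, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    Omega = rng.standard_normal((A.shape[1], k))   # random test matrix
    Y = A @ Omega                                  # sample the range of A
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for that range
    B = Q.T @ A                                    # small k x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub, s, Vt                           # approximate U, Sigma, V^T of A

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 500))
U, s, Vt = randomized_svd(A, k=50, rng=rng)
s_exact = np.linalg.svd(A, compute_uv=False)
print(np.abs(s[:10] - s_exact[:10]))   # gap shrinks with oversampling / power steps
```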
Exercise 2.32 (Matrix Calculus Verification) For each of the following, compute the analytical gradient and verify it numerically using finite differences:
(a) $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$ where $\mathbf{a} = [1, 2, 3]^T$. (b) $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$ where $\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$. (c) $f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$ where $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$ and $\mathbf{b} = [1, 2, 3]^T$.
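A sketch of the numerical side of the check, using central differences; the example shown is part (b), where the analytical gradient is $(\mathbf{A} + \mathbf{A}^T)\mathbf{x}$, and the test point `x0` is arbitrary.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
f = lambda x: x @ A @ x
x0 = np.array([0.5, -1.0])
print(numerical_grad(f, x0), (A + A.T) @ x0)   # the two gradients should match
```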
Exercise 2.33 (Linear Layer Backpropagation) Implement forward and backward passes for a single linear layer $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$:
(a) Given $\mathbf{W} \in \mathbb{R}^{3 \times 2}$, $\mathbf{x} \in \mathbb{R}^2$, $\mathbf{b} \in \mathbb{R}^3$, and upstream gradient $\frac{\partial L}{\partial \mathbf{y}} \in \mathbb{R}^3$, compute $\frac{\partial L}{\partial \mathbf{W}}$, $\frac{\partial L}{\partial \mathbf{x}}$, and $\frac{\partial L}{\partial \mathbf{b}}$. (b) Verify your analytical gradients numerically. (c) Extend to a batch of inputs $\mathbf{X} \in \mathbb{R}^{B \times 2}$.
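A sketch of the single-sample case in part (a); the helper names are illustrative, and the batched version in part (c) replaces the outer product with $\mathbf{X}^T \frac{\partial L}{\partial \mathbf{Y}}$-style contractions.

```python
import numpy as np

def linear_forward(W, x, b):
    return W @ x + b

def linear_backward(W, x, dL_dy):
    # chain rule for y = W x + b with upstream gradient dL/dy
    dL_dW = np.outer(dL_dy, x)   # shape (3, 2), same as W
    dL_dx = W.T @ dL_dy          # shape (2,)
    dL_db = dL_dy                # shape (3,)
    return dL_dW, dL_dx, dL_db

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
b = rng.standard_normal(3)
dL_dy = rng.standard_normal(3)

print(linear_backward(W, x, dL_dy))
```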