Exercises — Chapter 1: Linear Algebra for Machine Learning
Computational Exercises
Exercise 1.1 — Rank and Linear Independence ★
Given the following matrix, determine its rank without using np.linalg.matrix_rank. Instead, compute the SVD and count the number of singular values above a reasonable threshold.
import numpy as np

A = np.array([
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 1, 1, 1],
    [3, 5, 7, 9],
    [0, 1, 2, 3],
])
a) What is the rank? How many of the rows are linearly independent?
b) Find a basis for the column space and the null space.
c) Verify the rank-nullity theorem: $\text{rank}(\mathbf{A}) + \dim(N(\mathbf{A})) = n$.
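A minimal sketch for part (a); the threshold rule below mirrors the convention `np.linalg.matrix_rank` uses (scaling by matrix size, largest singular value, and machine epsilon), but any tolerance well above machine precision works:

```python
import numpy as np

A = np.array([
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 1, 1, 1],
    [3, 5, 7, 9],
    [0, 1, 2, 3],
], dtype=float)

# Singular values only; numerical rank = count above the threshold.
s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]
rank = int(np.sum(s > tol))
```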
Exercise 1.2 — Eigendecomposition by Hand ★
Compute the eigenvalues and eigenvectors of:
$$\mathbf{A} = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}$$
a) Solve the characteristic equation $\det(\mathbf{A} - \lambda \mathbf{I}) = 0$.
b) For each eigenvalue, find the corresponding eigenvector by solving $(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = \mathbf{0}$.
c) Verify your answer with np.linalg.eig.
d) Write $\mathbf{A}$ in the form $\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{-1}$ and verify numerically.
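Parts (c) and (d) can be checked in a few lines. (The characteristic polynomial here is $\lambda^2 - 7\lambda + 10 = 0$; trace $= 7$, determinant $= 10$.)

```python
import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])

# Part (c): numpy's eigendecomposition.
eigvals, V = np.linalg.eig(A)

# Part (d): reassemble A = V Lambda V^{-1} and verify numerically.
Lam = np.diag(eigvals)
A_rebuilt = V @ Lam @ np.linalg.inv(V)
```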
Exercise 1.3 — Spectral Decomposition of a Covariance Matrix ★★
a) Generate a dataset of 1,000 points in $\mathbb{R}^3$ drawn from a multivariate normal distribution with mean $\boldsymbol{\mu} = [1, 2, 3]$ and covariance:
$$\mathbf{\Sigma} = \begin{bmatrix} 4 & 2 & 0.5 \\ 2 & 3 & 1 \\ 0.5 & 1 & 1 \end{bmatrix}$$
b) Compute the sample covariance matrix and its eigendecomposition.
c) Compare the eigenvalues of the sample covariance to the eigenvalues of the true covariance $\mathbf{\Sigma}$. How close are they?
d) Verify that the sample covariance is positive semi-definite (all eigenvalues $\geq 0$).
e) Explain what the eigenvectors represent geometrically about the distribution of the data.
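A sketch for parts (a)-(d); the seed is an arbitrary choice, and `eigh` is used because the covariance matrix is symmetric (real eigenvalues, returned in ascending order, orthonormal eigenvectors):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 1.0]])

# Part (a): 1,000 samples from N(mu, Sigma).
X = rng.multivariate_normal(mu, Sigma, size=1000)

# Part (b): sample covariance (rows of X are observations).
S = np.cov(X, rowvar=False)
evals_sample, evecs = np.linalg.eigh(S)

# Part (c): compare with the eigenvalues of the true covariance.
evals_true = np.linalg.eigvalsh(Sigma)
```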
Exercise 1.4 — Implementing PCA Two Ways ★★
Implement PCA using both methods from the chapter and verify they produce identical results (up to sign flips):
a) Method 1: Eigendecomposition of the covariance matrix.
b) Method 2: SVD of the centered data matrix.
c) Apply both to the Iris dataset (sklearn.datasets.load_iris). Reduce from 4 to 2 dimensions. Verify the projected data is the same (up to sign).
d) Compute the explained variance ratio using both methods. They should agree.
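A dependency-free sketch of the two methods and the sign-flip check; random data stands in for the Iris matrix (an assumption to keep the snippet self-contained), and the identical code applies once you substitute `load_iris().data`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))        # stand-in for the Iris features
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Method 1: eigendecomposition of the covariance matrix.
C = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]      # descending variance
W_eig = evecs[:, order[:2]]

# Method 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:2].T

# The components agree up to one sign flip per column.
signs = np.sign(np.sum(W_eig * W_svd, axis=0))
assert np.allclose(W_eig, W_svd * signs)

# Part (d): eigenvalues from singular values, lambda_i = s_i^2 / (n-1).
var_svd = s**2 / (n - 1)
```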
Exercise 1.5 — SVD from Scratch ★★★
Implement SVD without calling np.linalg.svd. Use only eigendecomposition:
def svd_from_eigen(A: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute SVD using eigendecomposition of A^T A and A A^T.

    Args:
        A: Matrix of shape (m, n).

    Returns:
        Tuple of (U, sigma, Vt) matching numpy's convention.
    """
    # Your implementation here
    pass
a) Compute $\mathbf{A}^\top \mathbf{A}$ and find its eigendecomposition to get $\mathbf{V}$ and the singular values.
b) Compute $\mathbf{U}$ from the relation $\mathbf{u}_i = \frac{1}{\sigma_i}\mathbf{A}\mathbf{v}_i$.
c) Test on several matrices and compare with np.linalg.svd.
d) Where does your implementation fail numerically? What happens with very small singular values?
Exercise 1.6 — The Four Fundamental Subspaces ★★
For the matrix:
$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 3 \\ 3 & 6 & 4 \end{bmatrix}$$
a) Compute the SVD.
b) Using the SVD, find an orthonormal basis for each of the four fundamental subspaces.
c) Verify the orthogonality relationships: column space $\perp$ left null space, row space $\perp$ null space.
d) What is the rank? What do the dimensions of the null spaces tell you about the system $\mathbf{Ax} = \mathbf{b}$?
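One way to organize parts (a)-(c): with $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ of rank $r$, the first $r$ columns of $\mathbf{U}$ span the column space, the rest span the left null space, and likewise for $\mathbf{V}$ with the row space and null space:

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 3.0],
              [3.0, 6.0, 4.0]])

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))     # numerical rank

col_space  = U[:, :r]        # orthonormal basis of C(A)
left_null  = U[:, r:]        # basis of N(A^T)
row_space  = Vt[:r].T        # basis of C(A^T)
null_space = Vt[r:].T        # basis of N(A)

# Part (c): the complementary subspaces are orthogonal.
assert np.allclose(col_space.T @ left_null, 0, atol=1e-10)
assert np.allclose(row_space.T @ null_space, 0, atol=1e-10)
```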
Exercise 1.7 — Gradient Verification ★★
Implement a gradient checker that compares analytical gradients to numerical (finite difference) gradients.
def check_gradient(
    f: callable,
    grad_f: callable,
    x: np.ndarray,
    eps: float = 1e-5,
    tol: float = 1e-5,
) -> bool:
    """Check analytical gradient against numerical gradient.

    Args:
        f: Scalar function f(x).
        grad_f: Gradient function returning ∇f(x).
        x: Point at which to check.
        eps: Step size for finite differences.
        tol: Tolerance for agreement.

    Returns:
        True if gradients agree within tolerance.
    """
    # Your implementation here
    pass
Use it to verify the gradients for:
a) $f(\mathbf{x}) = \|\mathbf{Ax} - \mathbf{b}\|_2^2$
b) $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$ (for symmetric $\mathbf{A}$)
c) $f(\mathbf{x}) = \log(\mathbf{1}^\top \exp(\mathbf{x}))$ (log-sum-exp)
d) $f(\mathbf{W}) = \|\mathbf{XW} - \mathbf{Y}\|_F^2$ (gradient with respect to a matrix)
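Before writing `check_gradient`, it can help to see the finite-difference recipe on part (a) alone. This sketch uses central differences, $\frac{f(\mathbf{x} + \epsilon\mathbf{e}_i) - f(\mathbf{x} - \epsilon\mathbf{e}_i)}{2\epsilon}$, which are more accurate than forward differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)

f      = lambda x: np.sum((A @ x - b)**2)
grad_f = lambda x: 2 * A.T @ (A @ x - b)     # analytical gradient for (a)

eps = 1e-5
num = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    num[i] = (f(x + e) - f(x - e)) / (2 * eps)
```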
Exercise 1.8 — Jacobian of the Softmax Function ★★★
The softmax function $\mathbf{s}: \mathbb{R}^n \to \mathbb{R}^n$ is defined by $s_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_j e^{z_j}}$.
a) Derive the Jacobian matrix analytically. Show that:
$$\frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j)$$
where $\delta_{ij}$ is the Kronecker delta.
b) Express the Jacobian in matrix form: $\mathbf{J} = \text{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$.
c) Implement both the analytical Jacobian and a numerical Jacobian (via finite differences). Verify they agree.
d) What are the singular values of the softmax Jacobian? What does this tell you about gradient flow through softmax layers?
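A sketch of parts (b) and (c); the max-shift inside `softmax` is the standard numerical-stability trick, and the final assertion previews part (d): the constant vector is in the null space of the Jacobian, so it is always singular:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
s = softmax(z)

# Part (b): analytical Jacobian J = diag(s) - s s^T.
J = np.diag(s) - np.outer(s, s)

# Part (c): numerical Jacobian via central differences.
eps = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    e = np.zeros_like(z)
    e[j] = eps
    J_num[:, j] = (softmax(z + e) - softmax(z - e)) / (2 * eps)

# Part (d) preview: J is singular, since J @ 1 = s - s (sum s) = 0.
assert np.allclose(J @ np.ones(z.size), 0, atol=1e-12)
```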
Exercise 1.9 — Low-Rank Approximation for Image Compression ★★
a) Load a grayscale image as a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$.
b) Compute the SVD and plot the singular value spectrum.
c) Reconstruct the image at ranks $k = 1, 5, 10, 20, 50, 100$.
d) For each rank $k$, compute the compression ratio: $\frac{k(m + n + 1)}{mn}$ (storing $\mathbf{U}_k$, $\boldsymbol{\sigma}_k$, and $\mathbf{V}_k^\top$ requires $k(m + n + 1)$ numbers instead of $mn$).
e) At what rank does the reconstruction become visually acceptable? What is the compression ratio at that rank?
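The reconstruction loop can be sketched as follows; a random matrix stands in for the image (an assumption, so the snippet runs without an image file), and the Eckart-Young theorem guarantees the truncated SVD is the best rank-$k$ approximation in Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(120, 100))      # stand-in for a grayscale image matrix
m, n = A.shape

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(k):
    """Best rank-k approximation in Frobenius norm (Eckart-Young)."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

for k in [1, 5, 10, 20, 50]:
    A_k = truncate(k)
    rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
    ratio = k * (m + n + 1) / (m * n)    # part (d): storage ratio
```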
Exercise 1.10 — Matrix Factorization with L2 Regularization ★★
Starting from the matrix factorization loss for recommender systems:
$$L(\mathbf{P}, \mathbf{Q}) = \sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{p}_i^\top \mathbf{q}_j)^2 + \lambda(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2)$$
a) Derive $\frac{\partial L}{\partial \mathbf{p}_i}$ and $\frac{\partial L}{\partial \mathbf{q}_j}$.
b) Implement a gradient descent optimizer for this loss.
c) Apply it to the StreamRec data from the progressive project (M0). Compare the results to the SVD-based approach.
d) Experiment with different values of $\lambda$. Plot training and held-out RMSE vs. $\lambda$.
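A sketch of parts (a)-(b) on synthetic ratings (a stand-in for the StreamRec data, which is not reproduced here); the gradients in the comments are the answer to part (a), and the step size and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 15, 3
lam, lr = 0.01, 0.005

# Synthetic rank-3 ratings; Omega is a boolean mask of observed entries.
R = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))
mask = rng.random(R.shape) < 0.5

P = 0.1 * rng.normal(size=(n_users, k))
Q = 0.1 * rng.normal(size=(n_items, k))

def loss(P, Q):
    return np.sum((mask * (R - P @ Q.T))**2) + lam * (np.sum(P**2) + np.sum(Q**2))

loss0 = loss(P, Q)
for _ in range(3000):
    E = mask * (R - P @ Q.T)        # residual on observed entries only
    # Part (a): dL/dP = -2 E Q + 2 lam P,  dL/dQ = -2 E^T P + 2 lam Q.
    P -= lr * (-2 * E @ Q + 2 * lam * P)
    Q -= lr * (-2 * E.T @ P + 2 * lam * Q)
```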
Conceptual Exercises
Exercise 1.11 — Why SVD Reveals the Rank ★
Explain in your own words (2-3 paragraphs) why the number of nonzero singular values equals the rank of a matrix. Your explanation should reference:
a) The relationship between singular values and the eigenvalues of $\mathbf{A}^\top \mathbf{A}$.
b) The connection between the rank of $\mathbf{A}$ and the rank of $\mathbf{A}^\top \mathbf{A}$.
c) Why the null space of $\mathbf{A}$ corresponds to zero singular values.
Exercise 1.12 — Eigenvalues of the Covariance Matrix ★
A covariance matrix $\mathbf{C} \in \mathbb{R}^{d \times d}$ has eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$.
a) Explain why all eigenvalues are non-negative.
b) What does it mean if $\lambda_d = 0$? What does it imply about the data?
c) If $\lambda_1 = 100$ and $\lambda_2 = 1$, what does this tell you about the geometry of the data? How many principal components would you retain?
d) If all eigenvalues are equal ($\lambda_1 = \lambda_2 = \cdots = \lambda_d$), what does this imply? Is PCA useful in this case?
Exercise 1.13 — The Trace Trick ★★
The trace has the cyclic property: $\text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA})$.
a) Prove this property for the case of two matrices: $\text{tr}(\mathbf{AB}) = \text{tr}(\mathbf{BA})$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$.
b) Use the trace trick to show that $\|\mathbf{A}\|_F^2 = \text{tr}(\mathbf{A}^\top \mathbf{A}) = \text{tr}(\mathbf{A}\mathbf{A}^\top)$.
c) Use the trace trick to simplify the expression $\mathbf{x}^\top \mathbf{A} \mathbf{x}$ as $\text{tr}(\mathbf{A}\mathbf{x}\mathbf{x}^\top)$. Why is this form useful in deriving matrix calculus identities?
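The three identities are worth confirming numerically before proving them; note in the first check that $\mathbf{AB}$ and $\mathbf{BA}$ have different shapes yet equal traces:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))
B = rng.normal(size=(6, 4))
x = rng.normal(size=4)
M = rng.normal(size=(4, 4))

# Part (a): tr(AB) = tr(BA), shapes (4,4) vs (6,6).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Part (b): Frobenius norm via the trace.
assert np.isclose(np.linalg.norm(A, 'fro')**2, np.trace(A.T @ A))
assert np.isclose(np.trace(A.T @ A), np.trace(A @ A.T))

# Part (c): quadratic form as a trace, x^T M x = tr(M x x^T).
assert np.isclose(x @ M @ x, np.trace(M @ np.outer(x, x)))
```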
Exercise 1.14 — Positive Definiteness in ML ★★
a) Prove that $\mathbf{X}^\top \mathbf{X}$ is always positive semi-definite for any $\mathbf{X}$.
b) When is $\mathbf{X}^\top \mathbf{X}$ strictly positive definite? State the condition in terms of the columns of $\mathbf{X}$.
c) In ridge regression, we solve $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})\mathbf{w} = \mathbf{X}^\top \mathbf{y}$. Prove that $\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ is positive definite for any $\lambda > 0$, even when $\mathbf{X}^\top \mathbf{X}$ is singular. (Hint: consider the eigenvalues.)
d) Explain the connection between part (c) and numerical stability.
Exercise 1.15 — Null Space and Regularization ★★
Consider a linear regression problem where $\mathbf{X} \in \mathbb{R}^{100 \times 200}$ (more features than samples).
a) What is the minimum dimension of $N(\mathbf{X})$?
b) If $\mathbf{w}^*$ is a solution to $\mathbf{Xw} = \mathbf{y}$ and $\mathbf{n} \in N(\mathbf{X})$, show that $\mathbf{w}^* + \mathbf{n}$ is also a solution.
c) Explain why L2 regularization selects the minimum-norm solution from this infinite family. (Hint: the regularizer penalizes $\|\mathbf{w}\|_2^2$, and the minimum-norm solution lies in the row space of $\mathbf{X}$.)
d) Connect this to the pseudoinverse: the minimum-norm least-squares solution is $\mathbf{w}^* = \mathbf{X}^+ \mathbf{y}$, where $\mathbf{X}^+ = \mathbf{V}\mathbf{\Sigma}^+\mathbf{U}^\top$ is the Moore-Penrose pseudoinverse computed from the SVD.
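Parts (b) and (d) can be verified numerically: any null-space vector added to the pseudoinverse solution still fits the data exactly but strictly increases the norm, which is exactly what L2 regularization penalizes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))      # more features than samples
y = rng.normal(size=100)

# Part (d): minimum-norm solution via the pseudoinverse.
w_star = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_star, y)

# Part (b): add a null-space direction (last right singular vector).
_, _, Vt = np.linalg.svd(X)
n_vec = Vt[-1]
assert np.allclose(X @ n_vec, 0, atol=1e-8)

w_alt = w_star + 5.0 * n_vec         # still an exact solution...
assert np.allclose(X @ w_alt, y)
```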
Applied Exercises
Exercise 1.16 — SVD for Denoising ★★
a) Generate a clean low-rank matrix $\mathbf{A}_0 \in \mathbb{R}^{100 \times 80}$ of rank 5, then add Gaussian noise: $\mathbf{A} = \mathbf{A}_0 + \sigma \mathbf{N}$ where $N_{ij} \sim \mathcal{N}(0, 1)$.
b) For noise levels $\sigma \in \{0.1, 0.5, 1.0, 2.0\}$, find the optimal rank $k^*$ that minimizes $\|\mathbf{A}_k - \mathbf{A}_0\|_F$ (reconstruction error against the clean matrix, not the noisy one).
c) Is $k^*$ always equal to the true rank (5)? When is it not, and why?
d) Plot the singular value spectrum for each noise level. How does noise affect the spectrum?
Exercise 1.17 — MovieLens Matrix Factorization ★★★
Download the MovieLens 100K dataset (100,000 ratings, 943 users, 1,682 movies).
a) Construct the user-item rating matrix. What is its sparsity?
b) Compute the SVD and plot the singular value spectrum.
c) Implement SVD-based collaborative filtering: predict held-out ratings using the rank-$k$ approximation. Use 80/20 train/test split.
d) Plot RMSE vs. rank $k$ for both training and test sets. Identify the optimal rank.
e) For a specific user, examine the top-5 predicted items. Do they align with the user's known preferences?
Exercise 1.18 — Hessian Analysis of Logistic Regression ★★★
For logistic regression with loss $L(\mathbf{w}) = -\sum_{i=1}^n [y_i \log(\sigma(\mathbf{x}_i^\top \mathbf{w})) + (1-y_i)\log(1 - \sigma(\mathbf{x}_i^\top \mathbf{w}))]$:
a) Derive the Hessian $\mathbf{H} = \nabla^2 L = \mathbf{X}^\top \mathbf{D} \mathbf{X}$, where $\mathbf{D} = \text{diag}(\sigma_i(1-\sigma_i))$ and $\sigma_i = \sigma(\mathbf{x}_i^\top \mathbf{w})$.
b) Prove that $\mathbf{H}$ is positive semi-definite for all $\mathbf{w}$. (This proves logistic regression loss is convex.)
c) Compute the eigenvalues of the Hessian at the optimum for a dataset of your choice. What do they tell you about the conditioning of the problem?
d) How does the condition number of the Hessian relate to the convergence speed of gradient descent?
Exercise 1.19 — Tensor Decomposition for Time Series ★★★
Construct a 3-tensor $\mathcal{T} \in \mathbb{R}^{24 \times 7 \times 52}$ representing electricity consumption: 24 hours × 7 days of the week × 52 weeks.
a) Generate synthetic data with interpretable patterns: a daily cycle (peak at noon), a weekly cycle (lower on weekends), and a seasonal trend (higher in winter/summer).
b) Unfold the tensor along each mode (hours, days, weeks) to get three matrices. Compute PCA on each unfolding.
c) Interpret the first few principal components of each unfolding. Do they correspond to the patterns you embedded?
d) Compare to a CP decomposition (you may use tensorly). How do the rank-1 components relate to the PCA results?
Exercise 1.20 — Condition Number and Numerical Stability ★★
a) Construct the Hilbert matrix $H_{ij} = \frac{1}{i + j - 1}$ for $n = 5, 10, 15, 20$.
b) Compute the condition number $\kappa(\mathbf{H}) = \sigma_{\max} / \sigma_{\min}$ for each size. How fast does it grow?
c) Solve $\mathbf{H}\mathbf{x} = \mathbf{b}$ for a known $\mathbf{x}$ (choose $\mathbf{x}$, compute $\mathbf{b} = \mathbf{H}\mathbf{x}$, then solve). Measure the relative error $\|\hat{\mathbf{x}} - \mathbf{x}\|_2 / \|\mathbf{x}\|_2$.
d) At what size $n$ does the solution become unreliable? Connect this to the rule of thumb: you lose roughly $\log_{10}(\kappa)$ digits of accuracy.
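A sketch of parts (a)-(c); it stops at $n = 13$ because by then $\kappa$ already approaches $1/\epsilon_{\text{machine}}$ in float64 (for $n = 15, 20$ the same code runs but the solve output is essentially noise, which is the point of part (d)):

```python
import numpy as np

kappas, rel_errs = [], []
for n in [5, 10, 13]:
    # 0-based indices i, j, so H_ij = 1/(i + j - 1) in 1-based notation.
    i, j = np.indices((n, n))
    H = 1.0 / (i + j + 1.0)

    s = np.linalg.svd(H, compute_uv=False)
    kappas.append(s[0] / s[-1])

    # Part (c): solve against a known x and measure the relative error.
    x = np.ones(n)
    b = H @ x
    x_hat = np.linalg.solve(H, b)
    rel_errs.append(np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```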
Exercise 1.21 — Orthogonal Procrustes Problem ★★★
Given two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times d}$, find the orthogonal matrix $\mathbf{Q}$ that minimizes $\|\mathbf{A}\mathbf{Q} - \mathbf{B}\|_F$.
a) Show that the solution is $\mathbf{Q} = \mathbf{U}\mathbf{V}^\top$, where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top = \text{SVD}(\mathbf{A}^\top \mathbf{B})$.
b) Implement this and test it on two sets of 2D points related by a known rotation.
c) This problem arises in aligning word embedding spaces across languages. Generate two sets of word embeddings (synthetic) related by a rotation plus noise, and recover the rotation.
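A sketch of part (b): with $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top = \text{SVD}(\mathbf{A}^\top\mathbf{B})$, the minimizer of $\|\mathbf{A}\mathbf{Q} - \mathbf{B}\|_F$ over orthogonal $\mathbf{Q}$ is $\mathbf{Q} = \mathbf{U}\mathbf{V}^\top$, and in the noiseless case it recovers the planted rotation exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))

theta = 0.7                          # known rotation to recover
Q_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
B = A @ Q_true

# Procrustes solution from the SVD of A^T B.
U, _, Vt = np.linalg.svd(A.T @ B)
Q = U @ Vt
```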
Research-Level Exercises
Exercise 1.22 — Randomized SVD ★★★★
The full SVD of an $m \times n$ matrix costs $O(\min(m, n)^2 \cdot \max(m, n))$. For large matrices where only the top $k$ singular values/vectors are needed, randomized algorithms are far faster.
Implement the randomized SVD algorithm from Halko, Martinsson, and Tropp (2011):
def randomized_svd(
    A: np.ndarray,
    k: int,
    n_oversamples: int = 10,
    n_power_iterations: int = 2,
    seed: int = 42,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute an approximate rank-k SVD using randomized methods.

    Algorithm:
        1. Draw a random Gaussian matrix Omega of shape (n, k + n_oversamples)
        2. Form Y = A @ Omega
        3. (Optional) Power iteration: Y = (A A^T)^q A Omega for improved accuracy
        4. Compute QR factorization: Y = Q R
        5. Form B = Q^T A (small matrix)
        6. Compute SVD of B: B = U_hat Sigma V^T
        7. Set U = Q U_hat

    Args:
        A: Matrix of shape (m, n).
        k: Target rank.
        n_oversamples: Oversampling parameter (typically 5-10).
        n_power_iterations: Number of power iterations (0-3).
        seed: Random seed.

    Returns:
        Approximate rank-k SVD: (U_k, sigma_k, Vt_k).
    """
    # Your implementation here
    pass
a) Implement the algorithm.
b) Test on a $10000 \times 5000$ matrix with known rank-50 structure plus noise. Compare the accuracy and runtime to the exact SVD (np.linalg.svd).
c) How does the number of power iterations affect accuracy? Plot the relative error $\|\mathbf{A}_k^{\text{rand}} - \mathbf{A}_k^{\text{exact}}\|_F / \|\mathbf{A}_k^{\text{exact}}\|_F$ for 0, 1, 2, 3 power iterations.
d) Analyze the computational cost. What is the theoretical complexity?
e) Compare your implementation to sklearn.utils.extmath.randomized_svd. Are the results comparable?
Exercise 1.23 — The Johnson-Lindenstrauss Lemma ★★★★
The Johnson-Lindenstrauss (JL) lemma states that $n$ points in $\mathbb{R}^d$ can be embedded into $\mathbb{R}^k$ (with $k = O(\epsilon^{-2} \log n)$) while approximately preserving all pairwise distances (within a factor of $1 \pm \epsilon$).
a) Implement the random projection: $\mathbf{X}_{\text{proj}} = \mathbf{X} \mathbf{R}$, where $R_{ij} \sim \mathcal{N}(0, 1/k)$.
b) Generate $n = 1000$ points in $\mathbb{R}^{500}$. For $\epsilon = 0.1$, compute the theoretical $k$ from the JL bound: $k \geq \frac{8 \ln n}{\epsilon^2}$.
c) Project the data and verify empirically that all pairwise distances are preserved within the $1 \pm \epsilon$ bound. What fraction of pairs violate the bound?
d) How does this relate to SVD-based dimensionality reduction? Compare the distance preservation of random projection vs. truncated SVD at the same target dimension.
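A sketch of parts (a)-(c); the pair sampling is one illustrative choice (disjoint pairs of distinct points, so no zero distances appear in the denominator). Note that for these particular $n$, $d$, $\epsilon$ the JL bound gives $k > d$, which part (d)'s comparison is designed to surface:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 500, 0.1

# Part (b): the JL bound. For these values k actually exceeds d = 500.
k = int(np.ceil(8 * np.log(n) / eps**2))

# Part (a): Gaussian random projection.
X = rng.normal(size=(n, d))
R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
Xp = X @ R

# Part (c): distortion on 200 disjoint pairs of distinct points.
pairs = rng.permutation(n)[:400].reshape(200, 2)
d_orig = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
d_proj = np.linalg.norm(Xp[pairs[:, 0]] - Xp[pairs[:, 1]], axis=1)
ratio = d_proj / d_orig
frac_violations = np.mean((ratio < 1 - eps) | (ratio > 1 + eps))
```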
Exercise 1.24 — Nyström Approximation for Kernel Matrices ★★★★
For a kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ (where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ for some kernel function $k$), computing the full eigendecomposition costs $O(n^3)$. The Nyström method provides a low-rank approximation using only a subset of columns.
a) Sample $m \ll n$ landmark points. Compute $\mathbf{K}_{mm} \in \mathbb{R}^{m \times m}$ (kernel between landmarks) and $\mathbf{K}_{nm} \in \mathbb{R}^{n \times m}$ (kernel between all points and landmarks).
b) The Nyström approximation is $\tilde{\mathbf{K}} = \mathbf{K}_{nm} \mathbf{K}_{mm}^{-1} \mathbf{K}_{nm}^\top$. Implement this.
c) Compare the approximation quality (Frobenius norm error) as a function of $m$ for an RBF kernel matrix.
d) How does the choice of landmark points affect accuracy? Compare uniform random sampling vs. k-means centroids.
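A small-scale sketch of parts (a)-(b); the data, kernel bandwidth, and landmark count are illustrative, and `pinv` is used in place of the inverse to guard against an ill-conditioned $\mathbf{K}_{mm}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 300, 40
X = rng.normal(size=(n, 2))

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)                        # full n x n kernel (for comparison)

# Part (a): m uniformly sampled landmarks.
land = rng.choice(n, size=m, replace=False)
K_nm = rbf(X, X[land])
K_mm = rbf(X[land], X[land])

# Part (b): Nystrom approximation.
K_tilde = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T
rel_err = np.linalg.norm(K - K_tilde) / np.linalg.norm(K)
```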
Exercise 1.25 — Matrix Calculus and the Chain Rule for Neural Networks ★★★
Consider a two-layer network: $\mathbf{h} = \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$, $\hat{y} = \mathbf{w}_2^\top \mathbf{h} + b_2$, with MSE loss $L = \frac{1}{2}(\hat{y} - y)^2$.
a) Derive $\frac{\partial L}{\partial \mathbf{w}_2}$, $\frac{\partial L}{\partial b_2}$, $\frac{\partial L}{\partial \mathbf{W}_1}$, $\frac{\partial L}{\partial \mathbf{b}_1}$ by hand, for ReLU activation $\phi$.
b) Implement the forward and backward passes in numpy (no autograd).
c) Verify your gradients using the gradient checker from Exercise 1.7.
d) Train the network on a simple regression task (e.g., $y = \sin(x)$). Plot the loss curve.
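A single-sample sketch of parts (a)-(c); the shapes and initialization scale are illustrative choices, and the final spot-check applies the central-difference idea from Exercise 1.7 to one entry of $\mathbf{W}_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 4
W1 = 0.5 * rng.normal(size=(h, d)); b1 = np.zeros(h)
w2 = 0.5 * rng.normal(size=h);      b2 = 0.0
x = rng.normal(size=d); y = 1.0

# Forward pass.
a = W1 @ x + b1
hvec = np.maximum(a, 0)              # ReLU
y_hat = w2 @ hvec + b2
L = 0.5 * (y_hat - y)**2

# Backward pass (part a, via the chain rule).
dy = y_hat - y                       # dL/dy_hat
dw2 = dy * hvec
db2 = dy
dh = dy * w2
da = dh * (a > 0)                    # ReLU derivative
dW1 = np.outer(da, x)
db1 = da

# Spot-check one entry of dW1 with central differences.
f = lambda W: 0.5 * (w2 @ np.maximum(W @ x + b1, 0) + b2 - y)**2
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (f(W1p) - f(W1m)) / (2 * eps)
```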
Exercise 1.26 — Condition Number and Learning Rate ★★★
The maximum stable learning rate for gradient descent on a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top \mathbf{A} \mathbf{x} - \mathbf{b}^\top \mathbf{x}$ is $\eta_{\max} = \frac{2}{\lambda_{\max}(\mathbf{A})}$, and the optimal rate is $\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}}$.
a) For a symmetric positive definite $\mathbf{A}$ with condition number $\kappa = 100$, implement gradient descent and plot convergence for learning rates $\eta = 0.5\eta^*$, $\eta^*$, $1.5\eta^*$, and $\eta_{\max}$.
b) How does the number of iterations to convergence relate to $\kappa$? (Theory says $O(\kappa \ln(1/\epsilon))$ iterations.)
c) Repeat for $\kappa = 10, 100, 1000, 10000$. Plot iterations-to-convergence vs. $\kappa$.
d) This is why preconditioning (and Adam) matter: they effectively reduce the condition number. Explain this in 2-3 sentences.
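A sketch for parts (b)-(c) that counts iterations instead of plotting; diagonalizing costs nothing here, so $\mathbf{A}$ is taken diagonal with eigenvalues $1$ and $\kappa$ (an assumption that loses no generality for gradient descent on a quadratic):

```python
import numpy as np

def gd_iterations(kappa, tol=1e-8, max_iter=200_000):
    """Iterations for GD to reach ||x|| < tol on the quadratic
    0.5 x^T A x with A = diag(1, kappa), using the optimal step."""
    lam = np.array([1.0, kappa])
    eta = 2.0 / (lam.min() + lam.max())    # eta* from the statement
    x = np.ones(2)
    for t in range(max_iter):
        x = x - eta * lam * x              # gradient of 0.5 x^T A x is A x
        if np.linalg.norm(x) < tol:
            return t + 1
    return max_iter

iters = [gd_iterations(k) for k in [10, 100, 1000]]
```

Per coordinate the update multiplies by $1 - \eta\lambda_i$, so the contraction factor is $(\kappa - 1)/(\kappa + 1)$ and the iteration count grows roughly linearly in $\kappa$, matching the $O(\kappa \ln(1/\epsilon))$ theory in part (b).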
Exercise 1.27 — Kronecker-Factored Fisher for a Linear Layer ★★★★
For a single linear layer $\mathbf{z} = \mathbf{W}\mathbf{x}$ with loss $L$:
a) Show that the gradient is $\frac{\partial L}{\partial \text{vec}(\mathbf{W})} = \mathbf{x} \otimes \frac{\partial L}{\partial \mathbf{z}}$ (using the Kronecker product).
b) The Fisher information matrix is $\mathbf{F} = \mathbb{E}[\text{vec}(\nabla L) \text{vec}(\nabla L)^\top]$. Show that if inputs $\mathbf{x}$ and output gradients $\frac{\partial L}{\partial \mathbf{z}}$ are independent, then $\mathbf{F} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] \otimes \mathbb{E}\left[\frac{\partial L}{\partial \mathbf{z}}\left(\frac{\partial L}{\partial \mathbf{z}}\right)^\top\right]$.
c) Explain why this Kronecker factorization makes the natural gradient (which requires $\mathbf{F}^{-1}$) computationally tractable.
d) Estimate the computational savings: if $\mathbf{W} \in \mathbb{R}^{1000 \times 1000}$, what is the cost of inverting the full Fisher vs. the Kronecker-factored Fisher?
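Part (a) can be checked numerically; with the column-major (Fortran-order) vectorization convention, $\text{vec}(\mathbf{g}\mathbf{x}^\top) = \mathbf{x} \otimes \mathbf{g}$, where $\mathbf{g}$ stands in for $\partial L / \partial \mathbf{z}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
x = rng.normal(size=n)
g = rng.normal(size=m)               # stands in for dL/dz

# dL/dW = g x^T, so its vectorization is a Kronecker product.
dW = np.outer(g, x)
vecdW = dW.reshape(-1, order='F')    # column-major vec

# Part (d) in numbers: for W in R^{1000x1000} the full Fisher is
# 10^6 x 10^6, costing ~(10^6)^3 flops to invert, versus two
# 1000^3 inversions for the Kronecker factors.
```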
Exercise 1.28 — The SVD and the Pseudoinverse ★★
The Moore-Penrose pseudoinverse $\mathbf{A}^+$ of a matrix $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ is defined as $\mathbf{A}^+ = \mathbf{V}\mathbf{\Sigma}^+\mathbf{U}^\top$, where $\Sigma^+_{ii} = 1/\sigma_i$ for nonzero singular values and 0 otherwise.
a) Implement the pseudoinverse using the SVD.
b) Verify that $\mathbf{A}^+$ satisfies the four Moore-Penrose conditions:
- $\mathbf{A}\mathbf{A}^+\mathbf{A} = \mathbf{A}$
- $\mathbf{A}^+\mathbf{A}\mathbf{A}^+ = \mathbf{A}^+$
- $(\mathbf{A}\mathbf{A}^+)^\top = \mathbf{A}\mathbf{A}^+$
- $(\mathbf{A}^+\mathbf{A})^\top = \mathbf{A}^+\mathbf{A}$
c) Show that for an overdetermined system ($m > n$, full rank), $\mathbf{A}^+\mathbf{b}$ gives the least-squares solution. For an underdetermined system ($m < n$), it gives the minimum-norm solution.
d) Compare np.linalg.pinv to your implementation.
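A sketch of parts (a), (b), and (d), tested on a deliberately rank-deficient matrix; the relative cutoff `rtol` is an illustrative choice for deciding which singular values count as zero:

```python
import numpy as np

def pinv_svd(A: np.ndarray, rtol: float = 1e-10) -> np.ndarray:
    """Moore-Penrose pseudoinverse via the SVD: A^+ = V Sigma^+ U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rtol * s[0]
    # Invert only singular values above the cutoff; zero the rest.
    s_inv = np.where(s > cutoff, 1.0 / np.maximum(s, cutoff), 0.0)
    return (Vt.T * s_inv) @ U.T

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 5))   # 6x5, rank 3
Ap = pinv_svd(A)

# Part (b): the four Moore-Penrose conditions.
assert np.allclose(A @ Ap @ A, A)
assert np.allclose(Ap @ A @ Ap, Ap)
assert np.allclose((A @ Ap).T, A @ Ap)
assert np.allclose((Ap @ A).T, Ap @ A)
```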
Exercise 1.29 — Spectral Clustering ★★★
Spectral clustering uses the eigenvectors of a graph Laplacian matrix.
a) Given $n = 200$ points generated from three well-separated clusters in $\mathbb{R}^2$, construct the similarity matrix $\mathbf{W}$ using a Gaussian kernel: $W_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2))$.
b) Compute the degree matrix $\mathbf{D} = \text{diag}(\mathbf{W}\mathbf{1})$ and the normalized Laplacian $\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$.
c) Compute the bottom 3 eigenvectors of $\mathbf{L}_{\text{sym}}$ (those with smallest eigenvalues).
d) Apply k-means to the rows of the eigenvector matrix. Compare the clustering to k-means applied directly in the original space.
e) Plot the eigenvalue spectrum. Why does the number of near-zero eigenvalues indicate the number of clusters?
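A sketch of parts (a)-(c) and the eigenvalue count from part (e); the cluster centers, spread, and kernel bandwidth are illustrative choices made so the three clusters are nearly disconnected in the similarity graph:

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [4.0, 8.0]])
X = np.vstack([c + 0.4 * rng.normal(size=(60, 2)) for c in centers])

# Part (a): Gaussian similarity matrix (sigma = 1).
d2 = ((X[:, None] - X[None])**2).sum(-1)
W = np.exp(-d2 / 2.0)

# Part (b): normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}.
d_isqrt = 1.0 / np.sqrt(W.sum(axis=1))
L_sym = np.eye(len(X)) - d_isqrt[:, None] * W * d_isqrt[None, :]

# Part (c): bottom-3 eigenvectors (eigh returns ascending eigenvalues).
evals, evecs = np.linalg.eigh(L_sym)
U3 = evecs[:, :3]                    # rows of U3 feed k-means in part (d)

# Part (e): near-zero eigenvalues count the (near-)connected components.
n_near_zero = int(np.sum(evals < 1e-2))
```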
Exercise 1.30 — Matrix Completion via Nuclear Norm Minimization ★★★★
The nuclear norm $\|\mathbf{X}\|_* = \sum_i \sigma_i(\mathbf{X})$ is the convex envelope of the rank function on the unit ball of the spectral norm. Minimizing the nuclear norm subject to matching observed entries provides the tightest convex relaxation of the rank minimization problem.
a) Implement the proximal operator for the nuclear norm: $\text{prox}_{\tau \|\cdot\|_*}(\mathbf{Y}) = \mathbf{U} \text{diag}(\max(\sigma_i - \tau, 0)) \mathbf{V}^\top$ (singular value thresholding).
b) Implement the Iterative Shrinkage-Thresholding Algorithm (ISTA) for the problem:
$$\min_{\mathbf{X}} \frac{1}{2}\|\mathcal{P}_\Omega(\mathbf{X}) - \mathcal{P}_\Omega(\mathbf{M})\|_F^2 + \lambda \|\mathbf{X}\|_*$$
where $\mathcal{P}_\Omega$ is the projection onto observed entries.
c) Apply to the StreamRec synthetic data from the progressive project. Compare to the direct SVD approach in terms of reconstruction quality on held-out entries.
d) How does the choice of $\lambda$ affect the rank of the solution? Plot rank vs. $\lambda$.
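A sketch of part (a): singular value thresholding applies soft-thresholding $\max(\sigma_i - \tau, 0)$ to the spectrum, so the output's singular values are exactly the shrunken ones and its rank can only decrease:

```python
import numpy as np

def svt(Y: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of tau * nuclear norm (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)     # soft-threshold the spectrum
    return (U * s_thr) @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 6))
Z = svt(Y, tau=1.0)

s_Y = np.linalg.svd(Y, compute_uv=False)
s_Z = np.linalg.svd(Z, compute_uv=False)
```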
Exercises for Chapter 1 of Advanced Data Science