Quiz — Chapter 1: Linear Algebra for Machine Learning

Test your understanding of the chapter's core concepts. Attempt each question before revealing the answer.


Question 1 (Multiple Choice)

The rank of a matrix $\mathbf{A} \in \mathbb{R}^{100 \times 50}$ can be at most:

a) 100 b) 50 c) 5000 d) 150

Answer **b) 50.** The rank of any matrix is at most $\min(m, n)$, where $m$ and $n$ are the number of rows and columns. For a $100 \times 50$ matrix, the maximum rank is 50.
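A quick NumPy sanity check (an illustrative sketch): a random Gaussian matrix of this shape has full rank $\min(m, n)$ with probability 1.

```python
import numpy as np

# A random 100x50 Gaussian matrix has rank min(100, 50) = 50 almost surely.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
print(np.linalg.matrix_rank(A))
```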

Question 2 (True/False with Justification)

True or False: If a matrix $\mathbf{A}$ has a nontrivial null space (i.e., $N(\mathbf{A}) \neq \{\mathbf{0}\}$), then the system $\mathbf{Ax} = \mathbf{b}$ has infinitely many solutions for every $\mathbf{b}$.

Answer **False.** A nontrivial null space means that if a solution exists, there are infinitely many (add any null-space vector to a solution to get another solution). But $\mathbf{Ax} = \mathbf{b}$ has a solution only if $\mathbf{b} \in C(\mathbf{A})$ (the column space). If $\mathbf{b} \notin C(\mathbf{A})$, there is *no* solution, regardless of the null space.
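A small concrete counterexample (an illustrative sketch): the matrix below has a nontrivial null space, yet $\mathbf{Ax} = \mathbf{b}$ is unsolvable because $\mathbf{b}$ lies outside the column space.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])   # N(A) = span{(0, 1)}, C(A) = span{(1, 0)}
b = np.array([0.0, 1.0])     # b is not in C(A)

# The least-squares solution leaves a nonzero residual,
# so no exact solution exists despite the nontrivial null space.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x - b))
```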

Question 3 (Short Answer)

Write the eigendecomposition of a symmetric matrix $\mathbf{A}$ and explain why the spectral theorem guarantees that the eigenvectors are orthogonal.

Answer $\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$, where $\mathbf{Q}$ is an orthogonal matrix (columns are orthonormal eigenvectors) and $\mathbf{\Lambda}$ is a diagonal matrix of real eigenvalues. The spectral theorem guarantees orthogonality because for a real symmetric matrix, eigenvectors corresponding to distinct eigenvalues are always orthogonal. To see why: if $\mathbf{A}\mathbf{v}_1 = \lambda_1 \mathbf{v}_1$ and $\mathbf{A}\mathbf{v}_2 = \lambda_2 \mathbf{v}_2$ with $\lambda_1 \neq \lambda_2$, then $\lambda_1 \mathbf{v}_1^\top \mathbf{v}_2 = (\mathbf{A}\mathbf{v}_1)^\top \mathbf{v}_2 = \mathbf{v}_1^\top \mathbf{A}^\top \mathbf{v}_2 = \mathbf{v}_1^\top \mathbf{A} \mathbf{v}_2 = \lambda_2 \mathbf{v}_1^\top \mathbf{v}_2$, which implies $(\lambda_1 - \lambda_2)\mathbf{v}_1^\top \mathbf{v}_2 = 0$, so $\mathbf{v}_1^\top \mathbf{v}_2 = 0$.
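The decomposition can be verified numerically (a sketch using a random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M + M.T                          # symmetrize

lam, Q = np.linalg.eigh(A)           # real eigenvalues, orthonormal eigenvectors
assert np.allclose(Q.T @ Q, np.eye(4))          # Q is orthogonal
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)   # A = Q Lambda Q^T
```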

Question 4 (Multiple Choice)

In the SVD $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the left singular vectors $\mathbf{U}$ span:

a) The null space of $\mathbf{A}$ b) The row space of $\mathbf{A}$ c) The column space of $\mathbf{A}$ (first $r$ columns) and the left null space (remaining columns) d) The column space of $\mathbf{A}^\top$

Answer **c)** The first $r$ columns of $\mathbf{U}$ (where $r$ is the rank) form an orthonormal basis for the column space $C(\mathbf{A})$, and the remaining $m - r$ columns form an orthonormal basis for the left null space $N(\mathbf{A}^\top)$. Together, the columns of $\mathbf{U}$ provide an orthonormal basis for all of $\mathbb{R}^m$, decomposed into $C(\mathbf{A})$ and its orthogonal complement.
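This subspace decomposition can be checked directly (a sketch using a rank-2 matrix built by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))  # 5x3, rank 2

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))
# The trailing m - r columns of U lie in the left null space: A^T u = 0.
assert r == 2
assert np.allclose(A.T @ U[:, r:], 0)
```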

Question 5 (Short Answer)

What is the Eckart-Young-Mirsky theorem, and why is it important for data science?

Answer The Eckart-Young-Mirsky theorem states that the best rank-$k$ approximation to a matrix $\mathbf{A}$ (in both the Frobenius and spectral norms) is given by the truncated SVD: $\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. It is important for data science because it provides the mathematical foundation for dimensionality reduction, data compression, and matrix factorization. It tells us that PCA gives the optimal linear dimensionality reduction, that low-rank matrix factorization in recommender systems is well-defined, and that the singular value spectrum quantifies how compressible a dataset is.
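The theorem's Frobenius-norm statement is easy to verify numerically (a sketch): the rank-$k$ truncation error equals $\sqrt{\sum_{i > k} \sigma_i^2}$, the energy in the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD, rank k

err = np.linalg.norm(A - Ak, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```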

Question 6 (True/False with Justification)

True or False: Computing PCA via eigendecomposition of the covariance matrix $\mathbf{X}^\top \mathbf{X} / (n-1)$ is numerically equivalent to computing PCA via SVD of the centered data matrix $\mathbf{X}$.

Answer **False.** They are mathematically equivalent (they produce the same principal components), but *numerically* they are not equivalent. The covariance approach squares the condition number: $\kappa(\mathbf{X}^\top \mathbf{X}) = \kappa(\mathbf{X})^2$. This means the eigendecomposition of the covariance matrix can lose up to twice as many digits of numerical precision as the SVD of the data matrix. The SVD approach is numerically superior and is the method used by production implementations like scikit-learn's PCA.
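The condition-number squaring can be demonstrated directly (a sketch using a matrix with one deliberately small column to inflate $\kappa$):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 5)) * np.array([1.0, 1.0, 1.0, 1.0, 1e-3])

kx = np.linalg.cond(X)
kxtx = np.linalg.cond(X.T @ X)
# Forming the Gram matrix squares the condition number.
assert np.allclose(kxtx, kx ** 2, rtol=1e-6)
```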

Question 7 (Multiple Choice)

A positive definite matrix $\mathbf{A}$ has eigenvalues:

a) All positive b) All non-negative c) All real, but some may be negative d) Possibly complex

Answer **a) All positive.** By definition, a positive definite matrix satisfies $\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0$ for all nonzero $\mathbf{x}$. Since $\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$ implies $\mathbf{v}^\top \mathbf{A} \mathbf{v} = \lambda \|\mathbf{v}\|^2 > 0$, we must have $\lambda > 0$ for every eigenvalue. (Note: positive *semi-definite* matrices have all non-negative eigenvalues, option b.)
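A quick check (a sketch): $\mathbf{M}^\top \mathbf{M} + \mathbf{I}$ is positive definite by construction, and indeed every eigenvalue is strictly positive (here, at least 1).

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((4, 4))
A = M.T @ M + np.eye(4)          # positive definite by construction

lam = np.linalg.eigvalsh(A)
assert np.all(lam > 0)           # strictly positive
assert np.all(lam >= 1 - 1e-9)   # M^T M is PSD, so each eigenvalue is >= 1
```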

Question 8 (Short Answer)

Compute $\frac{\partial}{\partial \mathbf{w}} \|\mathbf{Xw} - \mathbf{y}\|_2^2$ and set it to zero to derive the normal equations.

Answer Let $L(\mathbf{w}) = \|\mathbf{Xw} - \mathbf{y}\|_2^2 = (\mathbf{Xw} - \mathbf{y})^\top(\mathbf{Xw} - \mathbf{y})$. Expanding: $L = \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} - 2\mathbf{y}^\top \mathbf{X}\mathbf{w} + \mathbf{y}^\top \mathbf{y}$. Taking the gradient: $\nabla_\mathbf{w} L = 2\mathbf{X}^\top \mathbf{X}\mathbf{w} - 2\mathbf{X}^\top \mathbf{y}$. Setting to zero: $\mathbf{X}^\top \mathbf{X}\mathbf{w}^* = \mathbf{X}^\top \mathbf{y}$. These are the normal equations. The solution is $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$ when $\mathbf{X}^\top \mathbf{X}$ is invertible.
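The derivation can be verified numerically (a sketch): the normal-equations solution matches `np.linalg.lstsq`, and the gradient $2\mathbf{X}^\top(\mathbf{Xw} - \mathbf{y})$ vanishes at the optimum.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

w = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_ls)                 # same minimizer

grad = 2 * X.T @ (X @ w - y)
assert np.allclose(grad, 0)                 # gradient is zero at the optimum
```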

Question 9 (True/False with Justification)

True or False: The Jacobian of a function $\mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2$ is a $3 \times 2$ matrix.

Answer **False.** The Jacobian is a $2 \times 3$ matrix. Row $i$ of the Jacobian contains the partial derivatives of $f_i$ with respect to each input variable. For $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $\mathbf{J} \in \mathbb{R}^{m \times n}$, so for $n = 3$, $m = 2$, we get $\mathbf{J} \in \mathbb{R}^{2 \times 3}$.
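A finite-difference sketch makes the shape convention concrete for a hypothetical $f: \mathbb{R}^3 \to \mathbb{R}^2$:

```python
import numpy as np

def f(x):
    # f: R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

x0 = np.array([1.0, 2.0, 3.0])
eps = 1e-6
# Central differences: column j holds the partials with respect to x_j.
J = np.column_stack([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

assert J.shape == (2, 3)  # m rows (outputs) by n columns (inputs)
assert np.allclose(J, [[2.0, 1.0, 0.0], [0.0, 0.0, np.cos(3.0)]], atol=1e-6)
```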

Question 10 (Multiple Choice)

In the backpropagation formula $\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Z}}$ for a linear layer $\mathbf{Z} = \mathbf{XW}$, the upstream gradient $\frac{\partial L}{\partial \mathbf{Z}}$ has shape:

a) $(d, h)$ — same shape as $\mathbf{W}$ b) $(n, h)$ — same shape as $\mathbf{Z}$ c) $(n, d)$ — same shape as $\mathbf{X}$ d) $(h, d)$ — the transpose of $\mathbf{W}$'s shape

Answer **b) $(n, h)$ — same shape as $\mathbf{Z}$.** The gradient of a scalar loss with respect to a matrix always has the same shape as that matrix. Since $\mathbf{Z} \in \mathbb{R}^{n \times h}$, we have $\frac{\partial L}{\partial \mathbf{Z}} \in \mathbb{R}^{n \times h}$. The product $\mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Z}}$ then has shape $(d, n) \times (n, h) = (d, h)$, matching the shape of $\mathbf{W}$.
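The shape bookkeeping can be sketched directly:

```python
import numpy as np

n, d, h = 8, 5, 3
rng = np.random.default_rng(7)
X = rng.standard_normal((n, d))
dZ = rng.standard_normal((n, h))   # upstream gradient, same shape as Z = XW

dW = X.T @ dZ                      # (d, n) @ (n, h) -> (d, h)
assert dW.shape == (d, h)          # matches the shape of W
```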

Question 11 (Short Answer)

The Hessian of a loss function at a critical point has eigenvalues $\{0.01, 0.5, 3.0, 100.0\}$. What is the condition number, and what does it imply about gradient descent convergence?

Answer The condition number is $\kappa = \lambda_{\max} / \lambda_{\min} = 100.0 / 0.01 = 10{,}000$. This implies gradient descent will converge very slowly. The maximum stable learning rate is limited by the largest eigenvalue ($\eta \leq 2/\lambda_{\max} = 0.02$), but with that learning rate, the component along the smallest eigenvalue's direction converges at a rate proportional to $(1 - \eta \lambda_{\min}) = (1 - 0.0002) \approx 0.9998$, requiring roughly $\kappa \ln(1/\epsilon)$ iterations. Adaptive optimizers like Adam effectively rescale each direction, reducing the effective condition number.
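A sketch of the per-direction contraction factors (here using $\eta = 1/\lambda_{\max}$, a common conservative choice rather than the stability limit $2/\lambda_{\max}$): the slowest mode shrinks by only $1 - \lambda_{\min}/\lambda_{\max} = 1 - 1/\kappa$ per step.

```python
import numpy as np

lam = np.array([0.01, 0.5, 3.0, 100.0])   # Hessian eigenvalues
kappa = lam.max() / lam.min()
assert np.isclose(kappa, 10_000)

eta = 1.0 / lam.max()
rates = np.abs(1 - eta * lam)             # per-eigendirection contraction
# The smallest-eigenvalue direction barely contracts: factor 1 - 1/kappa.
assert np.isclose(rates.max(), 1 - 1e-4)
```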

Question 12 (True/False with Justification)

True or False: For any matrix $\mathbf{A}$, the matrices $\mathbf{A}^\top \mathbf{A}$ and $\mathbf{A} \mathbf{A}^\top$ have the same nonzero eigenvalues.

Answer **True.** If $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ is the SVD, then $\mathbf{A}^\top \mathbf{A} = \mathbf{V}\mathbf{\Sigma}^\top \mathbf{\Sigma}\mathbf{V}^\top$ and $\mathbf{A}\mathbf{A}^\top = \mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}^\top \mathbf{U}^\top$. The nonzero entries of $\mathbf{\Sigma}^\top \mathbf{\Sigma}$ and $\mathbf{\Sigma}\mathbf{\Sigma}^\top$ are the same: $\sigma_1^2, \ldots, \sigma_r^2$. The two matrices may differ in size and have different numbers of zero eigenvalues, but their nonzero eigenvalues are identical.
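A numerical check (a sketch with a $5 \times 3$ matrix): the two Gram matrices differ in size, but their nonzero spectra coincide.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 3))

e1 = np.linalg.eigvalsh(A.T @ A)   # 3 eigenvalues
e2 = np.linalg.eigvalsh(A @ A.T)   # 5 eigenvalues (two are ~0)

# Top 3 eigenvalues agree; the extra two of A A^T are zero.
assert np.allclose(np.sort(e1)[::-1], np.sort(e2)[::-1][:3])
assert np.allclose(np.sort(e2)[:2], 0, atol=1e-10)
```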

Question 13 (Multiple Choice)

Which of the following is NOT a property of orthogonal matrices?

a) $\mathbf{Q}^{-1} = \mathbf{Q}^\top$ b) $\|\mathbf{Qx}\| = \|\mathbf{x}\|$ for all $\mathbf{x}$ c) $\det(\mathbf{Q}) = 1$ d) All eigenvalues have magnitude 1

Answer **c) $\det(\mathbf{Q}) = 1$.** Orthogonal matrices have $\det(\mathbf{Q}) = \pm 1$. Rotations have determinant $+1$, but reflections have determinant $-1$. Both are orthogonal. The other three properties hold for all orthogonal matrices.
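The reflection counterexample in two lines (a sketch):

```python
import numpy as np

R = np.array([[1.0, 0.0],
              [0.0, -1.0]])              # reflection across the x-axis
assert np.allclose(R.T @ R, np.eye(2))   # orthogonal
assert np.isclose(np.linalg.det(R), -1)  # determinant is -1, not +1
```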

Question 14 (Short Answer)

What is the Kronecker product $\mathbf{A} \otimes \mathbf{B}$ of $\mathbf{A} = \begin{bmatrix} 1 & 2 \end{bmatrix}$ and $\mathbf{B} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$? What is its shape?

Answer $\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} 1 \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix} & 2 \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 4 & 8 \end{bmatrix}$. Shape: $\mathbf{A}$ is $1 \times 2$, $\mathbf{B}$ is $2 \times 1$, so $\mathbf{A} \otimes \mathbf{B}$ is $(1 \cdot 2) \times (2 \cdot 1) = 2 \times 2$.
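NumPy's `np.kron` reproduces the hand computation (a sketch):

```python
import numpy as np

A = np.array([[1, 2]])     # 1 x 2
B = np.array([[3], [4]])   # 2 x 1

K = np.kron(A, B)
assert K.shape == (2, 2)   # (1*2) x (2*1)
assert np.array_equal(K, np.array([[3, 6], [4, 8]]))
```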

Question 15 (True/False with Justification)

True or False: If the singular values of the Jacobian matrices in a neural network are all equal to 1, gradients will neither explode nor vanish during backpropagation.

Answer **True.** Backpropagation computes the product of Jacobian matrices (one per layer). If each Jacobian has all singular values equal to 1 (i.e., it is an orthogonal transformation), then the product of any number of such matrices also has all singular values equal to 1. The gradient norm is exactly preserved across layers: $\|\mathbf{J}_L \cdots \mathbf{J}_1 \mathbf{g}\| = \|\mathbf{g}\|$. This is the ideal case, and it motivates orthogonal initialization for deep networks.
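This norm preservation can be simulated (a sketch using random orthogonal "Jacobians" drawn via QR, standing in for 50 layers):

```python
import numpy as np

rng = np.random.default_rng(9)
g0 = rng.standard_normal(6)   # initial upstream gradient
g = g0.copy()

for _ in range(50):           # 50 layers of orthogonal Jacobians
    Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # random orthogonal matrix
    g = Q @ g

# After 50 layers the gradient norm is unchanged: no explosion, no vanishing.
assert np.isclose(np.linalg.norm(g), np.linalg.norm(g0))
```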

Question 16 (Multiple Choice)

In the context of recommender systems, the rank of the user-item rating matrix $\mathbf{R}$ represents:

a) The number of users b) The number of items c) The number of independent latent factors explaining user-item interactions d) The sparsity of the matrix

Answer **c) The number of independent latent factors explaining user-item interactions.** If the rating matrix has (effective) rank $k$, it means user preferences and item characteristics can be described by $k$ independent factors. In the factorization $\mathbf{R} \approx \mathbf{P}\mathbf{Q}^\top$, $k$ is the dimensionality of the latent factor space. The numbers of users and items determine the dimensions of $\mathbf{R}$, not its rank. Sparsity refers to missing entries, which is related to but distinct from rank.

Question 17 (Short Answer)

Explain the difference between np.linalg.eig and np.linalg.eigh. When should you use each?

Answer `np.linalg.eig` computes eigenvalues and eigenvectors for general (possibly non-symmetric) square matrices. It can return complex eigenvalues and does not guarantee orthogonal eigenvectors. `np.linalg.eigh` is specialized for symmetric (Hermitian) matrices. It exploits the symmetry structure for faster computation and guarantees: (1) all eigenvalues are real, (2) eigenvectors are orthonormal, (3) eigenvalues are returned in ascending order. Use `eigh` whenever the matrix is symmetric — which includes covariance matrices, kernel matrices, Hessians, and graph Laplacians. Use `eig` only for non-symmetric matrices, which are rare in ML applications.

Question 18 (True/False with Justification)

True or False: The Frobenius norm of a matrix equals the $\ell_2$ norm of its singular values.

Answer **True.** $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\text{tr}(\mathbf{A}^\top \mathbf{A})} = \sqrt{\sum_i \sigma_i^2}$. The last expression is exactly $\|\boldsymbol{\sigma}\|_2$, the $\ell_2$ norm of the vector of singular values. This connection is useful: it means the Frobenius norm captures the total "energy" in a matrix, and the truncated SVD minimizes the energy discarded.
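A one-line numerical confirmation (a sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((4, 6))

s = np.linalg.svd(A, compute_uv=False)
# ||A||_F equals the l2 norm of the singular-value vector.
assert np.isclose(np.linalg.norm(A, "fro"), np.linalg.norm(s))
```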

Question 19 (Multiple Choice)

When converting numpy arrays to PyTorch tensors using torch.from_numpy(arr):

a) A deep copy is made, so modifying one does not affect the other b) The tensor shares memory with the numpy array — modifying one modifies the other c) Only integer arrays are shared; float arrays are copied d) The conversion fails for arrays with more than 2 dimensions

Answer **b) The tensor shares memory with the numpy array.** `torch.from_numpy()` creates a tensor that shares the same underlying data buffer as the numpy array. Modifying either the tensor or the array will modify both. To get an independent copy, use `torch.tensor(arr)` or `torch.from_numpy(arr).clone()`. This memory sharing is efficient (no copy cost) but can cause subtle bugs if not understood.
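The sharing behavior is easy to demonstrate (a sketch, assuming PyTorch is installed):

```python
import numpy as np
import torch

arr = np.zeros(3)
t_shared = torch.from_numpy(arr)   # shares the numpy buffer
t_copy = torch.tensor(arr)         # independent copy

arr[0] = 7.0
assert t_shared[0].item() == 7.0   # the shared tensor sees the change
assert t_copy[0].item() == 0.0     # the copy does not
```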

Question 20 (Short Answer)

The covariance matrix of a dataset has eigenvalues $[50, 10, 2, 0.1, 0.01]$. You want to retain 95% of the total variance. How many principal components do you keep? Show your computation.

Answer Total variance $= 50 + 10 + 2 + 0.1 + 0.01 = 62.11$. Cumulative variance ratios:

- 1 component: $50 / 62.11 = 80.5\%$
- 2 components: $(50 + 10) / 62.11 = 96.6\%$

Two principal components capture $96.6\%$ of the total variance, which exceeds the 95% threshold. Keep **2 components**. Note that the first component alone captures $80.5\%$, illustrating the classic pattern: a few components capture most of the variance, and the remaining components capture noise-level variance ($0.1$ and $0.01$).
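The computation can be sketched as:

```python
import numpy as np

eigvals = np.array([50.0, 10.0, 2.0, 0.1, 0.01])
ratios = np.cumsum(eigvals) / eigvals.sum()     # cumulative explained variance

# Smallest k whose cumulative ratio reaches 95%.
k = int(np.argmax(ratios >= 0.95)) + 1
assert k == 2
assert np.isclose(ratios[0], 0.805, atol=5e-4)  # first component alone: ~80.5%
```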

Quiz for Chapter 1 of Advanced Data Science