Chapter 2 Quiz: Linear Algebra for AI

Test your understanding of the key concepts from this chapter. Each question has a hidden answer---try to answer before revealing it.


Question 1. What is the result of the dot product $\mathbf{u} \cdot \mathbf{v}$ where $\mathbf{u} = [2, -1, 3]^T$ and $\mathbf{v} = [4, 0, -2]^T$?

Show Answer $\mathbf{u} \cdot \mathbf{v} = (2)(4) + (-1)(0) + (3)(-2) = 8 + 0 - 6 = 2$. The dot product is a scalar computed by summing the element-wise products.

Question 2. If two vectors have a cosine similarity of 0, what does this tell you about their relationship?

Show Answer The vectors are **orthogonal** (perpendicular). Cosine similarity of 0 means the angle between them is 90 degrees, so $\cos(90°) = 0$. In the context of word embeddings, this means the two words have no semantic similarity as measured by the model.
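
For illustration, a minimal sketch of cosine similarity in NumPy (the helper name `cosine_similarity` is ours, not a library function):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Orthogonal vectors have cosine similarity 0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```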

Question 3. What is the $L^2$ norm of the vector $\mathbf{v} = [3, 4]^T$?

Show Answer $\|\mathbf{v}\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$. This is the familiar Pythagorean theorem applied to find the length of the vector.

Question 4. Given a set of three vectors in $\mathbb{R}^2$, can they be linearly independent? Why or why not?

Show Answer **No.** In $\mathbb{R}^2$, the maximum number of linearly independent vectors is 2 (the dimension of the space). Any third vector must be a linear combination of the other two. More generally, in $\mathbb{R}^n$, at most $n$ vectors can be linearly independent.
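
One way to see this numerically is to stack the vectors as columns and inspect the rank; a small sketch with arbitrary example vectors:

```python
import numpy as np

# Three vectors in R^2, stacked as the columns of a 2x3 matrix
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])

# The rank can never exceed 2, so the three columns must be dependent
print(np.linalg.matrix_rank(V))  # 2
```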

Question 5. What is the shape of the product $\mathbf{C} = \mathbf{A}\mathbf{B}$ if $\mathbf{A} \in \mathbb{R}^{3 \times 5}$ and $\mathbf{B} \in \mathbb{R}^{5 \times 2}$?

Show Answer $\mathbf{C} \in \mathbb{R}^{3 \times 2}$. The inner dimensions (5) must match and are "consumed" by the multiplication. The resulting shape takes the outer dimensions: 3 rows from $\mathbf{A}$ and 2 columns from $\mathbf{B}$.

Question 6. Is matrix multiplication commutative? That is, does $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$ in general?

Show Answer **No.** Matrix multiplication is generally not commutative. In fact, if $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times p$ with $m \neq p$, the product $\mathbf{B}\mathbf{A}$ is not even defined. Even when both products are defined and have the same shape (square matrices of the same size), $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$ in general. A notable exception is when both matrices are diagonal.
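
A quick numerical demonstration with two arbitrarily chosen square matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)                          # [[2 1], [4 3]]
print(B @ A)                          # [[3 4], [1 2]]
print(np.array_equal(A @ B, B @ A))   # False
```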

Question 7. What does it mean for a matrix to be singular?

Show Answer A matrix is **singular** if its determinant is zero, which means it is **not invertible**. Equivalently, a singular matrix has at least one eigenvalue equal to zero, has a non-trivial null space, and maps some non-zero vectors to the zero vector. In practical terms, the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ either has no solution or infinitely many solutions when $\mathbf{A}$ is singular.
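
A small sketch with a deliberately singular matrix (the example values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second row is twice the first: singular

print(np.linalg.det(A))      # 0.0 (up to sign/rounding)
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as e:
    print("not invertible:", e)
```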

Question 8. If $\mathbf{A}$ is a $4 \times 4$ matrix with $\text{rank}(\mathbf{A}) = 3$, what is the dimension of its null space?

Show Answer By the rank-nullity theorem, $\text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n$, where $n = 4$ is the number of columns. Therefore, $\text{nullity}(\mathbf{A}) = 4 - 3 = 1$. The null space is one-dimensional, meaning there is one "direction" in which the matrix annihilates inputs.
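
This can be checked numerically by building a rank-deficient $4 \times 4$ matrix; a sketch with an arbitrary construction:

```python
import numpy as np

# Build a 4x4 matrix of rank 3: the last column is the sum of the first two
A = np.random.default_rng(0).normal(size=(4, 3))
A = np.column_stack([A, A[:, 0] + A[:, 1]])

rank = np.linalg.matrix_rank(A)
print(rank)                # 3
print(A.shape[1] - rank)   # nullity = 4 - 3 = 1
```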

Question 9. Why should you use `np.linalg.solve(A, b)` instead of `np.linalg.inv(A) @ b` to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$?

Show Answer `np.linalg.solve` is preferred for two reasons:

1. **Numerical stability**: Computing the inverse explicitly amplifies rounding errors, especially for ill-conditioned matrices. `solve` uses LU decomposition with partial pivoting, which is more numerically stable.
2. **Efficiency**: Both approaches cost $O(n^3)$, but forming the explicit inverse requires roughly three times as many floating-point operations as an LU factorization followed by triangular solves, and the extra $O(n^2)$ multiplication by $\mathbf{b}$ adds further work and rounding error.
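
A minimal comparison sketch (random data, so the printed residuals will vary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))
b = rng.normal(size=500)

x_solve = np.linalg.solve(A, b)   # LU-based solve (preferred)
x_inv = np.linalg.inv(A) @ b      # explicit inverse (avoid)

# Both residuals are small here, but solve is typically more accurate and faster
print(np.linalg.norm(A @ x_solve - b))
print(np.linalg.norm(A @ x_inv - b))
```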

Question 10. What is the geometric interpretation of a matrix with $\det(\mathbf{A}) = 0$?

Show Answer A determinant of zero means the transformation **collapses at least one dimension**. The matrix maps $n$-dimensional space onto a lower-dimensional subspace (a line, plane, etc.). Geometrically, it squashes a volume to zero. For example, a $2 \times 2$ matrix with zero determinant maps all of $\mathbb{R}^2$ onto a line (or a point).

Question 11. What does an eigenvalue of $\lambda = 1$ tell you about the corresponding eigenvector?

Show Answer An eigenvalue of $\lambda = 1$ means the matrix leaves the eigenvector **completely unchanged**: $\mathbf{A}\mathbf{v} = 1 \cdot \mathbf{v} = \mathbf{v}$. The eigenvector is a fixed direction of the transformation. This is important in Markov chains, where the stationary distribution is the eigenvector corresponding to $\lambda = 1$.
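
As a sketch, the stationary distribution of a small Markov chain can be read off the eigenvector for $\lambda = 1$; the column-stochastic transition matrix below is an arbitrary example:

```python
import numpy as np

# Column-stochastic transition matrix: each column sums to 1
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

eigvals, eigvecs = np.linalg.eig(P)
i = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
stationary = np.real(eigvecs[:, i])
stationary /= stationary.sum()         # normalize to a probability distribution

print(stationary)       # approx [0.667, 0.333]
print(P @ stationary)   # unchanged by the transition matrix
```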

Question 12. For a symmetric matrix, what is special about its eigendecomposition?

Show Answer For a symmetric matrix $\mathbf{A} = \mathbf{A}^T$, the spectral theorem guarantees:

1. All eigenvalues are **real** (not complex).
2. Eigenvectors corresponding to distinct eigenvalues are **orthogonal**.
3. The matrix can be diagonalized as $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$, where $\mathbf{Q}$ is orthogonal ($\mathbf{Q}^{-1} = \mathbf{Q}^T$).

In NumPy, you should use `np.linalg.eigh` for symmetric matrices, which is faster and more numerically stable than `np.linalg.eig`.
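
A minimal sketch verifying these properties with `np.linalg.eigh` on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                      # symmetrize

eigvals, Q = np.linalg.eigh(A)         # eigh: for symmetric/Hermitian matrices
print(eigvals)                                      # all real, ascending order
print(np.allclose(Q @ np.diag(eigvals) @ Q.T, A))   # True: A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))              # True: Q is orthogonal
```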

Question 13. What is the condition number of a matrix, and why does it matter for AI?

Show Answer The condition number is $\kappa(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\|$. For the $L^2$ norm this equals $\frac{\sigma_{\max}}{\sigma_{\min}}$ for any invertible matrix, which for symmetric matrices reduces to $\frac{\max_i |\lambda_i|}{\min_i |\lambda_i|}$. A large condition number means:

- Small perturbations in the input cause large changes in the output.
- Numerical solutions to $\mathbf{A}\mathbf{x} = \mathbf{b}$ are unreliable.
- Gradient descent converges slowly when the Hessian of the loss has a large condition number, oscillating along poorly conditioned directions.

A condition number close to 1 indicates a well-conditioned matrix.
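
NumPy exposes this via `np.linalg.cond`; a small sketch contrasting a well-conditioned and a nearly singular matrix (values chosen for illustration):

```python
import numpy as np

well = np.eye(3)
ill = np.array([[1.0, 1.0],
                [1.0, 1.0 + 1e-10]])   # rows nearly identical

print(np.linalg.cond(well))   # 1.0 (perfectly conditioned)
print(np.linalg.cond(ill))    # huge (~4e10): nearly singular
```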

Question 14. In the SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, what are the dimensions of each factor if $\mathbf{A} \in \mathbb{R}^{100 \times 50}$?

Show Answer For the full SVD:

- $\mathbf{U} \in \mathbb{R}^{100 \times 100}$ (orthogonal, left singular vectors)
- $\boldsymbol{\Sigma} \in \mathbb{R}^{100 \times 50}$ (rectangular diagonal, singular values)
- $\mathbf{V}^T \in \mathbb{R}^{50 \times 50}$ (orthogonal, right singular vectors)

For the economy (reduced) SVD (which NumPy returns with `full_matrices=False`):

- $\mathbf{U} \in \mathbb{R}^{100 \times 50}$
- $\boldsymbol{\Sigma} \in \mathbb{R}^{50 \times 50}$ (or just a vector of 50 singular values)
- $\mathbf{V}^T \in \mathbb{R}^{50 \times 50}$
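
A quick shape check in NumPy (random data, used only to show the shapes):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(100, 50))

# Full SVD: s is returned as a 1-D vector of singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)   # (100, 100) (50,) (50, 50)

# Economy (reduced) SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)   # (100, 50) (50,) (50, 50)
```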

Question 15. What does the Eckart-Young theorem state, and why is it important?

Show Answer The Eckart-Young theorem states that the **best rank-$k$ approximation** of a matrix $\mathbf{A}$ (in both the Frobenius norm and spectral norm) is given by the truncated SVD: $\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$. This is important because it provides the mathematical foundation for **dimensionality reduction** and **data compression**. When we keep only the top $k$ singular values/vectors, we are provably getting the best possible approximation of rank $k$. This underpins PCA, latent semantic analysis, image compression, and many other techniques.
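
A minimal sketch of a rank-$k$ approximation built from the truncated SVD (random matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

print(np.linalg.matrix_rank(A_k))        # 3
print(np.linalg.norm(A - A_k, 'fro'))    # approximation error in the Frobenius norm
print(np.sqrt(np.sum(s[k:]**2)))         # equals the error: sqrt of the discarded sigma_i^2
```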

Question 16. What is the difference between `np.dot(A, B)`, `A @ B`, and `A * B` in NumPy?

Show Answer `np.dot(A, B)` and `A @ B` perform **matrix multiplication** (or the dot product for 1-D arrays); they are equivalent for 2-D arrays. `A * B` performs **element-wise (Hadamard) multiplication**, where each element is $c_{ij} = a_{ij} \cdot b_{ij}$. For example, with $2 \times 2$ matrices:

- `A @ B` computes $c_{11} = a_{11}b_{11} + a_{12}b_{21}$ (sum of products of a row and a column)
- `A * B` computes $c_{11} = a_{11}b_{11}$ (just multiply corresponding elements)

The `@` operator was introduced in Python 3.5 and is the recommended way to express matrix multiplication.
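
A short demonstration of the difference with small integer matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A @ B)         # matrix product: [[19 22], [43 50]]
print(np.dot(A, B))  # same as A @ B for 2-D arrays
print(A * B)         # element-wise product: [[ 5 12], [21 32]]
```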

Question 17. Why is stacking multiple linear layers without activation functions equivalent to a single linear layer?

Show Answer If layer 1 computes $\mathbf{h} = \mathbf{W}_1 \mathbf{x}$ and layer 2 computes $\mathbf{y} = \mathbf{W}_2 \mathbf{h}$, then: $$\mathbf{y} = \mathbf{W}_2 (\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1) \mathbf{x} = \mathbf{W}_{\text{combined}} \mathbf{x}$$ The composition of linear transformations is itself a linear transformation. No matter how many linear layers you stack, the result is equivalent to a single matrix $\mathbf{W}_{\text{combined}}$. This is why **non-linear activation functions** (ReLU, sigmoid, etc.) are essential---they break the linearity and allow neural networks to learn complex, non-linear relationships.
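
A small numerical check that two stacked bias-free linear layers collapse to a single matrix (random weights for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))    # layer 1: R^4 -> R^8
W2 = rng.normal(size=(3, 8))    # layer 2: R^8 -> R^3
x = rng.normal(size=4)

y_stacked = W2 @ (W1 @ x)       # pass through both layers
W_combined = W2 @ W1            # single equivalent matrix
print(np.allclose(y_stacked, W_combined @ x))  # True
```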

Question 18. What is the gradient of $f(\mathbf{x}) = \|\mathbf{x}\|_2^2$ with respect to $\mathbf{x}$?

Show Answer $\nabla_{\mathbf{x}} f = 2\mathbf{x}$. We can derive this: $f(\mathbf{x}) = \|\mathbf{x}\|_2^2 = \mathbf{x}^T\mathbf{x} = \sum_i x_i^2$. Taking the partial derivative with respect to $x_i$ gives $\frac{\partial f}{\partial x_i} = 2x_i$. Stacking these into a vector gives $\nabla f = 2\mathbf{x}$. This is a special case of $\nabla_{\mathbf{x}} (\mathbf{x}^T \mathbf{A} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}$ with $\mathbf{A} = \mathbf{I}$.
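
As a sketch, the analytic gradient can be verified against a central finite-difference approximation (the example vector is arbitrary):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
grad_analytic = 2 * x

# Central finite-difference estimate of the gradient of f(x) = sum(x_i^2)
eps = 1e-6
grad_numeric = np.array([
    (np.sum((x + eps * e)**2) - np.sum((x - eps * e)**2)) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(grad_analytic, grad_numeric))  # True
```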

Question 19. How are the eigenvalues of $\mathbf{A}^T\mathbf{A}$ related to the singular values of $\mathbf{A}$?

Show Answer The singular values of $\mathbf{A}$ are the **square roots** of the eigenvalues of $\mathbf{A}^T\mathbf{A}$. If $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, then $\mathbf{A}^T\mathbf{A} = \mathbf{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T$. This is the eigendecomposition of $\mathbf{A}^T\mathbf{A}$, with eigenvalues $\sigma_i^2$ and eigenvectors as the columns of $\mathbf{V}$. Note that since $\mathbf{A}^T\mathbf{A}$ is positive semidefinite, all its eigenvalues are non-negative, which is why singular values are always non-negative.
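
A sketch verifying the relationship numerically on a random matrix:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))

singular_values = np.linalg.svd(A, compute_uv=False)   # descending order
eigvals = np.linalg.eigvalsh(A.T @ A)                   # A^T A is symmetric PSD

# Sort both descending and compare: sigma_i == sqrt(lambda_i)
print(np.allclose(np.sort(singular_values)[::-1],
                  np.sqrt(np.sort(eigvals)[::-1])))     # True
```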

Question 20. A matrix $\mathbf{A}$ has eigenvalues $\{3, 1, -2\}$. What is $\text{tr}(\mathbf{A})$? What is $\det(\mathbf{A})$?

Show Answer

- $\text{tr}(\mathbf{A}) = 3 + 1 + (-2) = 2$ (the trace equals the sum of the eigenvalues).
- $\det(\mathbf{A}) = 3 \times 1 \times (-2) = -6$ (the determinant equals the product of the eigenvalues).

These are fundamental identities that connect scalar matrix properties to the spectrum.

Question 21. What is the pseudoinverse, and when would you use it instead of the regular inverse?

Show Answer The Moore-Penrose pseudoinverse $\mathbf{A}^+$ generalizes the inverse to **non-square and singular matrices**. It is computed via the SVD: $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^T$, where $\boldsymbol{\Sigma}^+$ is formed by taking the reciprocal of each non-zero singular value and transposing the result. You would use the pseudoinverse when:

- The matrix is rectangular (overdetermined or underdetermined systems).
- The matrix is singular (the regular inverse does not exist).
- You want the least-squares solution to $\mathbf{A}\mathbf{x} = \mathbf{b}$ (for overdetermined systems) or the minimum-norm solution (for underdetermined systems).

In NumPy: `np.linalg.pinv(A)`.
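
A minimal least-squares sketch on an overdetermined system (random data; `np.linalg.lstsq` is shown only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))          # overdetermined: 20 equations, 3 unknowns
b = rng.normal(size=20)

x_pinv = np.linalg.pinv(A) @ b                     # least-squares solution via pseudoinverse
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # same solution via lstsq

print(np.allclose(x_pinv, x_lstsq))   # True
```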

Question 22. You have a $1000 \times 500$ data matrix and compute its SVD. The singular values are $[100, 50, 25, 12, 6, 3, 1.5, 0.7, \ldots]$. If you keep only the top 3 singular values, approximately what fraction of the "energy" (sum of squared singular values) do you retain?

Show Answer The total energy is $\sum_i \sigma_i^2 = 100^2 + 50^2 + 25^2 + 12^2 + 6^2 + 3^2 + 1.5^2 + 0.7^2 + \cdots = 10000 + 2500 + 625 + 144 + 36 + 9 + 2.25 + 0.49 + \cdots \approx 13317$. The top 3 components capture $10000 + 2500 + 625 = 13125$. The fraction retained is $\approx 13125 / 13317 \approx 98.6\%$ (the omitted trailing singular values are too small to change this appreciably). This demonstrates how SVD can achieve dramatic dimensionality reduction when singular values decay quickly: keeping just 3 of the 500 components retains over 98% of the energy.
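
The arithmetic above, as a short sketch (only the listed singular values are included, so the fraction is approximate):

```python
import numpy as np

s = np.array([100, 50, 25, 12, 6, 3, 1.5, 0.7])   # leading singular values from the question
energy = s**2

fraction = energy[:3].sum() / energy.sum()
print(round(fraction, 4))   # approx 0.9856
```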

Question 23. What is broadcasting in NumPy, and how does it relate to linear algebra operations?

Show Answer Broadcasting is NumPy's mechanism for performing element-wise operations on arrays of different shapes by automatically expanding dimensions. For example, adding a vector $\mathbf{b} \in \mathbb{R}^n$ to every row of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$:

```python
result = X + b  # b is broadcast across the rows of X
```

This is equivalent to adding $\mathbf{b}$ to each row individually but is much more efficient because it avoids Python loops. In linear algebra terms, broadcasting implements the bias addition in $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$ and mean-centering operations like $\mathbf{X} - \boldsymbol{\mu}$, where $\boldsymbol{\mu}$ is the mean vector.

Question 24. An orthogonal matrix $\mathbf{Q}$ satisfies $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$. What are its eigenvalues?

Show Answer The eigenvalues of an orthogonal matrix all have **absolute value 1**. They lie on the unit circle in the complex plane. Proof sketch: If $\mathbf{Q}\mathbf{v} = \lambda\mathbf{v}$, then $\|\mathbf{Q}\mathbf{v}\|^2 = |\lambda|^2 \|\mathbf{v}\|^2$. But since $\mathbf{Q}$ preserves norms, $\|\mathbf{Q}\mathbf{v}\|^2 = \|\mathbf{v}\|^2$, so $|\lambda|^2 = 1$, meaning $|\lambda| = 1$. For real orthogonal matrices, the real eigenvalues are exactly $+1$ or $-1$. Complex eigenvalues come in conjugate pairs on the unit circle.
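
A small sketch using a 2-D rotation matrix, which is orthogonal (the angle is arbitrary):

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2-D rotation: orthogonal

print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q^T Q = I
eigvals = np.linalg.eigvals(Q)
print(np.abs(eigvals))                   # [1. 1.]: eigenvalues lie on the unit circle
```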

Question 25. In the context of neural networks, what does the Jacobian matrix of a layer represent, and how is it used in backpropagation?

Show Answer The Jacobian matrix $\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ of a layer represents the **local linear approximation** of how the layer's outputs change with respect to its inputs. Each entry $J_{ij} = \frac{\partial y_i}{\partial x_j}$ tells how output $i$ responds to a small change in input $j$. In backpropagation, the Jacobian is used to propagate gradients backward through the network via the chain rule: $$\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^T \frac{\partial L}{\partial \mathbf{y}}$$ The upstream gradient $\frac{\partial L}{\partial \mathbf{y}}$ is multiplied by the transpose of the Jacobian to produce the downstream gradient $\frac{\partial L}{\partial \mathbf{x}}$. This is repeated layer by layer from the loss back to the input. In practice, we never form the full Jacobian explicitly---we compute the vector-Jacobian product (VJP) directly, which is much more efficient.
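
As a sketch, for a single linear layer $\mathbf{y} = \mathbf{W}\mathbf{x}$ the Jacobian is just $\mathbf{W}$, and the VJP is $\mathbf{W}^T$ times the upstream gradient (random values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # layer: R^5 -> R^3, y = W x
x = rng.normal(size=5)
dL_dy = rng.normal(size=3)     # upstream gradient from the loss

J = W                          # Jacobian of y = W x is W itself
dL_dx = J.T @ dL_dy            # vector-Jacobian product: downstream gradient
print(dL_dx.shape)             # (5,)
```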