Chapter 2: Key Takeaways
Vectors and Vector Spaces
- A vector is an ordered list of numbers that represents a data point, embedding, or parameter set. In AI, vectors range from 2D (for visualization) to hundreds of thousands of dimensions (sparse vocabulary representations such as one-hot vectors).
- The dot product $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$ measures alignment between vectors. It is the fundamental operation in every neural network forward pass.
- Cosine similarity normalizes the dot product by vector lengths, making it the standard similarity measure for embeddings: $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$.
- Norms quantify vector magnitude. The $L^1$ norm encourages sparsity (Lasso), the $L^2$ norm encourages small weights (Ridge), and the $L^\infty$ norm bounds adversarial perturbations.
- A set of vectors is linearly independent if no vector can be written as a combination of the others. The rank of a matrix tells you how many independent vectors it contains---and thus the true dimensionality of your data.
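A minimal NumPy sketch of these vector operations; the arrays and values are illustrative assumptions, not chapter data:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])

# Dot product: sum of element-wise products
dot = u @ v

# Cosine similarity: dot product normalized by the L2 norms
cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))

# L1, L2, and L-infinity norms
l1 = np.linalg.norm(u, ord=1)          # sum of absolute values
l2 = np.linalg.norm(u)                 # Euclidean length (the default)
linf = np.linalg.norm(u, ord=np.inf)   # largest absolute entry

# Rank counts the linearly independent rows/columns
X = np.stack([u, v, u + v])            # third row depends on the first two
rank = np.linalg.matrix_rank(X)        # -> 2

print(dot, cos_sim, l1, l2, linf, rank)
```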
Matrices and Operations
- A matrix serves dual roles: as a data container (rows = samples, columns = features) and as a linear transformation (multiplication maps inputs to outputs).
- Matrix multiplication is the workhorse of AI: every linear layer computes $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$. It is associative but not commutative.
- The transpose flips rows and columns. The identity $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$ appears frequently in gradient derivations.
- The determinant tells you whether a matrix is invertible ($\det \neq 0$) and how it scales volumes. The trace equals the sum of eigenvalues.
- Prefer `np.linalg.solve(A, b)` over `np.linalg.inv(A) @ b`---it is more numerically stable and computationally efficient.
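A short sketch of these matrix operations; the shapes and random values are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))        # linear-layer weights (illustrative shapes)
x = rng.standard_normal(4)
b = rng.standard_normal(3)

# A linear layer: y = Wx + b
y = W @ x + b

# Transpose identity: (AB)^T == B^T A^T
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
assert np.allclose((A @ B).T, B.T @ A.T)

# Solving A z = c: solve() factorizes A directly and is more stable
# and cheaper than forming the explicit inverse.
c = rng.standard_normal(3)
z_solve = np.linalg.solve(A, c)
z_inv = np.linalg.inv(A) @ c           # works, but avoid in practice
assert np.allclose(z_solve, z_inv)
```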
Linear Transformations
- Every matrix defines a linear transformation, and every linear transformation (between finite-dimensional spaces) can be represented as a matrix. Because the composition of linear maps is itself linear, a stack of linear layers with no activations collapses into a single linear layer; this is why neural networks need non-linear activations between linear layers.
- The column space is the set of possible outputs; the null space is the set of inputs that map to zero. The rank-nullity theorem connects them: $\text{rank} + \text{nullity} = n$, where $n$ is the number of columns (the input dimension).
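A quick numerical check of the rank-nullity theorem, using a small hand-picked matrix as an assumption:

```python
import numpy as np

# A 3x4 matrix maps R^4 -> R^3
A = np.array([[1., 2., 3., 4.],
              [2., 4., 6., 8.],        # a multiple of the first row
              [0., 1., 0., 1.]])

rank = np.linalg.matrix_rank(A)        # dimension of the column space -> 2
nullity = A.shape[1] - rank            # dimension of the null space   -> 2
print(rank + nullity == A.shape[1])    # rank + nullity = n = 4
```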
Eigendecomposition
- An eigenvector $\mathbf{v}$ of $\mathbf{A}$ satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$: the matrix only scales it, without changing its direction.
- The eigendecomposition $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ separates a transformation into its natural directions ($\mathbf{V}$) and scaling factors ($\boldsymbol{\Lambda}$).
- For symmetric matrices (covariance matrices, Hessians, kernels), the spectral theorem guarantees real eigenvalues and orthogonal eigenvectors: $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$.
- Eigenvalues of the covariance matrix drive PCA; eigenvalues of the Hessian determine optimization landscape properties (minima vs. saddle points).
- The condition number $\kappa = |\lambda_{\max}|/|\lambda_{\min}|$ measures numerical sensitivity. Large condition numbers cause slow convergence and numerical instability.
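A sketch of eigendecomposition on a symmetric matrix (a covariance matrix built from random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
C = np.cov(X, rowvar=False)            # symmetric 3x3 covariance matrix

# eigh is designed for symmetric matrices: real eigenvalues (ascending)
# and orthonormal eigenvectors.
eigvals, Q = np.linalg.eigh(C)

# Eigenvector property: C q = lambda q
assert np.allclose(C @ Q[:, 0], eigvals[0] * Q[:, 0])

# Spectral theorem: C = Q diag(lambda) Q^T
assert np.allclose(C, Q @ np.diag(eigvals) @ Q.T)

# Condition number from the eigenvalue spread
kappa = np.abs(eigvals).max() / np.abs(eigvals).min()
print(eigvals, kappa)
```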
Singular Value Decomposition
- SVD generalizes eigendecomposition to any matrix: $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, decomposing it into left singular vectors, singular values, and right singular vectors.
- Singular values are always non-negative and ordered: $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.
- The Eckart-Young theorem guarantees that the best rank-$k$ approximation is the truncated SVD, making it the mathematical foundation of dimensionality reduction.
- SVD applications in AI include image compression, latent semantic analysis, recommender systems, computing the pseudoinverse, and data whitening.
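A sketch of the truncated SVD as the best rank-$k$ approximation; the matrix size and the choice of $k$ are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

# Economy SVD: U is 50x30, s holds 30 singular values (descending), Vt is 30x30
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation (Eckart-Young): keep the top-k singular triplets
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error equals the first discarded singular value
err = np.linalg.norm(A - A_k, ord=2)
print(err, s[k])                       # these agree (up to floating-point error)
```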
Matrix Calculus Preview
- The gradient $\nabla_{\mathbf{x}} f$ of a scalar function with respect to a vector points in the direction of steepest ascent.
- The Jacobian $\mathbf{J}$ generalizes the gradient when both input and output are vectors.
- Key identities: $\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}$, $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}$ (for symmetric $\mathbf{A}$), $\nabla_{\mathbf{x}}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$.
- Backpropagation is the chain rule applied to vector-valued functions: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^T \frac{\partial L}{\partial \mathbf{y}}$.
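The gradient identities can be verified numerically. Below is a finite-difference check of $\nabla_{\mathbf{x}}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ on random inputs (an illustrative sketch, not chapter code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    # f(x) = ||Ax - b||^2
    return np.sum((A @ x - b) ** 2)

# Closed-form gradient from the identity above
grad = 2 * A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

assert np.allclose(grad, grad_fd, atol=1e-4)
```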
Practical NumPy Fluency
- Use `@` for matrix multiplication, `*` for element-wise multiplication.
- Use `np.linalg.eigh` (not `eig`) for symmetric matrices---it is faster and guarantees real eigenvalues.
- Use `np.linalg.svd` with `full_matrices=False` for the economy SVD.
- Leverage broadcasting to avoid Python loops: operations on arrays of different shapes are automatically expanded.
- Vectorized NumPy code can be 100--1000x faster than equivalent Python loops because it calls optimized C/Fortran BLAS libraries.
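A small example of broadcasting and vectorization; the array sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 64))    # 1,000 samples, 64 features

# Broadcasting: the (64,) mean and std expand across all rows automatically,
# so standardization needs no Python loop.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# All pairwise cosine similarities as a single matrix product
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sims = X_unit @ X_unit.T                # (1000, 1000) similarity matrix
```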