Chapter 2: Key Takeaways
Vectors and Vector Spaces
- A vector is an ordered list of numbers that represents a data point, embedding, or parameter set. In AI, vectors range from 2D (for visualization) to hundreds of thousands of dimensions (sparse vocabulary representations such as one-hot vectors).
- The dot product $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$ measures alignment between vectors. It is the fundamental operation in every neural network forward pass.
- Cosine similarity normalizes the dot product by vector lengths, making it the standard similarity measure for embeddings: $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$.
- Norms quantify vector magnitude. The $L^1$ norm encourages sparsity (Lasso), the $L^2$ norm encourages small weights (Ridge), and the $L^\infty$ norm bounds adversarial perturbations.
- A set of vectors is linearly independent if no vector can be written as a combination of the others. The rank of a matrix tells you how many independent vectors it contains---and thus the true dimensionality of your data.
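A minimal NumPy sketch of these vector operations; the arrays and values are illustrative assumptions, not chapter data:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 0.0, 1.0])

# Dot product: sum of element-wise products
dot = u @ v

# Cosine similarity: dot product normalized by the L2 norms
cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))

# L1, L2, and L-infinity norms
l1 = np.linalg.norm(u, ord=1)          # sum of absolute values
l2 = np.linalg.norm(u)                 # Euclidean length (the default)
linf = np.linalg.norm(u, ord=np.inf)   # largest absolute entry

# Rank counts the linearly independent rows/columns
X = np.stack([u, v, u + v])            # third row depends on the first two
rank = np.linalg.matrix_rank(X)        # -> 2

print(dot, cos_sim, l1, l2, linf, rank)
```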
Matrices and Operations
- A matrix serves dual roles: as a data container (rows = samples, columns = features) and as a linear transformation (multiplication maps inputs to outputs).
- Matrix multiplication is the workhorse of AI: every linear layer computes $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$. It is associative but not commutative.
- The transpose flips rows and columns. The identity $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$ appears frequently in gradient derivations.
- The determinant tells you whether a matrix is invertible ($\det \neq 0$) and how it scales volumes. The trace equals the sum of eigenvalues.
- Prefer `np.linalg.solve(A, b)` over `np.linalg.inv(A) @ b`---it is more numerically stable and computationally efficient.
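A short sketch of these matrix operations; the shapes and random values are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))        # linear-layer weights (illustrative shapes)
x = rng.standard_normal(4)
b = rng.standard_normal(3)

# A linear layer: y = Wx + b
y = W @ x + b

# Transpose identity: (AB)^T == B^T A^T
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
assert np.allclose((A @ B).T, B.T @ A.T)

# Solving A z = c: solve() factorizes A directly and is more stable
# and cheaper than forming the explicit inverse.
c = rng.standard_normal(3)
z_solve = np.linalg.solve(A, c)
z_inv = np.linalg.inv(A) @ c           # works, but avoid in practice
assert np.allclose(z_solve, z_inv)
```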
Linear Transformations
- Every matrix defines a linear transformation, and every linear transformation (between finite-dimensional spaces) can be represented as a matrix. Because the composition of linear maps is itself linear, a stack of linear layers with no activations collapses into a single linear layer; this is why neural networks need non-linear activations between linear layers.
- The column space is the set of possible outputs; the null space is the set of inputs that map to zero. The rank-nullity theorem connects them: $\text{rank} + \text{nullity} = n$, where $n$ is the number of columns (the input dimension).
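A quick numerical check of the rank-nullity theorem, using a small hand-picked matrix as an assumption:

```python
import numpy as np

# A 3x4 matrix maps R^4 -> R^3
A = np.array([[1., 2., 3., 4.],
              [2., 4., 6., 8.],        # a multiple of the first row
              [0., 1., 0., 1.]])

rank = np.linalg.matrix_rank(A)        # dimension of the column space -> 2
nullity = A.shape[1] - rank            # dimension of the null space   -> 2
print(rank + nullity == A.shape[1])    # rank + nullity = n = 4
```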
Eigendecomposition
- An eigenvector $\mathbf{v}$ of $\mathbf{A}$ satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$: the matrix only scales it, without changing its direction.
- The eigendecomposition $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ separates a transformation into its natural directions ($\mathbf{V}$) and scaling factors ($\boldsymbol{\Lambda}$).
- For symmetric matrices (covariance matrices, Hessians, kernels), the spectral theorem guarantees real eigenvalues and orthogonal eigenvectors: $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$.
- Eigenvalues of the covariance matrix drive PCA; eigenvalues of the Hessian determine optimization landscape properties (minima vs. saddle points).
- The condition number $\kappa = |\lambda_{\max}|/|\lambda_{\min}|$ measures numerical sensitivity. Large condition numbers cause slow convergence and numerical instability.
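A sketch of eigendecomposition on a symmetric matrix (a covariance matrix built from random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
C = np.cov(X, rowvar=False)            # symmetric 3x3 covariance matrix

# eigh is designed for symmetric matrices: real eigenvalues (ascending)
# and orthonormal eigenvectors.
eigvals, Q = np.linalg.eigh(C)

# Eigenvector property: C q = lambda q
assert np.allclose(C @ Q[:, 0], eigvals[0] * Q[:, 0])

# Spectral theorem: C = Q diag(lambda) Q^T
assert np.allclose(C, Q @ np.diag(eigvals) @ Q.T)

# Condition number from the eigenvalue spread
kappa = np.abs(eigvals).max() / np.abs(eigvals).min()
print(eigvals, kappa)
```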
Singular Value Decomposition
- SVD generalizes eigendecomposition to any matrix: $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, decomposing it into left singular vectors, singular values, and right singular vectors.
- Singular values are always non-negative and ordered: $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.
- The Eckart-Young theorem guarantees that the best rank-$k$ approximation is the truncated SVD, making it the mathematical foundation of dimensionality reduction.
- SVD applications in AI include image compression, latent semantic analysis, recommender systems, computing the pseudoinverse, and data whitening.
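A sketch of the truncated SVD as the best rank-$k$ approximation; the matrix size and the choice of $k$ are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

# Economy SVD: U is 50x30, s holds 30 singular values (descending), Vt is 30x30
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation (Eckart-Young): keep the top-k singular triplets
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The spectral-norm error equals the first discarded singular value
err = np.linalg.norm(A - A_k, ord=2)
print(err, s[k])                       # these agree (up to floating-point error)
```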
Matrix Calculus Preview
- The gradient $\nabla_{\mathbf{x}} f$ of a scalar function with respect to a vector points in the direction of steepest ascent.
- The Jacobian $\mathbf{J}$ generalizes the gradient when both input and output are vectors.
- Key identities: $\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}$, $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}$ (for symmetric $\mathbf{A}$), $\nabla_{\mathbf{x}}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$.
- Backpropagation is the chain rule applied to vector-valued functions: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^T \frac{\partial L}{\partial \mathbf{y}}$.
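The gradient identities can be verified numerically. Below is a finite-difference check of $\nabla_{\mathbf{x}}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ on random inputs (an illustrative sketch, not chapter code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    # f(x) = ||Ax - b||^2
    return np.sum((A @ x - b) ** 2)

# Closed-form gradient from the identity above
grad = 2 * A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

assert np.allclose(grad, grad_fd, atol=1e-4)
```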
Practical NumPy Fluency
- Use `@` for matrix multiplication, `*` for element-wise multiplication.
- Use `np.linalg.eigh` (not `eig`) for symmetric matrices---it is faster and guarantees real eigenvalues.
- Use `np.linalg.svd` with `full_matrices=False` for the economy SVD.
- Leverage broadcasting to avoid Python loops: operations on arrays of different shapes are automatically expanded.
- Vectorized NumPy code can be 100--1000x faster than equivalent Python loops because it calls optimized C/Fortran BLAS libraries.
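A small example of broadcasting and vectorization; the array sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 64))    # 1,000 samples, 64 features

# Broadcasting: the (64,) mean and std expand across all rows automatically,
# so standardization needs no Python loop.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# All pairwise cosine similarities as a single matrix product
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sims = X_unit @ X_unit.T                # (1000, 1000) similarity matrix
```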