Chapter 2: Key Takeaways

Vectors and Vector Spaces

  • A vector is an ordered list of numbers that represents a data point, embedding, or parameter set. In AI, vectors range from 2D (for visualization) to hundreds of thousands of dimensions (vocabulary embeddings).
  • The dot product $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$ measures alignment between vectors. It is the fundamental operation in every neural network forward pass.
  • Cosine similarity normalizes the dot product by vector lengths, making it the standard similarity measure for embeddings: $\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$.
  • Norms quantify vector magnitude. The $L^1$ norm encourages sparsity (Lasso), the $L^2$ norm encourages small weights (Ridge), and the $L^\infty$ norm bounds adversarial perturbations.
  • A set of vectors is linearly independent if no vector can be written as a combination of the others. The rank of a matrix tells you how many independent vectors it contains---and thus the true dimensionality of your data.
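
A minimal NumPy sketch of these quantities; the vectors and the matrix are arbitrary examples:

    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([4.0, 5.0, 6.0])

    dot = u @ v                                              # dot product
    cosine = dot / (np.linalg.norm(u) * np.linalg.norm(v))   # cosine similarity in [-1, 1]

    l1 = np.linalg.norm(u, 1)         # L1 norm: sum of absolute values
    l2 = np.linalg.norm(u)            # L2 norm (default): Euclidean length
    linf = np.linalg.norm(u, np.inf)  # L-infinity norm: largest absolute entry

    # Rank counts the linearly independent rows/columns of a matrix.
    M = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])
    print(np.linalg.matrix_rank(M))   # 2: the second row is twice the first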

Matrices and Operations

  • A matrix serves dual roles: as a data container (rows = samples, columns = features) and as a linear transformation (multiplication maps inputs to outputs).
  • Matrix multiplication is the workhorse of AI: every linear layer computes $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$. It is associative but not commutative.
  • The transpose flips rows and columns. The identity $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$ appears frequently in gradient derivations.
  • The determinant tells you whether a matrix is invertible ($\det \neq 0$) and how it scales volumes. The trace equals the sum of eigenvalues.
  • Prefer np.linalg.solve(A, b) over np.linalg.inv(A) @ b---it is more numerically stable and computationally efficient (compared directly in the sketch after this list).
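
A short sketch of these points; the matrices here are random placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    B = rng.standard_normal((3, 3))
    b = rng.standard_normal(3)

    print(np.allclose(A @ B, B @ A))            # generally False: not commutative
    print(np.allclose((A @ B).T, B.T @ A.T))    # True: (AB)^T = B^T A^T
    print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))  # trace = sum of eigenvalues

    # Solving Ax = b: a factorization-based solve is preferable to forming the inverse.
    x = np.linalg.solve(A, b)
    x_via_inv = np.linalg.inv(A) @ b            # works, but less stable and more work
    print(np.allclose(x, x_via_inv))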

Linear Transformations

  • Every matrix defines a linear transformation, and every linear transformation (in finite dimensions) can be represented as a matrix. Since a composition of linear maps is itself linear, stacking linear layers without non-linear activations collapses into a single linear layer; this is why neural networks need non-linear activations between linear layers.
  • The column space is the set of possible outputs; the null space is the set of inputs that map to zero. The rank-nullity theorem connects them: $\text{rank} + \text{nullity} = n$, where $n$ is the number of columns (the input dimension), as checked in the sketch below.
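
The rank-nullity theorem can be checked numerically. This sketch uses an arbitrary rank-deficient matrix and reads a null-space basis off the SVD (covered formally below):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])    # 2 x 3, rank 1: the second row is twice the first

    n = A.shape[1]                      # input dimension (number of columns)
    rank = np.linalg.matrix_rank(A)     # dimension of the column space
    nullity = n - rank                  # dimension of the null space
    print(rank + nullity == n)          # True

    # Rows of V^T beyond the rank span the null space: A maps them (numerically) to zero.
    _, _, Vt = np.linalg.svd(A)
    null_basis = Vt[rank:]
    print(np.allclose(A @ null_basis.T, 0.0))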

Eigendecomposition

  • An eigenvector $\mathbf{v}$ of $\mathbf{A}$ satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$: the matrix only scales it, without changing its direction.
  • The eigendecomposition $\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{-1}$ separates a transformation into its natural directions ($\mathbf{V}$) and scaling factors ($\boldsymbol{\Lambda}$).
  • For symmetric matrices (covariance matrices, Hessians, kernels), the spectral theorem guarantees real eigenvalues and orthogonal eigenvectors: $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$.
  • Eigenvalues of the covariance matrix drive PCA; eigenvalues of the Hessian determine optimization landscape properties (minima vs. saddle points).
  • The condition number $\kappa = |\lambda_{\max}|/|\lambda_{\min}|$ (for a symmetric matrix) measures numerical sensitivity. Large condition numbers cause slow convergence and numerical instability.
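
A sketch for the symmetric case, using a small covariance matrix built from random data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    C = np.cov(X, rowvar=False)          # 3 x 3 covariance matrix: symmetric

    # eigh is the routine for symmetric matrices: real eigenvalues, orthonormal eigenvectors.
    eigvals, Q = np.linalg.eigh(C)       # eigenvalues returned in ascending order

    print(np.allclose(C, Q @ np.diag(eigvals) @ Q.T))   # spectral theorem: C = Q Lambda Q^T
    print(np.allclose(Q.T @ Q, np.eye(3)))              # eigenvectors are orthonormal

    kappa = np.abs(eigvals).max() / np.abs(eigvals).min()
    print(np.isclose(kappa, np.linalg.cond(C)))         # matches NumPy's 2-norm condition number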

Singular Value Decomposition

  • SVD generalizes eigendecomposition to any matrix: $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, decomposing it into left singular vectors, singular values, and right singular vectors.
  • Singular values are always non-negative and ordered: $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$.
  • The Eckart-Young theorem guarantees that the best rank-$k$ approximation is the truncated SVD, making it the mathematical foundation of dimensionality reduction.
  • SVD applications in AI include image compression, latent semantic analysis, recommender systems, computing the pseudoinverse, and data whitening.
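
A sketch of low-rank approximation with the truncated SVD; the matrix and the choice of $k$ are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 30))
    k = 5

    # Economy SVD: U is 50x30, s holds 30 singular values (descending), Vt is 30x30.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Best rank-k approximation (Eckart-Young): keep the k largest singular values.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]

    print(np.linalg.matrix_rank(A_k))        # k
    print(np.linalg.norm(A - A_k))           # Frobenius error of the approximation...
    print(np.sqrt(np.sum(s[k:] ** 2)))       # ...equals sqrt of the discarded sigma_i^2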

Matrix Calculus Preview

  • The gradient $\nabla_{\mathbf{x}} f$ of a scalar function with respect to a vector points in the direction of steepest ascent.
  • The Jacobian $\mathbf{J}$, with entries $J_{ij} = \partial y_i / \partial x_j$, generalizes the gradient to functions where both input and output are vectors.
  • Key identities: $\nabla_{\mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}$, $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}$ (for symmetric $\mathbf{A}$), $\nabla_{\mathbf{x}}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$.
  • Backpropagation is the chain rule applied to vector-valued functions: $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^T \frac{\partial L}{\partial \mathbf{y}}$.
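
These identities are easy to sanity-check numerically. The sketch below compares the closed-form gradient of $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ with central finite differences; all values are random placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))
    b = rng.standard_normal(5)
    x = rng.standard_normal(3)

    def f(x):
        r = A @ x - b
        return r @ r                          # ||Ax - b||^2

    grad_closed = 2 * A.T @ (A @ x - b)       # closed-form gradient from the identity above

    # Central finite differences, one coordinate at a time.
    eps = 1e-6
    grad_fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad_fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

    print(np.allclose(grad_closed, grad_fd, atol=1e-5))   # True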

Practical NumPy Fluency

  • Use @ for matrix multiplication, * for element-wise multiplication.
  • Use np.linalg.eigh (not eig) for symmetric matrices---it is faster and guarantees real eigenvalues.
  • Use np.linalg.svd with full_matrices=False for the economy SVD.
  • Leverage broadcasting to avoid Python loops: operations on arrays of compatible shapes are automatically expanded to a common shape.
  • Vectorized NumPy code can be 100--1000x faster than equivalent Python loops because it calls optimized C/Fortran BLAS libraries.
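
A small sketch contrasting @ with * and using broadcasting in place of a Python loop; the shapes are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 3))    # 1000 samples, 3 features
    W = rng.standard_normal((3, 4))       # weights of a linear layer

    Y = X @ W          # matrix product: (1000, 3) @ (3, 4) -> (1000, 4)
    S = X * X          # element-wise square: same shape as X

    # Broadcasting: the (3,) mean vector is expanded across all 1000 rows, no loop needed.
    X_centered = X - X.mean(axis=0)
    print(np.allclose(X_centered.mean(axis=0), 0.0))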