Chapter 4 Key Takeaways

Probability Foundations

  • The three axioms of probability (non-negativity, normalization, countable additivity) are the foundation from which all probability theory is derived. They formalize the idea of distributing a "budget" of 1.0 across all possible outcomes.

  • Conditional probability $P(A \mid B) = P(A \cap B) / P(B)$ (defined when $P(B) > 0$) is the mechanism for updating beliefs when new information is observed. The product rule $P(A \cap B) = P(A \mid B) P(B)$ is the basis for factoring joint distributions.

  • Bayes' theorem $P(H \mid D) = P(D \mid H) P(H) / P(D)$ is the principled update rule for inference. The posterior combines the likelihood (data evidence) with the prior (existing beliefs). The base rate fallacy shows why accounting for priors is critical; the first sketch after this list works through a concrete case.

  • Conditional independence is the key simplifying assumption in probabilistic models. It reduces exponential parameter spaces to linear ones, as exemplified by Naive Bayes classifiers; the second sketch after this list makes the count explicit.
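
As a concrete illustration of Bayes' theorem and the base rate fallacy, here is a minimal sketch; the prevalence, sensitivity, and false positive rate are hypothetical numbers chosen for illustration. The evidence term $P(D)$ is expanded with the product rule from the conditional probability bullet.

```python
# Base rate fallacy: a positive result from an accurate test can still
# leave the hypothesis unlikely, because the prior P(H) is small.
# All numbers below are hypothetical, chosen for illustration.

p_h = 0.01              # prior P(H), e.g. 1% disease prevalence
p_d_given_h = 0.95      # likelihood P(D | H), test sensitivity
p_d_given_not_h = 0.10  # false positive rate P(D | not H)

# Product rule expands the evidence: P(D) = P(D|H)P(H) + P(D|~H)P(~H)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' theorem: P(H | D) = P(D | H) P(H) / P(D)
posterior = p_d_given_h * p_h / p_d
print(f"P(H | D) = {posterior:.3f}")  # ~0.088, far below the 95% many expect
```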
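
And a quick parameter count for the conditional independence bullet, assuming a binary class and $n$ binary features (the choice $n = 20$ is arbitrary):

```python
# Parameters needed to model P(x_1, ..., x_n | y) for n binary features
# and a binary class y.
n = 20  # number of binary features (illustrative)

# Full joint conditional: one probability per feature configuration per
# class, minus one normalization constraint per class.
full_joint = 2 * (2**n - 1)

# Naive Bayes assumes the features are conditionally independent given y:
# P(x | y) = prod_i P(x_i | y), so one Bernoulli parameter per feature
# per class.
naive_bayes = 2 * n

print(full_joint)   # 2,097,150 parameters
print(naive_bayes)  # 40 parameters
```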

Distributions

  • Bernoulli and Categorical distributions model discrete outcomes and appear in binary and multi-class classification. Gaussian distributions model continuous quantities and appear throughout AI, from weight initialization to noise modeling.

  • The Central Limit Theorem justifies the ubiquity of the Gaussian: suitably normalized sums of many independent random variables with finite variance converge to a Gaussian, regardless of their individual distributions. The first sketch after this list demonstrates this empirically.

  • Expectation ($\mathbb{E}[X]$) and variance ($\text{Var}(X)$) are the most important summary statistics. Expectation is linear; variance is not ($\text{Var}(aX) = a^2 \text{Var}(X)$, and $\text{Var}(X + Y)$ picks up a covariance cross-term). The covariance matrix extends these concepts to multivariate data, connecting back to the matrix theory from Chapter 2; see the second sketch after this list.
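
A minimal simulation of the Central Limit Theorem under the stated assumptions (i.i.d. draws with finite variance), using uniform random variables; the sample sizes and the kurtosis diagnostic are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform(0, 1) is far from Gaussian: mean 1/2, variance 1/12, kurtosis 1.8.
mu, sigma = 0.5, (1 / 12) ** 0.5

for n in (1, 2, 30):
    # 100,000 sample means, each over n i.i.d. uniform draws, standardized.
    means = rng.uniform(size=(100_000, n)).mean(axis=1)
    z = (means - mu) / (sigma / np.sqrt(n))
    kurtosis = np.mean((z - z.mean()) ** 4)  # a Gaussian has kurtosis 3
    print(f"n={n:2d}  kurtosis={kurtosis:.2f}")  # 1.8 -> ~3 as n grows
```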
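
And a numerical check of the expectation and variance claims on simulated data; the distributions and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=1_000_000)
y = rng.normal(-1.0, 3.0, size=1_000_000)  # independent of x

# Linearity of expectation holds with no assumptions about dependence:
print(np.mean(3 * x + y), 3 * np.mean(x) + np.mean(y))  # both ~5.0

# Variance is NOT linear: Var(aX) = a^2 Var(X).
print(np.var(3 * x), 9 * np.var(x))  # both ~9.0

# The covariance matrix generalizes variance to vectors: diagonal entries
# are variances, off-diagonal entries are covariances (~0 here, since the
# two variables were drawn independently).
print(np.cov(np.stack([x, y])))
```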

Statistical Estimation

  • Maximum Likelihood Estimation (MLE) finds parameters that maximize $p(\mathcal{D} \mid \theta)$. It relies on the data alone, with no prior, and for classification models it is equivalent to minimizing cross-entropy loss. Always work with the log-likelihood for numerical stability; the first sketch after this list does so for a Bernoulli model.

  • Maximum A Posteriori (MAP) estimation adds a prior term: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$. A Gaussian prior yields L2 regularization; a Laplace prior yields L1 regularization. The second sketch after this list shows the Gaussian case.

  • As the dataset size $n \to \infty$, the MAP estimate converges to the MLE -- the data overwhelms the prior. Priors matter most when data is scarce.

  • The bias-variance tradeoff is embodied in the MLE vs. MAP choice: MLE has no bias from priors but higher variance; MAP introduces bias but reduces variance.
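
A minimal sketch of MLE for a Bernoulli parameter, working with the log-likelihood throughout; the simulated coin and the grid search are illustrative (the closed-form answer is the sample mean).

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.binomial(1, 0.3, size=200)  # coin flips, true p = 0.3
k, n = data.sum(), data.size

# Log-likelihood of a Bernoulli parameter theta:
#   log p(D | theta) = k log(theta) + (n - k) log(1 - theta)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])  # grid maximizer
print(k / n)                       # closed-form MLE: the sample mean
```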
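
And a sketch of the MAP estimate for a Gaussian mean under a zero-mean Gaussian prior, whose log-density contributes an L2 penalty $-\theta^2 / (2\tau^2)$; the prior variance is an arbitrary choice. The same loop shows the MAP-to-MLE convergence as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, tau2 = 2.0, 1.0  # tau2: prior variance (arbitrary choice)

for n in (5, 50, 5000):
    x = rng.normal(true_mean, 1.0, size=n)
    mle = x.mean()
    # MAP for a N(theta, 1) likelihood with a N(0, tau2) prior on theta.
    # The log-prior adds -theta^2 / (2 * tau2), an L2 penalty; setting the
    # gradient of the penalized log-likelihood to zero gives a closed form:
    map_est = x.sum() / (n + 1 / tau2)
    print(f"n={n:5d}  MLE={mle:.3f}  MAP={map_est:.3f}")  # gap shrinks with n
```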

Information Theory

  • Entropy $H(X) = -\sum_x p(x) \log p(x)$ measures uncertainty. Over a finite set of outcomes, maximum entropy corresponds to the uniform distribution; zero entropy means the outcome is deterministic.

  • Cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ is the standard classification loss function. Minimizing cross-entropy in $q$ during training is equivalent to performing MLE and to minimizing KL divergence, since $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$ and $H(p)$ does not depend on the model. The first sketch after this list verifies this identity numerically.

  • KL divergence $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$ measures how far the approximation $q$ is from the reference $p$. It is non-negative and asymmetric, so it is not a true metric. Forward KL is mass-covering; reverse KL is mode-seeking.

  • Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ captures all statistical dependence between variables, including nonlinear relationships. It equals zero if and only if the variables are independent; the second sketch after this list computes it from a small joint table.

  • The data processing inequality ($I(X; Z) \leq I(X; Y)$ for $X \to Y \to Z$) constrains what neural networks can learn: information lost in early layers cannot be recovered.
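
A numerical check of the identity $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$ and of KL's asymmetry, on two arbitrary discrete distributions (natural-log units throughout):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (arbitrary)
q = np.array([0.4, 0.4, 0.2])  # model distribution (arbitrary)

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

print(np.isclose(kl, cross_entropy - entropy))  # True
print(kl, np.sum(q * np.log(q / p)))            # asymmetry: the two differ
```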
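
And a sketch computing mutual information from a small made-up joint table, verifying $I(X; Y) = H(X) - H(X \mid Y)$ and that an independent joint gives zero:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]  # 0 log 0 is taken as 0
    return -np.sum(p * np.log(p))

# Made-up joint distribution P(X, Y) over two binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

# I(X; Y) = sum_xy p(x, y) log( p(x, y) / (p(x) p(y)) )
mi = np.sum(joint * np.log(joint / np.outer(px, py)))

# Equivalently, I(X; Y) = H(X) - H(X | Y).
h_x_given_y = sum(py[j] * entropy(joint[:, j] / py[j]) for j in range(2))
print(mi, entropy(px) - h_x_given_y)  # equal, ~0.193 nats

# An independent joint (outer product of marginals) gives I = 0.
indep = np.outer(px, py)
print(np.sum(indep * np.log(indep / np.outer(px, py))))  # 0.0
```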

Practical Essentials

  • Always work in log-space when dealing with products of probabilities to prevent numerical underflow.

  • Use the log-sum-exp trick for numerically stable computation of $\log \sum_i \exp(a_i)$; the first sketch after this list implements it.

  • Laplace smoothing prevents zero probabilities for unseen events and has a Bayesian interpretation as MAP estimation with a Beta prior (a Dirichlet prior in the multi-class case); the second sketch after this list shows the add-one rule.

  • Softmax converts logits to probabilities; its temperature parameter controls the sharpness of the distribution, which is critical for controlling generation diversity in language models. The first sketch after this list includes temperature scaling.
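
A sketch tying the log-space, log-sum-exp, and temperature bullets together; the logits and temperature values are arbitrary.

```python
import numpy as np

def log_sum_exp(a):
    """Stable log(sum(exp(a))): shift by the max so exp never overflows."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, normalized in log-space for stability."""
    z = np.asarray(logits) / temperature
    return np.exp(z - log_sum_exp(z))

logits = np.array([1000.0, 1001.0, 999.0])  # naive exp() would overflow
print(softmax(logits))                      # sharp: ~[0.24, 0.67, 0.09]
print(softmax(logits, temperature=10.0))    # flatter: more diverse sampling
```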
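
And a minimal sketch of add-one (Laplace) smoothing on made-up word counts:

```python
import numpy as np

# Word counts from a toy corpus; "baz" was never observed.
counts = np.array([8, 2, 0])  # counts for ["foo", "bar", "baz"] (made up)

mle = counts / counts.sum()                             # gives "baz" probability 0
smoothed = (counts + 1) / (counts.sum() + len(counts))  # add-one smoothing

print(mle)       # [0.8, 0.2, 0.0]: a zero wipes out any product it enters
print(smoothed)  # [0.692, 0.231, 0.077]: unseen events keep some mass
```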

Connections Across the Book

Concept                     Where It Appears Next
Bayes' theorem              Bayesian methods (Ch. 10), Naive Bayes (Ch. 6)
Gaussian distribution       Weight initialization (Ch. 11), VAEs (Ch. 16)
MLE / Cross-entropy loss    Training neural networks (Ch. 12), language models (Ch. 21)
MAP / Regularization        Regularization and generalization (Ch. 13)
KL divergence               VAEs (Ch. 16), RLHF (Ch. 25), knowledge distillation (Ch. 33)
Mutual information          Feature selection (Ch. 9), contrastive learning (Ch. 16)
Softmax / Temperature       Transformer attention (Ch. 18), text generation (Ch. 21)
Entropy / Perplexity        Language model evaluation (Ch. 22)