# Chapter 4 Key Takeaways
## Probability Foundations

- The three axioms of probability (non-negativity, normalization, countable additivity) are the foundation from which all probability theory is derived. They formalize the idea of distributing a "budget" of 1.0 across all possible outcomes.
- Conditional probability $P(A \mid B) = P(A \cap B) / P(B)$ is the mechanism for updating beliefs when new information is observed. The product rule $P(A \cap B) = P(A \mid B) P(B)$ is the basis for factoring joint distributions.
- Bayes' theorem $P(H \mid D) = P(D \mid H) P(H) / P(D)$ is the principled update rule for inference. The posterior combines the likelihood (data evidence) with the prior (existing beliefs). The base rate fallacy shows why accounting for priors is critical; see the worked example after this list.
- Conditional independence is the key simplifying assumption in probabilistic models. It reduces exponential parameter spaces to linear ones, as exemplified by Naive Bayes classifiers.
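A minimal worked example of Bayes' theorem and the base rate fallacy. The test statistics and prevalence below are illustrative assumptions, not figures from the chapter:

```python
# Bayes' theorem for a binary hypothesis, applied to a hypothetical screening test.
# All numbers are illustrative assumptions.

def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(H | positive test) via Bayes' theorem."""
    # P(D) expands by the law of total probability over H and not-H.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical test: 99% sensitivity, 5% false positive rate,
# for a condition with a 0.1% base rate.
p = posterior(prior=0.001, likelihood=0.99, false_positive_rate=0.05)
print(f"P(disease | positive) = {p:.3f}")  # ~0.019: under 2% despite a "99% accurate" test
```

Ignoring the 0.1% prior is exactly the base rate fallacy: the intuitive answer ("about 99%") is off by a factor of fifty.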
## Distributions

- Bernoulli and Categorical distributions model discrete outcomes and appear in binary and multi-class classification. Gaussian distributions model continuous quantities and appear throughout AI, from weight initialization to noise modeling.
- The Central Limit Theorem justifies the ubiquity of the Gaussian: suitably normalized sums of many independent random variables with finite variance converge to a Gaussian, regardless of their individual distributions. The simulation after this list illustrates the effect.
- Expectation ($\mathbb{E}[X]$) and variance ($\text{Var}(X)$) are the most important summary statistics. Expectation is linear; variance is not. The covariance matrix extends these concepts to multivariate data, connecting back to the matrix theory from Chapter 2.
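A quick simulation sketch of the CLT, assuming NumPy; the choice of `Uniform(0, 1)` and the sample sizes are arbitrary:

```python
# Standardized sums of Uniform(0, 1) variables look Gaussian for large n.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000

# Uniform(0, 1) is far from Gaussian: flat density, bounded support.
samples = rng.uniform(0.0, 1.0, size=(trials, n))

# Standardize the sum: subtract n*mu, divide by sqrt(n)*sigma,
# where mu = 1/2 and sigma^2 = 1/12 for Uniform(0, 1).
z = (samples.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)

# The standardized sums should be close to standard normal:
# mean ~0, std ~1, and ~68% of mass within one standard deviation.
print(f"mean = {z.mean():+.3f}, std = {z.std():.3f}")
print(f"P(|Z| < 1) = {(np.abs(z) < 1).mean():.3f}  (Gaussian: 0.683)")
```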
## Statistical Estimation

- Maximum Likelihood Estimation (MLE) finds parameters that maximize $p(\mathcal{D} \mid \theta)$. It is the most data-driven approach and is equivalent to minimizing cross-entropy loss. Always work with the log-likelihood for numerical stability.
- Maximum A Posteriori (MAP) estimation adds a prior term: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$. A Gaussian prior yields L2 regularization; a Laplace prior yields L1 regularization.
- As the dataset size $n \to \infty$, the MAP estimate converges to the MLE: the data overwhelms the prior. Priors matter most when data is scarce; the sketch after this list makes this concrete.
- The bias-variance tradeoff is embodied in the MLE vs. MAP choice: MLE has no bias from priors but higher variance; MAP introduces bias but reduces variance.
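A sketch of MAP-to-MLE convergence for a Bernoulli parameter under an assumed Beta(5, 5) prior. The closed-form estimators are standard; the prior and the coin bias 0.7 are illustrative choices:

```python
# MLE vs. MAP for a coin flip, using the standard closed forms:
#   MLE: heads / n          MAP: (heads + a - 1) / (n + a + b - 2)
import numpy as np

rng = np.random.default_rng(0)
a, b = 5.0, 5.0          # Beta prior centered on 0.5 (a mild "fair coin" belief)
true_theta = 0.7

for n in (10, 100, 10_000):
    heads = rng.binomial(n, true_theta)
    mle = heads / n
    map_ = (heads + a - 1) / (n + a + b - 2)
    print(f"n={n:>6}  MLE={mle:.3f}  MAP={map_:.3f}")

# With small n the prior pulls MAP toward 0.5; as n grows the two
# estimates converge on each other (and on the true 0.7).
```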
## Information Theory

- Entropy $H(X) = -\sum_x p(x) \log p(x)$ measures uncertainty. Among distributions over a fixed finite set of outcomes, the uniform distribution has maximum entropy; zero entropy means the outcome is deterministic.
- Cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ is the standard classification loss function. Minimizing cross-entropy during training is equivalent to performing MLE and to minimizing $D_{\text{KL}}(p_{\text{data}} \| q_{\text{model}})$.
- KL divergence $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$ measures how far $q$ is from $p$: the expected extra bits paid for encoding samples from $p$ with a code optimized for $q$. It is non-negative, asymmetric, and not a true metric. Forward KL is mass-covering; reverse KL is mode-seeking.
- Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ captures all statistical dependence between variables, including nonlinear relationships. It equals zero if and only if the variables are independent. The snippet after this list verifies both identities numerically.
- The data processing inequality ($I(X; Z) \leq I(X; Y)$ for a Markov chain $X \to Y \to Z$) constrains what neural networks can learn: information lost in early layers cannot be recovered.
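A numerical check of the identities above on small hand-picked distributions; the probability tables are assumptions chosen for illustration:

```python
# Verifying D_KL(p||q) = H(p, q) - H(p) and I(X;Y) = H(X) - H(X|Y)
# on small discrete distributions, in bits (log base 2).
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# KL divergence two ways: by definition and via the cross-entropy gap.
kl_def = np.sum(p * np.log2(p / q))
kl_gap = cross_entropy(p, q) - entropy(p)
print(f"D_KL(p||q) = {kl_def:.4f} = {kl_gap:.4f}")   # identical
print(f"D_KL(q||p) = {np.sum(q * np.log2(q / p)):.4f}  (asymmetric)")

# Mutual information from a joint distribution over (X, Y).
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
px, py = joint.sum(axis=1), joint.sum(axis=0)
mi = np.sum(joint * np.log2(joint / np.outer(px, py)))

# Cross-check against I(X;Y) = H(X) - H(X|Y), using the chain rule
# H(X|Y) = H(X, Y) - H(Y).
h_x_given_y = entropy(joint.ravel()) - entropy(py)
print(f"I(X;Y) = {mi:.4f} = {entropy(px) - h_x_given_y:.4f}")
```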
## Practical Essentials

- Always work in log-space when dealing with products of probabilities to prevent numerical underflow.
- Use the log-sum-exp trick for numerically stable computation of $\log \sum_i \exp(a_i)$; a sketch follows this list.
- Laplace smoothing prevents zero probabilities for unseen events and has a Bayesian interpretation as MAP estimation with a Beta prior (Dirichlet in the multi-class case); see the count-smoothing example below.
- Softmax converts logits to probabilities; its temperature parameter controls the sharpness of the distribution, which is critical for controlling generation diversity in language models. The last snippet below demonstrates the effect.
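A sketch of the log-sum-exp trick; the large logits are chosen so that naive evaluation overflows:

```python
# Subtracting the max before exponentiating keeps every exponent <= 0,
# so nothing overflows; the shift is added back outside the log.
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(a))                      # ~1002.41
# np.log(np.sum(np.exp(a))) would print inf after an overflow warning.
```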
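A sketch of Laplace smoothing on made-up counts; `alpha` is the pseudo-count added per event, and `alpha = 1` recovers classic add-one smoothing:

```python
# Add-alpha smoothing: every event gets alpha pseudo-observations.
import numpy as np

def smoothed_probs(counts, alpha=1.0):
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

counts = [8, 2, 0]                        # the third event was never observed
print(smoothed_probs(counts, alpha=0.0))  # 0.8, 0.2, 0.0 -- the zero kills any product of probabilities
print(smoothed_probs(counts, alpha=1.0))  # 0.692, 0.231, 0.077 -- unseen event keeps some mass
```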
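A sketch of temperature-scaled softmax with arbitrary example logits:

```python
# Dividing the logits by T before normalizing flattens (T > 1)
# or sharpens (T < 1) the resulting distribution.
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                     # log-sum-exp-style shift for stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(softmax(logits, t), 3)}")

# Low T concentrates mass on the argmax (greedy-like sampling);
# high T spreads it out (more diverse generation).
```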
## Connections Across the Book
| Concept | Where It Appears Next |
|---|---|
| Bayes' theorem | Bayesian methods (Ch. 10), Naive Bayes (Ch. 6) |
| Gaussian distribution | Weight initialization (Ch. 11), VAEs (Ch. 16) |
| MLE / Cross-entropy loss | Training neural networks (Ch. 12), language models (Ch. 21) |
| MAP / Regularization | Regularization and generalization (Ch. 13) |
| KL divergence | VAEs (Ch. 16), RLHF (Ch. 25), knowledge distillation (Ch. 33) |
| Mutual information | Feature selection (Ch. 9), contrastive learning (Ch. 16) |
| Softmax / Temperature | Transformer attention (Ch. 18), text generation (Ch. 21) |
| Entropy / Perplexity | Language model evaluation (Ch. 22) |