Chapter 4: Key Takeaways

  1. Information is surprise, and entropy is average surprise. The information content of an event is $I(x) = -\log p(x)$: rare events carry more information. Entropy $H(X) = \mathbb{E}[-\log p(X)]$ measures the average uncertainty in a distribution. Low entropy means predictability; high entropy means uncertainty. This is the foundation for everything else in this chapter.
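Both definitions are short enough to compute directly. A minimal sketch in plain Python (base-2 logarithms, so units are bits; the helper names are ours, not from the chapter):

```python
import math

def surprisal(p: float) -> float:
    """Information content -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Shannon entropy: the average surprisal of a discrete distribution, in bits."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.99, 0.01]

print(surprisal(0.01))       # rare event: high information, ~6.64 bits
print(entropy(fair_coin))    # maximal uncertainty: 1.0 bit
print(entropy(biased_coin))  # near-certain outcome: ~0.08 bits
```

A fair coin is maximally unpredictable at one bit per flip, while the biased coin carries almost no average surprise even though its rare outcome is individually very informative.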

  2. Cross-entropy loss = negative log-likelihood = KL divergence (up to a constant). The chain of equivalences $\text{MLE} \Leftrightarrow \text{minimize NLL} \Leftrightarrow \text{minimize cross-entropy} \Leftrightarrow \text{minimize } D_{\text{KL}}(p_{\text{data}} \| q_{\text{model}})$ explains why cross-entropy is the canonical classification loss. It is not an arbitrary choice — it makes the model distribution as close as possible to the data distribution in the information-theoretic sense.
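The identity behind this chain is $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$: since $H(p)$ is fixed by the data, minimizing cross-entropy over $q$ is the same as minimizing the KL term. A small numeric check (the distribution values are illustrative):

```python
import math

def cross_entropy(p, q):
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "data" distribution
q = [0.6, 0.3, 0.1]   # model distribution

# H(p, q) = H(p) + D_KL(p || q): the H(p) term does not depend on q.
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12

# With a one-hot target, cross-entropy collapses to the NLL of the true class:
onehot = [0.0, 1.0, 0.0]
assert abs(cross_entropy(onehot, q) - (-math.log(q[1]))) < 1e-12
```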

  3. Mutual information captures nonlinear dependencies that correlation misses. Pearson correlation measures linear association; it can be zero even when variables are perfectly dependent (e.g., $Y = X^2$ with $X$ distributed symmetrically about zero). Mutual information $I(X; Y) = 0$ if and only if $X$ and $Y$ are statistically independent. For feature selection, MI-based ranking catches nonlinear and categorical effects that correlation-based methods discard.
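The $Y = X^2$ example can be checked exactly on a three-point distribution. A sketch in plain Python, computing both quantities from the empirical joint (helper code is ours):

```python
import math
from collections import Counter

xs = [-1, 0, 1]           # X uniform on {-1, 0, 1}, symmetric about zero
ys = [x * x for x in xs]  # Y = X^2: perfectly dependent on X

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
print(cov)  # 0.0 — covariance (hence correlation) vanishes despite total dependence

joint = Counter(zip(xs, ys))
px, py = Counter(xs), Counter(ys)
mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
         for (x, y), c in joint.items())
print(mi)   # ~0.92 bits > 0 — mutual information detects the dependence
```

Because $Y$ is a deterministic function of $X$ here, $I(X; Y) = H(Y) \approx 0.92$ bits, while the correlation is exactly zero.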

  4. The data processing inequality constrains what neural networks can do. No transformation of data can create information — it can only preserve or destroy it. This means each layer of a neural network has at most as much information about the input as the previous layer. Good representations selectively discard irrelevant information while preserving task-relevant information (the information bottleneck principle).
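A toy Markov chain $X \to Y \to Z$ makes the inequality concrete: each deterministic "layer" below can only keep or destroy bits of information about $X$, never create them (function and variable names are ours):

```python
import math
from collections import Counter

def mutual_info(pairs):
    """I(A; B) in bits from a list of equally weighted (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in joint.items())

xs = [0, 1, 2, 3]          # X uniform over four values: 2 bits of entropy
ys = [x // 2 for x in xs]  # layer 1 keeps only the high bit of X
zs = [0] * len(ys)         # layer 2 maps everything to a constant

print(mutual_info(list(zip(xs, ys))))  # 1.0 bit survives layer 1
print(mutual_info(list(zip(xs, zs))))  # 0.0 bits: layer 2 cannot recreate what layer 1 kept
```

The chain satisfies $I(X; Z) \le I(X; Y) \le H(X)$, exactly as the data processing inequality requires.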

  5. KL divergence is asymmetric, and the direction matters. Forward KL ($D_{\text{KL}}(p \| q)$, used in MLE) is mean-seeking: the model tries to cover all modes of the true distribution. Reverse KL ($D_{\text{KL}}(q \| p)$, used in variational inference) is mode-seeking: the approximation concentrates on one mode. This asymmetry determines the qualitative behavior of different training procedures.
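A three-point example shows both the asymmetry and why the direction matters. Take a bimodal $p$ and two candidate approximations, one covering both modes and one concentrating on a single mode (the candidate distributions are illustrative):

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; infinite if q = 0 anywhere p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

p         = [0.5, 0.5, 0.0]   # bimodal "true" distribution
q_cover   = [0.4, 0.4, 0.2]   # spreads mass over both modes (and beyond)
q_onemode = [0.9, 0.1, 0.0]   # concentrates on a single mode

# Forward KL (the MLE direction) prefers the mode-covering candidate:
print(kl(p, q_cover), kl(p, q_onemode))    # ~0.22 < ~0.51

# Reverse KL (the variational-inference direction) prefers the mode-seeker,
# and diverges for q_cover because it puts mass where p has none:
print(kl(q_cover, p), kl(q_onemode, p))    # inf vs ~0.37
```

The infinite reverse KL is the "zero-forcing" behavior in miniature: $q$ is punished without limit for placing probability where $p$ has none, so the optimal reverse-KL approximation hides inside one mode.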

  6. The ELBO is the bridge between information theory and Bayesian inference. The Evidence Lower Bound decomposes into an expected log-likelihood (data fit) minus the KL divergence from the approximate posterior to the prior (regularization). This decomposition is the engine of variational autoencoders, variational inference, and modern Bayesian deep learning. Understanding it here — in its information-theoretic home — makes it recognizable when it reappears in Chapters 12 and 20.
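For a fully Gaussian toy model, both ELBO terms are available in closed form, which makes the decomposition and the lower-bound property easy to verify. The model below is an illustrative assumption, not from the chapter: prior $z \sim \mathcal{N}(0, 1)$, likelihood $x \mid z \sim \mathcal{N}(z, 1)$, variational posterior $q(z) = \mathcal{N}(\mu, \sigma^2)$.

```python
import math

def elbo(x, mu, sigma2):
    """ELBO = E_q[log p(x|z)] - D_KL(q(z) || p(z)) for the Gaussian toy model."""
    # Data-fit term: E_q[log N(x; z, 1)] has a closed form under q(z) = N(mu, sigma2).
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + sigma2)
    # Regularizer: closed-form KL between N(mu, sigma2) and the N(0, 1) prior.
    kl_to_prior = 0.5 * (mu ** 2 + sigma2 - 1 - math.log(sigma2))
    return expected_loglik - kl_to_prior

x = 1.0
# Marginal evidence: p(x) = N(x; 0, 2), so log p(x) is exact here.
log_evidence = -0.5 * math.log(2 * math.pi * 2) - x ** 2 / 4

print(elbo(x, mu=0.0, sigma2=1.0))   # loose bound, ~-1.92
print(elbo(x, mu=x/2, sigma2=0.5))   # true posterior N(x/2, 1/2): bound is tight, ~-1.52
print(log_evidence)                  # ~-1.52
```

The gap between the ELBO and $\log p(x)$ is exactly $D_{\text{KL}}(q \| p(z \mid x))$, so the bound is tight precisely when $q$ matches the true posterior.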

  7. Entropy decomposition separates what you know from what you do not. Total predictive uncertainty can be split into aleatoric (inherent noise, irreducible) and epistemic (model disagreement, potentially reducible) components using mutual information. This decomposition matters in practice: epistemic uncertainty signals where more data or better models could help, while aleatoric uncertainty signals a fundamental limit.
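For a classifier ensemble, this decomposition is one line of arithmetic: total predictive entropy $H(\bar{p})$ splits into the average member entropy (aleatoric) plus the mutual information between the prediction and the model (epistemic). A sketch in plain Python (function names are ours):

```python
import math

def entropy(p):
    """Shannon entropy of a class-probability vector, in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def decompose(ensemble_preds):
    """Split predictive uncertainty (in bits) for one input, given each
    ensemble member's class-probability vector."""
    m, k = len(ensemble_preds), len(ensemble_preds[0])
    mean_pred = [sum(p[i] for p in ensemble_preds) / m for i in range(k)]
    total = entropy(mean_pred)                             # H(mean prediction)
    aleatoric = sum(entropy(p) for p in ensemble_preds) / m  # mean of entropies
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic

# Members agree on a noisy answer: high aleatoric, ~zero epistemic uncertainty.
print(decompose([[0.5, 0.5], [0.5, 0.5]]))
# Members disagree confidently: the uncertainty is almost entirely epistemic.
print(decompose([[0.99, 0.01], [0.01, 0.99]]))
```

Both inputs have one bit of total predictive uncertainty, but only the second would benefit from more data or a better model, which is exactly the distinction the decomposition is designed to surface.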