Chapter 4 Further Reading
Foundational Textbooks
Probability and Statistics
- Bertsekas, D. P. & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific. An outstanding introduction to probability theory with a good balance of rigor and intuition. Particularly strong on conditional probability, Bayes' theorem, and random variables. Recommended as a primary reference for Sections 4.1 and 4.2.
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. A compact but thorough treatment of both probability and statistics, aimed at readers with mathematical maturity. Covers MLE, MAP estimation, and Bayesian inference with clarity (a brief sketch contrasting MLE and MAP follows this list). Excellent companion to Sections 4.3 and 4.4.
- Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage Learning. The standard graduate-level statistics textbook. Provides a rigorous treatment of estimation theory, including properties of MLE, sufficient statistics, and the Cramér-Rao bound. Best for readers seeking a deep understanding of the material in Section 4.3.
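To make the MLE/MAP distinction concrete before turning to these texts, here is a minimal sketch for a Bernoulli coin-flip model. The simulated data and the Beta(2, 2) prior are illustrative assumptions, not taken from any of the books above.

```python
# Minimal MLE vs. MAP sketch for a Bernoulli parameter
# (illustrative assumptions: a simulated coin with bias 0.7 and a Beta(2, 2) prior).
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)       # 20 simulated coin flips
k, n = int(data.sum()), data.size

theta_mle = k / n                          # MLE: maximizer of the likelihood
a, b = 2.0, 2.0                            # Beta(2, 2) prior, mildly favoring 0.5
theta_map = (k + a - 1) / (n + a + b - 2)  # MAP: mode of the Beta(a + k, b + n - k) posterior

print(f"MLE: {theta_mle:.3f}   MAP: {theta_map:.3f}")
```

With only 20 flips the prior visibly pulls the MAP estimate toward 0.5; as the sample size grows the two estimates converge.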
Information Theory
- Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. The definitive textbook on information theory. Covers entropy, KL divergence, mutual information, the data processing inequality, and rate-distortion theory with mathematical rigor. Chapters 2-4 of this book directly support Section 4.5 of our text (a short computational sketch of these quantities follows this list).
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. A uniquely accessible book that weaves together information theory, Bayesian inference, and machine learning. Freely available online. Highly recommended for readers who want to see the deep connections between these fields.
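As a computational companion to the definitions in Section 4.5, the sketch below evaluates entropy, KL divergence, and mutual information for small discrete distributions. The 2x2 joint distribution is an illustrative assumption.

```python
# Entropy, KL divergence, and mutual information for small discrete distributions
# (the 2x2 joint distribution below is an illustrative assumption).
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log2 p_i, skipping zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Mutual information as a KL divergence: I(X;Y) = D_KL(p(x,y) || p(x) p(y)).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mi = kl_divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

print(f"H(X) = {entropy(p_x):.3f} bits, I(X;Y) = {mi:.3f} bits")
```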
Machine Learning Perspective
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1-2 provide an excellent machine-learning-oriented introduction to probability, distributions, and Bayesian inference. The treatment of the exponential family of distributions and conjugate priors is particularly relevant (a brief conjugacy sketch follows this list).
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. A modern treatment of probability and statistics for machine learning. Covers all the topics in this chapter with clear explanations and connections to current practice. Freely available online.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. The advanced companion volume, covering variational inference, the information bottleneck, and other topics previewed in this chapter.
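To illustrate the conjugacy property highlighted in Bishop's treatment, here is a minimal Beta-Bernoulli updating sketch; the observation sequence is an illustrative assumption.

```python
# Beta-Bernoulli conjugacy sketch: the posterior stays in the Beta family,
# so Bayesian updating reduces to adding counts (the observations are illustrative).
a, b = 1.0, 1.0              # Beta(1, 1), i.e. a uniform prior on the success probability
for x in [1, 1, 0, 1]:       # observed coin flips
    a, b = a + x, b + (1 - x)
print(f"posterior: Beta({a:.0f}, {b:.0f}), posterior mean = {a / (a + b):.3f}")
```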
Key Research Papers
Information Theory in Deep Learning
- Shannon, C. E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal, 27(3), 379-423. The founding paper of information theory. Introduces entropy, mutual information, and the source coding theorem. One of the most influential papers in the history of science.
- Tishby, N., Pereira, F. C., & Bialek, W. (2000). "The Information Bottleneck Method." arXiv preprint physics/0004057. Introduces the information bottleneck framework, which formulates representation learning as an information-theoretic optimization problem (the objective is restated after this list). The theoretical basis for Case Study 2's discussion.
- Shwartz-Ziv, R. & Tishby, N. (2017). "Opening the Black Box of Deep Neural Networks via Information." arXiv preprint arXiv:1703.00810. Applies information bottleneck theory to analyze deep learning, tracking mutual information in hidden layers during training. Controversial but influential.
- Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2019). "On the Information Bottleneck Theory of Deep Learning." Journal of Statistical Mechanics: Theory and Experiment. A critical analysis of the information bottleneck theory, showing that the compression phase depends on the choice of activation function. Important for a balanced view.
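For reference, the information bottleneck objective of Tishby et al. (2000) trades compression of the input X against prediction of the target Y through a learned representation T:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Larger values of the trade-off parameter beta favor retaining information predictive of Y; smaller values favor stronger compression of X.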
KL Divergence and Variational Methods
- Kingma, D. P. & Welling, M. (2014). "Auto-Encoding Variational Bayes." Proceedings of ICLR. Introduces the VAE, which uses KL divergence as a regularizer in the ELBO objective. The reparameterization trick discussed in Section 4.6.4 originates here (a minimal sketch of both ingredients follows this list).
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). "Variational Inference: A Review for Statisticians." Journal of the American Statistical Association, 112(518), 859-877. A comprehensive review of variational inference, explaining the role of forward and reverse KL divergence in different approximation strategies.
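As a pointer into the Kingma & Welling paper, here is a minimal numpy sketch of the two ingredients named above: the reparameterization trick and the closed-form KL term between a diagonal Gaussian and a standard normal prior. The "encoder outputs" mu and log_var are illustrative constants, not a real encoder.

```python
# Reparameterization trick and the Gaussian KL term of the ELBO
# (the "encoder outputs" mu and log_var below are illustrative constants).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so z is a
# deterministic, differentiable function of (mu, log_var) plus external noise.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions (closed form).
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(f"z = {z}, KL term = {kl:.4f}")
```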
Mutual Information in Machine Learning
- van den Oord, A., Li, Y., & Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." arXiv preprint arXiv:1807.03748. Introduces the InfoNCE loss, a practical lower bound on mutual information used for representation learning (sketched after this list). Connects directly to the mutual information concepts in Section 4.5.4.
- Belghazi, M. I., et al. (2018). "Mutual Information Neural Estimation." Proceedings of ICML. Proposes MINE, a neural-network-based estimator of mutual information. Shows how to scale mutual information estimation to high-dimensional data.
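The following is a minimal numpy sketch of the InfoNCE loss. The random context/target embeddings and the dot-product critic are illustrative assumptions; in CPC the scores come from a learned encoder and autoregressive model.

```python
# InfoNCE loss sketch (illustrative random embeddings and a dot-product critic).
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 8, 16
ctx = rng.standard_normal((batch, dim))   # context embeddings
tgt = rng.standard_normal((batch, dim))   # matching target embeddings

scores = ctx @ tgt.T                      # scores[i, j]: critic score for (ctx_i, tgt_j)
# Classify the positive pair (i, i) against the other targets in the batch.
log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
# log(batch) - loss is a lower bound on I(ctx; tgt); it is near zero here,
# as expected for independent random embeddings.
print(f"InfoNCE loss = {loss:.3f}, MI bound (log N - loss) = {np.log(batch) - loss:.3f} nats")
```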
Online Resources
- 3Blue1Brown: "Bayes theorem, the geometry of changing beliefs" (YouTube). An outstanding visual explanation of Bayes' theorem that builds geometric intuition (a worked numeric example follows this list). Excellent supplement to Section 4.1.5.
- StatQuest with Josh Starmer: Probability and Statistics playlist (YouTube). Clear, visual explanations of probability distributions, MLE, and related topics. Good for readers who benefit from video instruction.
- Colah's Blog: "Visual Information Theory" (colah.github.io). A beautifully illustrated introduction to information theory with interactive visualizations. Particularly good for building intuition about entropy, cross-entropy, and KL divergence.
- Stanford CS229 Lecture Notes on Generative Learning Algorithms. Andrew Ng's notes on Naive Bayes and generative/discriminative models provide additional context for Case Study 1.
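For readers who want a quick numeric anchor before watching the 3Blue1Brown video, here is a standard worked instance of Bayes' theorem. The numbers (90% sensitivity, 5% false-positive rate, 1% prevalence) are illustrative and not taken from the video.

```latex
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}
            = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.154
```

Even with a seemingly accurate test, the low prior keeps the posterior well below one half, which is exactly the kind of intuition the video builds geometrically.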
Historical Context
- Laplace, P.-S. (1812). Théorie Analytique des Probabilités (Analytic Theory of Probabilities). The foundational work that established many of the concepts in probability theory, including the "sunrise problem" that motivated Laplace's rule of succession, a precursor to Laplace smoothing (the rule is stated after this list).
- Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability). The axiomatization of probability theory that provides the rigorous foundation used in Section 4.1.1.
- Fisher, R. A. (1922). "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society of London. The paper that introduced maximum likelihood estimation, one of the most important contributions to statistics.
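For reference, Laplace's rule of succession states that under a uniform prior on the success probability, after observing k successes in n independent trials,

```latex
P(\text{success on trial } n+1 \mid k \text{ successes in } n \text{ trials}) = \frac{k+1}{n+2},
```

which is add-one (Laplace) smoothing applied to the empirical frequency k/n.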
Suggested Reading Order
For readers new to probability and information theory, we recommend:
- Start with Bishop Ch. 1-2 or Murphy (2022) Ch. 1-4 for a machine-learning-focused introduction
- Deepen understanding with Bertsekas & Tsitsiklis for probability or Cover & Thomas for information theory
- Connect to practice with MacKay's book, which beautifully bridges theory and application
- Explore the frontier with the research papers listed above, particularly Shannon (1948) and Kingma & Welling (2014)