Chapter 4: Further Reading
Essential Sources
1. David J.C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003)
The single best textbook connecting information theory to machine learning. MacKay's treatment is uniquely readable: he builds from Shannon's axioms through coding theory to Bayesian inference, with worked examples, exercises, and Python-era intuitions (despite predating the deep learning revolution). Chapters 2-4 cover entropy, relative entropy, and mutual information at a pace matching this chapter. Chapters 28-33 on Monte Carlo methods and variational inference provide the deeper treatment that Chapter 20 of this book builds upon. The full text is legally available free online at the author's website — a remarkable gift to the field. Read this if you read nothing else.
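To make these quantities concrete before opening MacKay, here is a minimal numeric sketch (a toy example using NumPy; the joint distribution is arbitrary and chosen purely for illustration) of entropy, relative entropy, and mutual information, including the identity that mutual information is the relative entropy between the joint distribution and the product of its marginals:

import numpy as np

# Toy joint distribution p(x, y) over two binary variables (values chosen only for illustration).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def entropy(p):
    # Shannon entropy in bits.
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    # Relative entropy D_KL(p || q) in bits; assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Mutual information I(X;Y) = D_KL( p(x,y) || p(x) p(y) ).
mi = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())

print(f"H(X) = {entropy(p_x):.3f} bits, H(Y) = {entropy(p_y):.3f} bits, I(X;Y) = {mi:.3f} bits")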
2. Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (Wiley, 2nd edition, 2006)
The definitive graduate reference for information theory. More rigorous and comprehensive than MacKay, but less machine-learning-oriented. The proofs of Gibbs' inequality (KL non-negativity), the data processing inequality, and the maximum entropy principle are clean and illuminating. Chapter 2 (Entropy, Relative Entropy, and Mutual Information) and Chapter 11 (Information Theory and Statistics) are directly relevant to this chapter. Chapter 10 (Rate-Distortion Theory) provides the foundation for the information bottleneck. Recommended for readers who want the full mathematical treatment and the connections to coding theory that Shannon originally intended.
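As a quick reminder of the statements being proved there (written in this chapter's notation; Cover and Thomas's differs only cosmetically): Gibbs' inequality says that relative entropy is non-negative,

D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \;\ge\; 0, \qquad \text{with equality iff } p = q,

and the data processing inequality says that for any Markov chain X \to Y \to Z,

I(X;Y) \;\ge\; I(X;Z),

so no amount of processing can increase the information a variable carries about its source.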
3. Claude E. Shannon, "A Mathematical Theory of Communication" (Bell System Technical Journal, 1948)
The paper that created the field. Remarkably readable for a 76-year-old technical paper. Shannon introduces entropy, proves its uniqueness from axioms, defines channel capacity, and establishes the fundamental limits of reliable communication — all in a single paper. Parts I and II cover the discrete case and are accessible with the background from this chapter. Reading the original gives you an appreciation for the elegance of Shannon's thinking and the breadth of his vision. Available freely at various online archives.
Reading guidance: Start with Parts I and II (discrete noiseless systems and the discrete channel with noise). Skip the continuous-case material (Parts III-V) on first reading unless you are interested in analog signal processing. The historical context matters: Shannon was solving a practical engineering problem (how to transmit telegraph messages reliably), and the abstraction he created turned out to be one of the most powerful frameworks in all of science.
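For reference, the two definitions at the heart of the discrete part of the paper, written here in this chapter's notation, are the entropy of a source and the capacity of a noisy channel:

H(X) = -\sum_x p(x) \log p(x), \qquad C = \max_{p(x)} I(X;Y),

where the maximum is taken over input distributions. The noisy-channel coding theorem then states that reliable communication is possible at any rate below C and impossible at any rate above it.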
4. Naftali Tishby, Fernando C. Pereira, and William Bialek, "The Information Bottleneck Method" (1999); Ravid Shwartz-Ziv and Naftali Tishby, "Opening the Black Box of Deep Neural Networks via Information" (2017)
Two papers that bookend the information bottleneck story. The 1999 paper introduces the IB framework as a principled approach to lossy compression: find a compressed representation of the input that retains as little information about the input as possible while preserving as much information as possible about the target. The 2017 paper applies this framework to deep learning, claiming that DNNs exhibit two training phases (fitting then compression) and that the compression phase explains generalization.
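Concretely, the 1999 paper formalizes this as an optimization over stochastic encodings p(t | x) of the input X into a representation T (the notation here follows this chapter rather than the paper's):

\min_{p(t \mid x)} \; I(X;T) - \beta\, I(T;Y),

where Y is the target and the Lagrange multiplier \beta \ge 0 sets the trade-off between compressing X and preserving information about Y.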
Reading guidance: Start with the 2017 paper (more accessible, more relevant to modern ML). Then read the critique by Saxe et al., "On the Information Bottleneck Theory of Deep Learning" (ICLR 2018), which shows that the compression phenomenon depends on activation functions and MI estimation methods. The IB framework remains a valuable conceptual tool for thinking about representations, even if the specific claims about training dynamics are debated.
5. Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov, "Importance Weighted Autoencoders" (ICLR 2016)
This paper provides one of the clearest derivations of the ELBO and its limitations, then proposes a tighter bound using importance weighting. Reading it bridges the ELBO preview in this chapter to the VAE treatment in Chapter 12. The first three pages — covering the standard VAE objective, the ELBO decomposition, and the gap between the ELBO and the true log-evidence — are an excellent complement to Section 4.13 of this chapter.
Reading guidance: Focus on Sections 1-3. The importance-weighted bound in Section 4 is elegant but requires comfort with importance sampling (covered briefly in Chapter 3; treated fully in Chapter 20). If you find the ELBO derivation in this chapter too condensed, this paper's step-by-step walkthrough will fill the gaps.
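For orientation while reading, the two bounds at issue, written in this chapter's notation rather than the paper's, are the standard ELBO,

\log p(x) \;\ge\; \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right] = \log p(x) - D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, p(z \mid x) \big),

whose gap is exactly the KL divergence between the approximate and true posteriors, and the importance-weighted bound,

\mathcal{L}_k = \mathbb{E}_{z_1, \dots, z_k \sim q(z \mid x)}\!\left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, z_i)}{q(z_i \mid x)} \right],

which recovers the ELBO at k = 1 and tightens monotonically toward \log p(x) as k grows.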