Chapter 3: Further Reading
Probability Theory and Statistical Inference
1. Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (Springer, 2004)
The single best reference for the level of probability and inference covered in this chapter. Wasserman covers probability, estimation, hypothesis testing, Bayesian inference, and nonparametric methods with a precision and economy rare in statistics textbooks. The treatment is rigorous but concise — he proves what matters and states what does not need proving. Chapters 2-5 (Random Variables, Expectation, Inequalities, Convergence) formalize the law of large numbers (LLN), the central limit theorem (CLT), and concentration inequalities at exactly the level this book assumes. Chapter 9 (Parametric Inference) derives maximum likelihood estimation (MLE), Fisher information, and the Cramér-Rao bound. Chapter 11 (Bayesian Inference) gives a fair and practical treatment of the frequentist-Bayesian debate. If you read one reference from this list, make it this one.
Start with: Chapters 2-5 for theory, Chapter 9 for MLE, Chapter 11 for Bayesian methods.
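The convergence theorems Wasserman formalizes are easy to check by simulation. A minimal sketch — the Uniform(0,1) summands, sample sizes, and seed are illustrative choices, not from the text:

```python
import random

rng = random.Random(0)

# Law of large numbers: the sample mean of Uniform(0,1) draws
# approaches the true mean 1/2 as n grows.
n = 100_000
sample_mean = sum(rng.random() for _ in range(n)) / n

# Central limit theorem: standardized sample means are approximately
# standard normal, so about 95% should fall within +/- 1.96.
# Uniform(0,1) has mean 1/2 and variance 1/12.
def standardized_mean(m, rng):
    xbar = sum(rng.random() for _ in range(m)) / m
    return (xbar - 0.5) / ((1 / 12 / m) ** 0.5)

zs = [standardized_mean(200, rng) for _ in range(5000)]
frac_within = sum(1 for z in zs if abs(z) <= 1.96) / len(zs)
```

With these sample sizes, `sample_mean` lands close to 0.5 and `frac_within` close to 0.95, which is the empirical face of the theorems in Chapters 4-5.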
2. George Casella and Roger L. Berger, Statistical Inference, 2nd Edition (Cengage, 2002)
The standard graduate reference for mathematical statistics. Where Wasserman is concise, Casella and Berger are comprehensive. Their treatment of sufficient statistics (Chapter 6), point estimation (Chapter 7), and the theory behind MLE, including the exponential family, is definitive. The coverage of the Cramér-Rao bound, Rao-Blackwell theorem, and completeness provides the theoretical depth needed to understand why MLE has optimal properties. The exercises are notoriously thorough. Best used as a reference for specific proofs and derivations rather than cover-to-cover reading.
Start with: Chapter 3 (Common Families of Distributions) for exponential family theory, Chapter 7 (Point Estimation) for MLE properties.
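As a concrete instance of the MLE theory Casella and Berger develop: the exponential distribution is a one-parameter exponential family whose MLE has the closed form λ̂ = 1/x̄, which makes a handy check on a numerical maximizer. A sketch using simulated data and a simple grid search (the true parameter, sample size, and grid are illustrative choices):

```python
import math
import random

def log_likelihood(lam, data):
    """Exponential(lambda) log-likelihood: n*log(lam) - lam*sum(x)."""
    return len(data) * math.log(lam) - lam * sum(data)

rng = random.Random(0)
true_lam = 2.0
data = [rng.expovariate(true_lam) for _ in range(1000)]

# Closed-form MLE for the exponential family member: 1 / sample mean.
closed_form = len(data) / sum(data)

# Numerical MLE by grid search over a plausible range (a stand-in
# for a proper optimizer; the log-likelihood is concave in lambda).
grid = [0.01 * k for k in range(1, 1000)]
numeric = max(grid, key=lambda lam: log_likelihood(lam, data))
```

The two estimates agree to within the grid spacing, and both land near the true λ = 2 — the consistency property whose proof Chapter 7 supplies.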
3. Christopher M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006)
The bridge between statistical theory and machine learning practice. Bishop's Chapters 1-2 develop probability distributions, Bayesian inference, and MLE in the specific context of ML models — making explicit the connections that traditional statistics textbooks leave implicit. His treatment of the exponential family (Section 2.4) and Bayesian model comparison (Section 3.4) is particularly relevant to this chapter. The graphical models coverage (Chapters 8-9) extends the conditional independence concepts introduced here. Though the book was published in 2006, its mathematical foundations are timeless, and its Bayesian perspective on ML remains deeply influential.
Start with: Chapter 1 (Introduction, especially 1.2 on probability theory), Chapter 2 (Probability Distributions), Section 3.3 (Bayesian Linear Regression as a worked example).
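The conjugate Gaussian update at the heart of Bishop's Bayesian treatment fits in a few lines: with a normal likelihood of known variance and a normal prior on the mean, precisions add, and the posterior mean is a precision-weighted average of prior mean and data mean. A sketch with illustrative prior and data values:

```python
def posterior_normal_mean(prior_mu, prior_var, data, noise_var):
    """Conjugate update for the mean of a N(mu, noise_var) likelihood
    under a N(prior_mu, prior_var) prior. Precisions (inverse variances)
    add; the posterior mean weights each source by its precision."""
    n = len(data)
    prior_prec = 1.0 / prior_var
    data_prec = n / noise_var
    post_var = 1.0 / (prior_prec + data_prec)
    post_mu = post_var * (prior_prec * prior_mu + data_prec * sum(data) / n)
    return post_mu, post_var

# Illustrative numbers: a vague prior and five observations with unit noise.
mu, var = posterior_normal_mean(0.0, 100.0, [1.9, 2.2, 2.0, 2.1, 1.8], 1.0)
```

With a vague prior the posterior mean sits almost exactly at the data mean (2.0) — the MLE — while a tight prior would pull it toward the prior mean, the shrinkage behavior Bishop uses throughout Chapter 2.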
4. Art B. Owen, Monte Carlo Theory, Methods and Examples (2013, available at statweb.stanford.edu)
A comprehensive reference on Monte Carlo methods, freely available online. Owen covers basic Monte Carlo, importance sampling, stratified sampling, and variance reduction techniques at a level of rigor and depth that no other source matches. Chapters 2-3 (Basic Monte Carlo and Variance Reduction) extend the introduction in Section 3.12 of this chapter. Chapter 9 (Importance Sampling) provides the theoretical foundation for understanding when importance sampling works and when it fails catastrophically — essential background for the variational inference methods in Part IV. The bootstrap is covered in Chapter 7 with careful attention to the conditions under which it provides valid inference.
Start with: Chapter 2 (Basic Monte Carlo) for foundations, Chapter 9 (Importance Sampling) for the theory behind the technique introduced in this chapter.
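A minimal sketch of the two estimators just named — plain Monte Carlo versus importance sampling — applied to a rare-event tail probability, exactly the setting where the plain estimator struggles. The N(0,1) target, N(3,1) proposal, and sample sizes are illustrative choices, not from Owen's text:

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def plain_mc(n, rng):
    """Plain Monte Carlo estimate of P(X > 3) for X ~ N(0, 1):
    most draws miss the tail, so few samples contribute."""
    return sum(1 for _ in range(n) if rng.gauss(0, 1) > 3) / n

def importance_sampling(n, rng):
    """Importance-sampling estimate with proposal q = N(3, 1):
    draw x ~ q and average f(x) * p(x) / q(x), the likelihood-ratio
    weighted indicator. Roughly half the draws now hit the tail."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(3, 1)                      # draw from the proposal q
        if x > 3:                                # f(x) = indicator {x > 3}
            total += normal_pdf(x) / normal_pdf(x, mu=3.0)
    return total / n

rng = random.Random(0)
est_mc = plain_mc(100_000, rng)
est_is = importance_sampling(100_000, rng)
# True value for reference: 1 - Phi(3), about 0.00135.
```

At equal sample sizes the importance-sampling estimate has far lower variance here; Owen's Chapter 9 makes precise when such a proposal helps and when a mismatched one fails catastrophically.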
5. Bradley Efron and Trevor Hastie, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science (Cambridge, 2016)
A masterful synthesis by two of the most influential statisticians of the past fifty years. This book traces the intellectual arc from frequentist to Bayesian to computational methods, showing how the field evolved and why. Chapter 11 (Bootstrap Confidence Intervals) is the definitive practical guide — it covers the percentile method, the BCa method, and the conditions under which each is appropriate. Chapter 20 (Bayesian Inference and MCMC) contextualizes Bayesian methods within the broader history of statistical inference. What makes this book exceptional is its focus on why methods work and when they break — exactly the Understanding Why theme of this textbook.
Start with: Chapter 11 (Bootstrap) for practical guidance, Chapter 20 (Bayesian Inference) for historical and conceptual context, Chapters 1-2 for a beautiful overview of how classical and modern statistics relate.
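The percentile method Efron and Hastie describe is short enough to sketch in full: resample the data with replacement, recompute the statistic on each resample, and read the confidence limits off the empirical quantiles of the replicates. The data, resample count, and confidence level below are illustrative:

```python
import random
import statistics

def percentile_ci(data, stat=statistics.mean, n_boot=2000,
                  alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute
    the statistic, and take empirical alpha/2 and 1-alpha/2 quantiles
    of the sorted replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative sample; the interval brackets the sample mean (2.84).
sample = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 4.0, 2.2, 3.6, 2.8]
lo, hi = percentile_ci(sample)
```

This is the plain percentile interval; Chapter 11's BCa method adjusts these quantiles for bias and skewness, which is why it is preferred when the bootstrap distribution is asymmetric.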