Probabilistic and Bayesian methods provide a principled framework for reasoning under uncertainty. Instead of producing single-point predictions, Bayesian models yield full distributions over parameters and predictions, answering not just "What is the best guess?" but also "How confident should we be?" This makes them essential for safety-critical applications, small-data regimes, and decision-making under ambiguity.
The prior $p(\theta)$ encodes what we believe before seeing data. The likelihood $p(\mathcal{D} \mid \theta)$ quantifies how well a given parameter value explains the observed data. The posterior $p(\theta \mid \mathcal{D})$ is the updated belief after observing data, obtained via Bayes' theorem: $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta)$.
The posterior predictive distribution integrates over parameter uncertainty to produce predictions: $p(\tilde{y} \mid \mathcal{D}) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta$.
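As a concrete example, the Beta-Binomial model makes this integral tractable in closed form. A minimal sketch, assuming a made-up coin-flip dataset and a Beta(2, 2) prior:

```python
import numpy as np
from scipy import stats

# Hypothetical coin-flip data: 7 heads out of 10 tosses.
heads, tosses = 7, 10

# Beta(2, 2) prior on the head probability theta (a weakly informative choice).
a0, b0 = 2.0, 2.0

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
a_post, b_post = a0 + heads, b0 + (tosses - heads)

# Posterior predictive probability that the next toss is heads:
# p(y=1 | D) = E[theta | D] = a_post / (a_post + b_post)  (exact, no sampling needed).
p_next_heads = a_post / (a_post + b_post)

# The same quantity via Monte Carlo integration over the posterior.
theta_samples = stats.beta(a_post, b_post).rvs(size=100_000, random_state=0)
p_next_heads_mc = theta_samples.mean()

print(p_next_heads, p_next_heads_mc)
```

The exact answer and the Monte Carlo estimate agree, illustrating that the predictive probability averages over the whole posterior rather than plugging in a single estimate of $\theta$.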
Prior Selection
Conjugate priors make the posterior analytically tractable (Beta-Binomial, Normal-Normal, Gamma-Poisson).
Weakly informative priors regularize without dominating the data (e.g., broad Gaussians).
Informative priors encode domain knowledge (e.g., historical data from previous experiments).
Always perform prior predictive checks: sample from the prior and verify that the implied predictions are plausible (a sketch follows this list).
Sensitivity analysis verifies that conclusions are robust across reasonable prior choices.
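A minimal sketch of a prior predictive check for a simple Gaussian model; the prior scales and sample sizes below are illustrative assumptions, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model: y_i ~ Normal(mu, sigma), with priors mu ~ Normal(0, 10) and
# sigma ~ HalfNormal(5). These prior scales are illustrative assumptions.
n_sims, n_obs = 1000, 50

mu_prior = rng.normal(0.0, 10.0, size=n_sims)
sigma_prior = np.abs(rng.normal(0.0, 5.0, size=n_sims))

# Simulate datasets implied by the prior alone (no real data involved).
y_sim = rng.normal(mu_prior[:, None], sigma_prior[:, None], size=(n_sims, n_obs))

# Inspect summaries of the simulated data: if these are wildly implausible
# for the quantity being modelled, the prior needs rethinking.
print("simulated data range:", y_sim.min(), y_sim.max())
print("typical dataset means:", np.percentile(y_sim.mean(axis=1), [5, 50, 95]))
```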
MAP vs. Full Bayesian
The Maximum A Posteriori (MAP) estimate finds the single most probable parameter value. It is equivalent to regularized maximum likelihood (L2 regularization corresponds to a Gaussian prior).
Full Bayesian inference integrates over the entire posterior, capturing uncertainty. This is more expensive but more informative than MAP.
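To see the MAP/regularization connection concretely for linear regression with Gaussian noise variance $\sigma^2$ and prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$, the negative log posterior is, up to additive constants,

$$-\log p(\mathbf{w} \mid \mathcal{D}) = \frac{1}{2\sigma^2} \sum_i \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \frac{1}{2\sigma_w^2} \|\mathbf{w}\|^2 + \text{const},$$

so the MAP estimate is exactly ridge regression with penalty $\lambda = \sigma^2 / \sigma_w^2$.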
Models and Algorithms
Naive Bayes Classifiers
Assume features are conditionally independent given the class: $p(\mathbf{x} \mid y) = \prod_j p(x_j \mid y)$.
Gaussian Naive Bayes models each feature as a Gaussian per class.
Multinomial Naive Bayes is the workhorse for text classification (bag-of-words features).
Despite the strong independence assumption, Naive Bayes is often surprisingly competitive, especially with limited data.
Training is extremely fast ($\mathcal{O}(n \cdot d)$ for $n$ samples and $d$ features) and requires no gradient optimization.
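A minimal sketch with scikit-learn's MultinomialNB on bag-of-words features; the toy corpus below is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: spam vs. ham.
texts = [
    "win a free prize now", "limited offer click here",
    "meeting at noon tomorrow", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed directly into the multinomial likelihood;
# alpha is the Laplace/Lidstone smoothing pseudo-count.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize offer"]))            # likely "spam"
print(model.predict_proba(["report for the meeting"]))
```

The `alpha` smoothing parameter is itself Bayesian in spirit: it acts as a symmetric Dirichlet prior over the per-class word distributions, preventing zero probabilities for unseen words.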
Bayesian Linear Regression
Places a prior on the weight vector: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$.
With a Gaussian likelihood (known noise variance), the posterior over $\mathbf{w}$ is Gaussian with closed-form mean and covariance.
Predictions include uncertainty bands that widen in regions with fewer observations.
L2 regularization (Ridge regression) is the MAP special case of Bayesian linear regression with a Gaussian prior.
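A minimal numpy sketch of the closed-form posterior and the widening predictive bands, assuming a known noise variance; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D data with an intercept feature.
n = 30
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with bias column
y = 0.5 + 2.0 * x + rng.normal(0, 0.5, size=n)

sigma2 = 0.5 ** 2      # assumed known noise variance
sigma_w2 = 10.0        # prior variance on the weights: w ~ N(0, sigma_w2 * I)

# Posterior: N(mean, cov) with
#   cov  = (X^T X / sigma2 + I / sigma_w2)^{-1}
#   mean = cov @ X^T y / sigma2
cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / sigma_w2)
mean = cov @ X.T @ y / sigma2

# Predictive distribution at new inputs: variance grows away from the data.
x_new = np.array([0.0, 10.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
pred_mean = X_new @ mean
pred_var = sigma2 + np.einsum("ij,jk,ik->i", X_new, cov, X_new)

print(pred_mean, np.sqrt(pred_var))
```

The predictive standard deviation at $x = 10$ (far outside the training range) is noticeably larger than at $x = 0$, which is the "widening uncertainty bands" behaviour described above.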
Markov Chain Monte Carlo (MCMC)
When the posterior is intractable, MCMC generates samples from it by constructing a Markov chain whose stationary distribution is the target posterior.
Metropolis-Hastings proposes new states and accepts/rejects based on the posterior ratio. Simple to implement (see the sketch after this list) but can be slow to converge.
Gibbs sampling samples each parameter from its full conditional distribution. Efficient when the conditionals are available in closed form, but it mixes poorly when parameters are strongly correlated.
Hamiltonian Monte Carlo (HMC) uses gradient information to propose distant, high-probability states. More efficient in high dimensions.
MCMC diagnostics include trace plots, autocorrelation, effective sample size, and the Gelman-Rubin $\hat{R}$ statistic.
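A minimal random-walk Metropolis sketch; the target log density below (a standard Normal) is a stand-in for any unnormalized log posterior:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(theta):
    """Unnormalized log posterior; a standard Normal stands in here."""
    return -0.5 * theta ** 2

def metropolis(log_post, n_samples=10_000, step=1.0, init=0.0):
    samples = np.empty(n_samples)
    theta = init
    lp = log_post(theta)
    accepted = 0
    for i in range(n_samples):
        proposal = theta + rng.normal(0.0, step)   # symmetric random-walk proposal
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio); work in log space.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
            accepted += 1
        samples[i] = theta
    return samples, accepted / n_samples

samples, accept_rate = metropolis(log_post)
burn = samples[2000:]                      # discard warm-up draws
print(accept_rate, burn.mean(), burn.std())
```

In practice you would tune the step size via the acceptance rate, discard warm-up, and run the diagnostics listed above (trace plots, $\hat{R}$, effective sample size) before trusting the draws.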
Variational Inference
Approximates the posterior with a simpler distribution $q(\theta)$ by maximizing the Evidence Lower Bound (ELBO).
Mean-field approximation assumes all parameters are independent under $q$: fast, but it may miss posterior correlations.
Trades accuracy for speed: variational inference scales better than MCMC to large datasets and high-dimensional models.
The quality of the approximation depends on the expressiveness of the variational family.
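As an illustration, consider coordinate-ascent mean-field VI for a Normal model with unknown mean $\mu$ and precision $\tau$ (Normal prior on $\mu$ given $\tau$, Gamma prior on $\tau$): the factorization $q(\mu)\,q(\tau)$ yields Normal and Gamma factors with closed-form updates. A minimal sketch, with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100)           # toy data
n, xbar = len(x), x.mean()

# Priors: mu | tau ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0)  (illustrative values).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
a_n = a0 + (n + 1) / 2                       # fixed by the model structure
b_n = b0                                     # initial guess, refined below
for _ in range(50):
    e_tau = a_n / b_n                        # E_q[tau]
    # Update q(mu): precision-weighted combination of prior mean and sample mean.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # Update q(tau): expected sums of squares under q(mu).
    e_sq = np.sum((x - mu_n) ** 2) + n / lam_n
    e_prior = lam0 * ((mu_n - mu0) ** 2 + 1 / lam_n)
    b_n = b0 + 0.5 * (e_sq + e_prior)

print("q(mu) mean:", mu_n, "q(tau) mean:", a_n / b_n)  # compare to 2.0 and 1/1.5**2
```

Each update maximizes the ELBO with respect to one factor while holding the other fixed; the loop converges in a handful of iterations for this conjugate example.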
Gaussian Processes
A Gaussian process (GP) defines a distribution over functions: any finite collection of function values is jointly Gaussian.
Fully specified by a mean function (usually zero) and a kernel (covariance) function.
RBF kernel: smooth functions with a characteristic length scale.
Periodic kernel: captures periodic patterns.
Composite kernels: sums and products of kernels model complex structure (trend + seasonality + noise).
GP regression provides exact posterior predictive distributions with calibrated uncertainty.
Marginal likelihood provides a principled way to optimize kernel hyperparameters and compare kernel structures.
Computational cost is $\mathcal{O}(n^3)$ due to matrix inversion, limiting GPs to moderate datasets. Sparse approximations (inducing points) reduce this to $\mathcal{O}(nm^2)$ where $m \ll n$.
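A minimal numpy sketch of exact GP regression with an RBF kernel; the length scale, noise level, and toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-|a-b|^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

# Toy 1D training data.
x_train = rng.uniform(-4, 4, size=20)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=20)
x_test = np.linspace(-6, 6, 200)

noise = 0.1 ** 2
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Posterior predictive: mean = K_s^T K^-1 y, cov = K_ss - K_s^T K^-1 K_s.
# The Cholesky factorization is the numerically stable route and is where
# the O(n^3) cost lives.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
pred_mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
pred_var = np.diag(K_ss) - np.sum(v ** 2, axis=0)

print(pred_mean[:3], np.sqrt(np.maximum(pred_var[:3], 0)))
```

The predictive standard deviation is small near the training inputs and reverts toward the prior far from them, which is the calibrated-uncertainty behaviour described above.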
Common Pitfalls
Ignoring prior sensitivity: Always check that your conclusions hold under different reasonable priors.
Trusting unchecked MCMC: Always examine trace plots, $\hat{R}$, and effective sample size before interpreting results.
Overly vague priors: "Non-informative" priors can be improper or lead to poor performance. Weakly informative priors are generally preferred.
Forgetting computational cost: Full Bayesian inference with MCMC can be orders of magnitude slower than point estimation. Budget your compute accordingly.
Confusing credible intervals with confidence intervals: A 95% credible interval means there is a 95% probability the parameter lies within the interval (given the model and data). This is a direct probability statement, unlike a frequentist confidence interval.
Bayesian vs. Frequentist: A Pragmatic View
Bayesian methods treat parameters as random variables with distributions; frequentist methods treat them as fixed but unknown.
In practice, the choice often matters less than correct implementation: with enough data, the likelihood dominates the prior and the two approaches give similar answers.
Bayesian methods shine with small data, sequential updating, and when uncertainty quantification is needed.
Frequentist methods are often simpler and faster for large-scale problems.
Many modern techniques (regularization, dropout, ensemble methods) have both Bayesian and frequentist interpretations.