In This Chapter
- 10.1 Bayesian Thinking: A Review and Deepening
- 10.2 Naive Bayes Classifiers
- 10.3 Bayesian Linear Regression
- 10.4 Prior Selection: The Art and Science
- 10.5 Conjugate Priors
- 10.6 Markov Chain Monte Carlo (MCMC)
- 10.7 Variational Inference
- 10.8 Gaussian Processes
- 10.9 Bayesian Neural Networks
- 10.10 Probabilistic Programming
- 10.11 Bayesian vs. Frequentist Perspectives
- 10.12 Connection to Modern Generative Models
- 10.13 Practical Bayesian Modeling with PyMC
- 10.14 Putting It All Together: A Bayesian Workflow
- 10.15 Practical Guidelines for Bayesian AI Engineering
- 10.16 Summary
- References
Chapter 10: Probabilistic and Bayesian Methods
"Probability is the very guide of life." -- Cicero
In the preceding chapters, we built models that produced single-point predictions: a classifier outputs a label, a regressor outputs a number. But how confident should we be in those predictions? What if the training data is sparse? What if multiple hypotheses explain the data equally well? These are the questions that probabilistic and Bayesian methods answer with remarkable elegance.
This chapter takes the probability foundations we established in Chapter 4 and transforms them into a complete engineering framework. We will move from the abstract rules of probability to concrete algorithms that quantify uncertainty, update beliefs with data, and make decisions under ambiguity -- skills that separate robust AI systems from brittle ones.
By the end of this chapter, you will be able to:
- Apply Bayesian reasoning to update beliefs as new evidence arrives
- Implement Naive Bayes classifiers and understand their surprising effectiveness
- Perform Bayesian linear regression with full uncertainty quantification
- Select priors that encode domain knowledge without overwhelming the data
- Use Markov chain Monte Carlo (MCMC) to sample from complex posterior distributions
- Understand variational inference as a scalable alternative to MCMC
- Build and interpret Gaussian process models for nonparametric regression
- Articulate the practical differences between Bayesian and frequentist perspectives
10.1 Bayesian Thinking: A Review and Deepening
In Chapter 4, we introduced Bayes' theorem as a rule for inverting conditional probabilities. Now we elevate it from a formula to a philosophy of learning.
10.1.1 The Bayesian Update Cycle
At its core, Bayesian inference follows a three-step cycle:
- Specify a prior $p(\theta)$: What do we believe about the parameters before seeing data?
- Define a likelihood $p(\mathcal{D} \mid \theta)$: How probable is the observed data given specific parameter values?
- Compute the posterior $p(\theta \mid \mathcal{D})$: What should we believe after seeing data?
Bayes' theorem connects these quantities:
$$ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})} $$
where $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) \, p(\theta) \, d\theta$ is the marginal likelihood (or evidence), serving as a normalizing constant.
The beauty of this framework is its sequential nature. Today's posterior becomes tomorrow's prior when new data arrives:
$$ p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta) \, p(\theta \mid \mathcal{D}_1) $$
This makes Bayesian methods naturally suited to online learning and streaming data -- a topic we revisit in later chapters.
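To see the sequential property concretely, here is a minimal sketch using the Beta-Binomial update developed in the next subsection: updating on two batches of coin flips in sequence yields the same posterior as updating on all flips at once. The batch sizes and counts are illustrative.
from scipy import stats
# Start from a uniform Beta(1, 1) prior
alpha, beta = 1, 1
# Batch 1: 4 heads, 2 tails -> today's posterior becomes tomorrow's prior
alpha, beta = alpha + 4, beta + 2
# Batch 2: 3 heads, 1 tail -> update again
alpha, beta = alpha + 3, beta + 1
# Same result as updating on all 7 heads and 3 tails at once
print(stats.beta(alpha, beta).mean())    # sequential: Beta(8, 4)
print(stats.beta(1 + 7, 1 + 3).mean())   # all-at-once: Beta(8, 4)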
10.1.2 A Concrete Example: Estimating a Coin's Bias
Suppose we want to estimate the probability $\theta$ that a coin lands heads. We flip it 10 times and observe 7 heads.
Frequentist approach: The maximum likelihood estimate is $\hat{\theta} = 7/10 = 0.7$.
Bayesian approach: We place a Beta prior $p(\theta) = \text{Beta}(\alpha, \beta)$ on $\theta$. If we choose $\alpha = \beta = 1$ (a uniform prior expressing no preference), the posterior after observing $h = 7$ heads and $t = 3$ tails is:
$$ p(\theta \mid h, t) = \text{Beta}(\alpha + h, \beta + t) = \text{Beta}(8, 4) $$
The posterior mean is $8/12 \approx 0.667$, slightly pulled toward 0.5 compared to the MLE. More importantly, we get a full distribution over $\theta$, not just a point estimate. We can compute credible intervals, the probability that $\theta > 0.5$, or any other quantity of interest.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Prior parameters
alpha_prior, beta_prior = 1, 1
# Observed data
heads, tails = 7, 3
# Posterior parameters
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
# Create distributions
theta = np.linspace(0, 1, 200)
prior = stats.beta(alpha_prior, beta_prior)
likelihood_unnorm = stats.binom.pmf(heads, heads + tails, theta)
posterior = stats.beta(alpha_post, beta_post)
# 95% credible interval
ci_low, ci_high = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: [{ci_low:.3f}, {ci_high:.3f}]")
Posterior mean: 0.667
95% credible interval: [0.390, 0.893]
10.1.3 Why Bayesian Methods Matter in AI Engineering
The Bayesian framework offers several practical advantages for AI engineers:
- Uncertainty quantification: Every prediction comes with a measure of confidence, critical for safety-critical applications (autonomous vehicles, medical diagnosis).
- Regularization through priors: Priors naturally prevent overfitting, especially with small datasets. As we saw in Chapter 8, regularization penalties like L2 correspond to Gaussian priors.
- Model comparison: The marginal likelihood $p(\mathcal{D})$ provides a principled way to compare models, automatically balancing fit and complexity (Occam's razor).
- Data efficiency: By leveraging prior knowledge, Bayesian models can learn effectively from fewer examples.
- Sequential updating: New data is incorporated naturally without retraining from scratch.
10.2 Naive Bayes Classifiers
Despite the word "naive" in its name, Naive Bayes is one of the most practical and widely-deployed Bayesian models. It remains a strong baseline for text classification, spam filtering, and many other tasks.
10.2.1 The Conditional Independence Assumption
Given a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ and a class label $y$, Bayes' theorem gives:
$$ p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y) \, p(y)}{p(\mathbf{x})} $$
The challenge is modeling $p(\mathbf{x} \mid y)$, which requires estimating the joint distribution over all features -- exponentially expensive in general. Naive Bayes makes the conditional independence assumption:
$$ p(\mathbf{x} \mid y) = \prod_{j=1}^{d} p(x_j \mid y) $$
This reduces the problem from estimating one joint distribution to estimating $d$ univariate distributions. The decision rule becomes:
$$ \hat{y} = \arg\max_y \, p(y) \prod_{j=1}^{d} p(x_j \mid y) $$
In practice, we work with log probabilities to avoid numerical underflow:
$$ \hat{y} = \arg\max_y \left[ \log p(y) + \sum_{j=1}^{d} \log p(x_j \mid y) \right] $$
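As a small illustration of this decision rule, the sketch below scores classes in log space for a handful of categorical features. The class names, priors, and per-word likelihoods are invented for the example.
import numpy as np
# Hypothetical log-priors and per-word log-likelihoods for two classes
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
log_likelihood = {
    "spam": {"free": np.log(0.05), "meeting": np.log(0.001)},
    "ham": {"free": np.log(0.005), "meeting": np.log(0.03)},
}
def predict(words: list[str]) -> str:
    """Return the class maximizing log p(y) + sum_j log p(x_j | y)."""
    scores = {
        y: log_prior[y] + sum(log_likelihood[y][w] for w in words)
        for y in log_prior
    }
    return max(scores, key=scores.get)
print(predict(["free", "meeting"]))  # "ham" wins under these toy numbers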
10.2.2 Variants of Naive Bayes
The choice of $p(x_j \mid y)$ determines the variant:
| Variant | Feature Distribution | Typical Use Case |
|---|---|---|
| Gaussian NB | $p(x_j \mid y) = \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)$ | Continuous features |
| Multinomial NB | $p(x_j \mid y) = \frac{n_{jy} + \alpha}{n_y + \alpha d}$ | Word counts, TF-IDF |
| Bernoulli NB | $p(x_j \mid y) = p_{jy}^{x_j}(1-p_{jy})^{1-x_j}$ | Binary features |
| Complement NB | Uses complement class counts | Imbalanced text data |
10.2.3 Why Does Naive Bayes Work?
The conditional independence assumption is almost always violated in practice. Words in a document are correlated; pixels in an image are spatially dependent. Yet Naive Bayes classifiers often perform well. Why?
The key insight, formalized by Domingos and Pazzani (1997), is that classification accuracy depends on the decision boundary, not the probability estimates. Even if $p(y \mid \mathbf{x})$ is poorly calibrated, the ranking of classes may be correct. As long as the true class receives the highest (albeit incorrect) posterior probability, the classifier makes the right decision.
Furthermore, in high-dimensional spaces, the errors from the independence assumption may cancel out across features, leading to surprisingly good class rankings even when individual probability estimates are poor.
10.2.4 Worked Example: Text Classification with Naive Bayes
To make the Naive Bayes algorithm concrete, consider classifying movie reviews as positive or negative. Our training data consists of:
- Positive reviews: "great film loved it", "wonderful movie excellent acting"
- Negative reviews: "terrible movie waste time", "bad film boring plot"
Our vocabulary is: {great, film, loved, it, wonderful, movie, excellent, acting, terrible, waste, time, bad, boring, plot}. We compute $p(w \mid \text{positive})$ and $p(w \mid \text{negative})$ for each word.
For the positive class, the word "movie" appears once out of 8 total words: $p(\text{movie} \mid +) = 1/8 = 0.125$. For the negative class, "movie" appears once out of 8 words: $p(\text{movie} \mid -) = 1/8 = 0.125$. Words like "great" appear only in positive reviews: $p(\text{great} \mid +) = 1/8 = 0.125$, $p(\text{great} \mid -) = 0/8 = 0$.
The zero probability for "great" in the negative class is problematic---if a new review contains "great," the entire product for the negative class becomes zero, regardless of all other words. This is precisely the problem that Laplace smoothing addresses.
10.2.5 Laplace Smoothing
A critical practical issue arises when a feature value never appears with a particular class in training data: $p(x_j \mid y) = 0$, which zeros out the entire product. Laplace smoothing (also called additive smoothing) solves this by adding a pseudocount $\alpha > 0$:
$$ \hat{p}(x_j = v \mid y) = \frac{\text{count}(x_j = v, y) + \alpha}{\text{count}(y) + \alpha \lvert V_j \rvert} $$
where $\lvert V_j \rvert$ is the number of possible values for feature $j$. Setting $\alpha = 1$ gives classic Laplace smoothing; $\alpha < 1$ gives Lidstone smoothing. From a Bayesian perspective, this corresponds to placing a symmetric Dirichlet prior on the multinomial parameters.
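Continuing the movie review example from Section 10.2.4, the following sketch computes smoothed word probabilities with $\alpha = 1$ over the 14-word vocabulary listed there, so "great" no longer zeros out the negative class.
# Laplace-smoothed estimates for the toy movie review example (alpha = 1)
pos_words = "great film loved it wonderful movie excellent acting".split()
neg_words = "terrible movie waste time bad film boring plot".split()
vocab = sorted(set(pos_words + neg_words))  # 14 distinct words
alpha = 1.0
def smoothed_prob(word: str, class_words: list[str]) -> float:
    """p(word | class) with additive smoothing over the shared vocabulary."""
    return (class_words.count(word) + alpha) / (len(class_words) + alpha * len(vocab))
print(f"p(great | +) = {smoothed_prob('great', pos_words):.3f}")  # (1+1)/(8+14)
print(f"p(great | -) = {smoothed_prob('great', neg_words):.3f}")  # (0+1)/(8+14)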
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Load text data
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball', 'talk.politics.guns']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)
# Vectorize
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
# Train Multinomial Naive Bayes with Laplace smoothing
clf = MultinomialNB(alpha=1.0) # alpha=1.0 is Laplace smoothing
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=categories))
10.3 Bayesian Linear Regression
In Chapter 6, we fit linear models by finding a single best-fit weight vector $\mathbf{w}$. Bayesian linear regression instead maintains a distribution over weight vectors, giving us uncertainty estimates for every prediction.
10.3.1 The Model
We assume:
$$ y = \mathbf{w}^\top \mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) $$
This defines the likelihood:
$$ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}) $$
We place a Gaussian prior on the weights:
$$ p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) $$
10.3.2 The Posterior (Closed Form)
Because the Gaussian prior is conjugate to the Gaussian likelihood, the posterior is also Gaussian:
$$ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) $$
where:
$$ \mathbf{S}_N = \left(\mathbf{S}_0^{-1} + \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X}\right)^{-1} $$
$$ \mathbf{m}_N = \mathbf{S}_N \left(\mathbf{S}_0^{-1} \mathbf{m}_0 + \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{y}\right) $$
10.3.3 Predictive Distribution
For a new input $\mathbf{x}_*$, the predictive distribution is also Gaussian:
$$ p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(y_* \mid \mathbf{m}_N^\top \mathbf{x}_*, \sigma_*^2) $$
where the predictive variance is:
$$ \sigma_*^2 = \sigma^2 + \mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_* $$
This variance has two components: the aleatoric uncertainty $\sigma^2$ (irreducible noise in the data) and the epistemic uncertainty $\mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_*$ (uncertainty about the model parameters, which shrinks as we see more data).
import numpy as np
from typing import Tuple
def bayesian_linear_regression(
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
sigma_noise: float = 1.0,
prior_mean: np.ndarray | None = None,
prior_cov: np.ndarray | None = None,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
"""Perform Bayesian linear regression with Gaussian prior.
Args:
X_train: Training features, shape (n_train, d).
y_train: Training targets, shape (n_train,).
X_test: Test features, shape (n_test, d).
sigma_noise: Known noise standard deviation.
prior_mean: Prior mean for weights, shape (d,). Defaults to zeros.
prior_cov: Prior covariance for weights, shape (d, d).
Defaults to identity.
Returns:
Tuple of (posterior_mean, posterior_cov, pred_mean, pred_std).
"""
n, d = X_train.shape
sigma2 = sigma_noise ** 2
# Default prior: zero mean, identity covariance
if prior_mean is None:
prior_mean = np.zeros(d)
if prior_cov is None:
prior_cov = np.eye(d)
S0_inv = np.linalg.inv(prior_cov)
SN = np.linalg.inv(S0_inv + (1 / sigma2) * X_train.T @ X_train)
mN = SN @ (S0_inv @ prior_mean + (1 / sigma2) * X_train.T @ y_train)
# Predictive distribution
pred_mean = X_test @ mN
pred_var = sigma2 + np.sum(X_test @ SN * X_test, axis=1)
pred_std = np.sqrt(pred_var)
return mN, SN, pred_mean, pred_std
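A minimal usage sketch on synthetic data follows; the true weights, noise level, and input ranges are invented for illustration.
rng = np.random.default_rng(0)
X_train = np.column_stack([np.ones(30), rng.uniform(-3, 3, 30)])  # bias + one feature
w_true = np.array([1.0, 2.0])
y_train = X_train @ w_true + rng.normal(0, 1.0, 30)
X_test = np.column_stack([np.ones(5), np.linspace(-5, 5, 5)])
mN, SN, pred_mean, pred_std = bayesian_linear_regression(
    X_train, y_train, X_test, sigma_noise=1.0
)
print("Posterior mean weights:", np.round(mN, 2))
print("Predictive std (wider far from the data):", np.round(pred_std, 2))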
10.3.4 Connection to Ridge Regression
Recall from Chapter 8 that ridge regression minimizes:
$$ \mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2 $$
The ridge solution $\mathbf{w}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$ is identical to the posterior mean $\mathbf{m}_N$ when $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = (\sigma^2/\lambda)\mathbf{I}$. Ridge regression is Bayesian linear regression in disguise -- the regularization parameter $\lambda$ controls the precision (inverse variance) of the prior.
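A quick numerical check of this equivalence, assuming $\sigma^2 = 1$ and $\lambda = 2$ (so $\mathbf{S}_0 = (\sigma^2/\lambda)\mathbf{I}$) on simulated data:
import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0, 1.0, 40)
sigma2, lam = 1.0, 2.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# Posterior mean with m0 = 0, S0 = (sigma^2 / lambda) I
S0_inv = (lam / sigma2) * np.eye(3)
SN = np.linalg.inv(S0_inv + X.T @ X / sigma2)
mN = SN @ (X.T @ y / sigma2)
print(np.allclose(w_ridge, mN))  # True: ridge estimate equals the posterior mean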
10.4 Prior Selection: The Art and Science
Choosing the prior is often the most debated aspect of Bayesian analysis. Critics call it "subjective." Practitioners call it "encoding domain knowledge." Both are right.
10.4.1 Types of Priors
Informative priors encode specific domain knowledge. If prior studies suggest a drug effect is around 5 mg/dL with standard deviation 2, we might use $\mathcal{N}(5, 4)$.
Weakly informative priors constrain parameters to reasonable ranges without being too specific. For example, placing a $\mathcal{N}(0, 10^2)$ prior on regression coefficients when we expect them to be single-digit values.
Non-informative priors (also called "vague" or "diffuse") attempt to let the data speak for itself. Examples include uniform priors over a wide range or Jeffreys' prior $p(\theta) \propto \sqrt{I(\theta)}$ where $I(\theta)$ is the Fisher information.
Hierarchical priors place priors on the hyperparameters themselves, letting the data inform the prior structure. This is the foundation of empirical Bayes and hierarchical Bayesian models.
10.4.2 Prior Sensitivity Analysis
Responsible Bayesian analysis always includes a prior sensitivity analysis: How much do the conclusions change under different reasonable priors?
import numpy as np
from scipy import stats
def prior_sensitivity_analysis(
data: np.ndarray,
prior_params_list: list[dict],
param_grid: np.ndarray,
) -> dict:
"""Analyze posterior sensitivity to different prior choices.
Args:
data: Observed data array.
prior_params_list: List of dicts with 'name', 'alpha', 'beta' keys
defining Beta priors to compare.
param_grid: Grid of parameter values for evaluation.
Returns:
Dictionary mapping prior names to posterior distributions.
"""
n = len(data)
successes = data.sum()
failures = n - successes
results = {}
for prior_params in prior_params_list:
name = prior_params["name"]
alpha_prior = prior_params["alpha"]
beta_prior = prior_params["beta"]
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
posterior = stats.beta(alpha_post, beta_post)
results[name] = {
"posterior_mean": posterior.mean(),
"posterior_std": posterior.std(),
"credible_interval_95": (
posterior.ppf(0.025),
posterior.ppf(0.975),
),
"pdf": posterior.pdf(param_grid),
}
return results
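For example, we might compare a uniform, an optimistic, and a skeptical Beta prior on the same simulated binary outcomes; the prior parameters below are illustrative.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=40)  # 40 Bernoulli trials
grid = np.linspace(0, 1, 200)
priors = [
    {"name": "uniform", "alpha": 1, "beta": 1},
    {"name": "optimistic", "alpha": 8, "beta": 2},
    {"name": "skeptical", "alpha": 2, "beta": 8},
]
results = prior_sensitivity_analysis(data, priors, grid)
for name, summary in results.items():
    low, high = summary["credible_interval_95"]
    print(f"{name:>10}: mean={summary['posterior_mean']:.3f}, "
          f"95% CI=[{low:.3f}, {high:.3f}]")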
If the posteriors are similar under different reasonable priors, the analysis is robust. If they differ substantially, more data is needed, or the prior choice requires careful justification.
10.4.3 Guidelines for Prior Selection in Practice
- Start with domain knowledge: If experts say a parameter is "usually between 0 and 10," encode that. In medical studies, for instance, treatment effects are rarely larger than a few standard deviations, so a $\mathcal{N}(0, 5^2)$ prior on a standardized effect size is reasonable.
- Use weakly informative priors as a default: They prevent pathological behavior without imposing strong assumptions. The Stan development team recommends $\mathcal{N}(0, s^2)$ where $s$ is chosen based on the expected scale of the parameter.
- Simulate from the prior: Before fitting, sample parameters from the prior and generate fake data. Does the fake data look plausible? This is called a prior predictive check. For example, if you are modeling house prices and your prior allows the model to predict negative prices or prices exceeding \$1 billion, the prior is too vague.
- Check sensitivity: Always report how results change under alternative reasonable priors. If the posterior is dominated by the prior (not enough data), or if different reasonable priors lead to contradictory conclusions, your analysis needs more data or a more carefully justified prior.
- Prefer proper priors: Improper priors (those that don't integrate to 1) can lead to improper posteriors and computational difficulties. In high-dimensional models, improper priors are particularly dangerous.
10.5 Conjugate Priors
A prior $p(\theta)$ is conjugate to a likelihood $p(\mathcal{D} \mid \theta)$ if the posterior $p(\theta \mid \mathcal{D})$ belongs to the same family as the prior. Conjugacy enables closed-form posterior computation -- no sampling or optimization required.
10.5.1 Common Conjugate Pairs
| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli/Binomial | Beta($\alpha$, $\beta$) | Beta($\alpha + h$, $\beta + t$) | Coin flips, click rates |
| Multinomial | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{n}$) | Categorical data, topic models |
| Poisson | Gamma($\alpha$, $\beta$) | Gamma($\alpha + \sum x_i$, $\beta + n$) | Count data, event rates |
| Normal (known $\sigma^2$) | Normal($\mu_0$, $\sigma_0^2$) | Normal($\mu_N$, $\sigma_N^2$) | Continuous measurements |
| Normal (known $\mu$) | Inverse-Gamma($\alpha$, $\beta$) | Inverse-Gamma($\alpha'$, $\beta'$) | Variance estimation |
| Exponential | Gamma($\alpha$, $\beta$) | Gamma($\alpha + n$, $\beta + \sum x_i$) | Waiting times |
10.5.2 The Beta-Binomial Model in Detail
The Beta-Binomial is the workhorse of Bayesian A/B testing. Given a Beta($\alpha$, $\beta$) prior and $n$ trials with $k$ successes:
Posterior: Beta($\alpha + k$, $\beta + n - k$)
Posterior mean: $\frac{\alpha + k}{\alpha + \beta + n}$ -- a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{k}{n}$.
Posterior variance: $\frac{(\alpha + k)(\beta + n - k)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)}$ -- shrinks as $n$ grows.
The pseudocount interpretation is elegant: $\alpha$ and $\beta$ act as "virtual" observations. A Beta(2, 2) prior is equivalent to having already seen 2 successes and 2 failures before the experiment begins.
10.5.3 The Normal-Normal Model
When estimating the mean of a normal distribution with known variance $\sigma^2$, a $\mathcal{N}(\mu_0, \sigma_0^2)$ prior yields:
$$ p(\mu \mid \mathbf{x}) = \mathcal{N}\!\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \; \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\right) $$
This is most naturally expressed in terms of precisions (inverse variances): $\tau_0 = 1/\sigma_0^2$ and $\tau = n/\sigma^2$. The posterior precision is $\tau_0 + \tau$ -- precisions are additive, which is a beautiful property of the Gaussian family.
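A small sketch of this update in precision form; the prior settings and simulated measurements are illustrative.
import numpy as np
rng = np.random.default_rng(0)
sigma = 2.0                      # known observation std
x = rng.normal(5.0, sigma, 25)   # 25 measurements
mu0, sigma0 = 0.0, 10.0          # weak prior on the mean
tau0 = 1 / sigma0**2             # prior precision
tau_data = len(x) / sigma**2     # data precision (precisions add)
post_precision = tau0 + tau_data
post_mean = (tau0 * mu0 + tau_data * x.mean()) / post_precision
post_std = np.sqrt(1 / post_precision)
print(f"Posterior: N({post_mean:.3f}, {post_std:.3f}^2)")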
10.5.4 Worked Example: Bayesian A/B Testing
The Beta-Binomial model is the foundation of Bayesian A/B testing, which is increasingly used in tech companies as an alternative to frequentist hypothesis testing. Suppose we are comparing two website designs:
- Design A: 1,000 visitors, 120 conversions.
- Design B: 1,000 visitors, 145 conversions.
With uniform priors Beta(1, 1) on both conversion rates:
$$p(\theta_A \mid \text{data}) = \text{Beta}(121, 881), \quad p(\theta_B \mid \text{data}) = \text{Beta}(146, 856)$$
We want to answer: What is the probability that Design B is better than Design A? This is straightforward with sampling:
import numpy as np
from scipy import stats
# Draw posterior samples
n_samples = 100000
samples_A = stats.beta(121, 881).rvs(n_samples)
samples_B = stats.beta(146, 856).rvs(n_samples)
# P(B > A)
prob_B_better = (samples_B > samples_A).mean()
print(f"P(Design B > Design A) = {prob_B_better:.4f}")
# Expected lift
lift = ((samples_B - samples_A) / samples_A).mean()
print(f"Expected lift: {lift:.2%}")
# Risk: expected loss if we choose B but A is actually better
risk_B = np.maximum(samples_A - samples_B, 0).mean()
print(f"Risk of choosing B: {risk_B:.4f}")
Typical output: $P(B > A) \approx 0.95$, expected lift $\approx 21\%$, risk $\approx 0.0003$. This tells us that Design B is very likely better, with an expected relative improvement of about 21% (roughly 2.5 percentage points in conversion rate), and the downside risk of choosing B (if A is actually better) is negligible.
This Bayesian approach has several advantages over the traditional frequentist t-test: (1) it directly answers the question "What is the probability B is better?" rather than the harder-to-interpret "Would we reject the null hypothesis?"; (2) it provides a decision-theoretic framework via expected loss; (3) it handles multiple comparisons more naturally through hierarchical priors.
10.5.5 When Conjugacy Breaks Down
Conjugate priors are convenient but limited. Most real-world models -- neural networks, hierarchical models with complex structure, mixture models -- do not admit conjugate priors. In these cases, we turn to the computational methods of Sections 10.6 and 10.7.
10.6 Markov Chain Monte Carlo (MCMC)
When the posterior lacks a closed form, we approximate it by drawing samples. MCMC constructs a Markov chain whose stationary distribution is the target posterior. After a "burn-in" period, the chain's samples are (approximately) draws from $p(\theta \mid \mathcal{D})$.
10.6.1 Why Sampling?
Given samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(S)}$ from the posterior, we can approximate any quantity of interest:
$$ \mathbb{E}[f(\theta) \mid \mathcal{D}] \approx \frac{1}{S} \sum_{s=1}^{S} f(\theta^{(s)}) $$
This includes posterior means ($f(\theta) = \theta$), variances ($f(\theta) = (\theta - \bar{\theta})^2$), credible intervals (quantiles of the samples), and predictive distributions ($f(\theta) = p(y_* \mid \mathbf{x}_*, \theta)$).
10.6.2 The Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is the foundational MCMC method. It works with any target distribution we can evaluate up to a normalizing constant -- exactly the situation with posteriors, where $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) p(\theta)$.
Algorithm:
- Initialize $\theta^{(0)}$
- For $t = 1, 2, \ldots, T$:
  a. Propose $\theta' \sim q(\theta' \mid \theta^{(t-1)})$ from a proposal distribution.
  b. Compute the acceptance ratio: $\alpha = \min\!\left(1, \frac{p(\theta' \mid \mathcal{D}) \, q(\theta^{(t-1)} \mid \theta')}{p(\theta^{(t-1)} \mid \mathcal{D}) \, q(\theta' \mid \theta^{(t-1)})}\right)$
  c. Accept with probability $\alpha$: set $\theta^{(t)} = \theta'$; otherwise set $\theta^{(t)} = \theta^{(t-1)}$.
When the proposal distribution is symmetric -- $q(\theta' \mid \theta) = q(\theta \mid \theta')$ -- this simplifies to the Metropolis algorithm, and the acceptance ratio reduces to $\alpha = \min(1, p(\theta' \mid \mathcal{D}) / p(\theta^{(t-1)} \mid \mathcal{D}))$.
import numpy as np
from typing import Callable
def metropolis_hastings(
log_posterior: Callable[[np.ndarray], float],
initial: np.ndarray,
proposal_std: float,
n_samples: int,
burn_in: int = 1000,
seed: int = 42,
) -> np.ndarray:
"""Run Metropolis-Hastings MCMC sampling.
Args:
log_posterior: Function computing log posterior (up to a constant).
initial: Initial parameter values, shape (d,).
proposal_std: Standard deviation of Gaussian proposal.
n_samples: Number of samples to draw (after burn-in).
burn_in: Number of initial samples to discard.
seed: Random seed for reproducibility.
Returns:
Array of posterior samples, shape (n_samples, d).
"""
rng = np.random.default_rng(seed)
d = len(initial)
total = n_samples + burn_in
samples = np.zeros((total, d))
current = initial.copy()
current_log_p = log_posterior(current)
n_accepted = 0
for t in range(total):
# Symmetric Gaussian proposal
proposal = current + rng.normal(0, proposal_std, size=d)
proposal_log_p = log_posterior(proposal)
# Log acceptance ratio (symmetric proposal simplifies)
log_alpha = proposal_log_p - current_log_p
if np.log(rng.uniform()) < log_alpha:
current = proposal
current_log_p = proposal_log_p
n_accepted += 1
samples[t] = current
acceptance_rate = n_accepted / total
print(f"Acceptance rate: {acceptance_rate:.3f}")
return samples[burn_in:]
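As a usage sketch, we can sample the coin-bias posterior from Section 10.1.2 (7 heads, 3 tails under a uniform prior), working in log space and rejecting proposals outside $(0, 1)$.
def log_posterior_coin(theta: np.ndarray) -> float:
    """Log posterior for a coin bias: Binomial likelihood, uniform prior."""
    t = theta[0]
    if t <= 0 or t >= 1:
        return -np.inf  # outside the support
    return 7 * np.log(t) + 3 * np.log(1 - t)
samples = metropolis_hastings(
    log_posterior_coin, initial=np.array([0.5]),
    proposal_std=0.2, n_samples=5000,
)
print(f"Posterior mean: {samples.mean():.3f}")  # close to 8/12 = 0.667
print(f"95% credible interval: "
      f"[{np.percentile(samples, 2.5):.3f}, {np.percentile(samples, 97.5):.3f}]")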
10.6.3 Tuning the Proposal Distribution
The proposal standard deviation is a critical hyperparameter:
- Too small: The chain takes tiny steps, exploring slowly (high acceptance rate but high autocorrelation).
- Too large: Most proposals land in low-probability regions and are rejected (low acceptance rate).
- Just right: For Gaussian targets, the optimal acceptance rate is approximately 23.4% in high dimensions (Roberts et al., 1997).
In practice, adaptive schemes adjust the proposal during burn-in to achieve target acceptance rates.
10.6.4 Diagnosing Convergence
MCMC samples are only useful if the chain has converged to the target distribution. Key diagnostics include:
- Trace plots: Plot samples vs. iteration number. A well-mixing chain should look like "white noise" around a stable mean, not show trends or long excursions.
- Autocorrelation: Compute the autocorrelation function (ACF). Rapid decay indicates good mixing.
- $\hat{R}$ (Gelman-Rubin statistic): Run multiple chains from different starting points. $\hat{R}$ compares within-chain and between-chain variance. Values close to 1.0 (below 1.01) indicate convergence.
- Effective sample size (ESS): Accounts for autocorrelation. If you draw 10,000 samples but the ESS is 500, you have roughly 500 independent samples. Aim for ESS > 400 for reliable inference.
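As a sketch of the multi-chain idea, the following computes a simple (non-split) Gelman-Rubin statistic from several chains of synthetic draws. Production code should rely on a library such as ArviZ, which implements the rank-normalized split-$\hat{R}$ and ESS.
import numpy as np
def gelman_rubin(chains: np.ndarray) -> float:
    """Simple (non-split) R-hat for chains of shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)
rng = np.random.default_rng(0)
# Four chains from the same target mix well -> R-hat near 1
good = rng.normal(0, 1, size=(4, 2000))
# One chain stuck in a different region -> R-hat well above 1
bad = good.copy()
bad[0] += 3.0
print(f"R-hat (mixed): {gelman_rubin(good):.3f}")
print(f"R-hat (stuck chain): {gelman_rubin(bad):.3f}")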
10.6.5 Beyond Metropolis-Hastings
Modern MCMC methods improve on basic MH:
Gibbs Sampling samples each parameter conditionally on the others. For a parameter vector $\theta = (\theta_1, \ldots, \theta_d)$, each iteration cycles through:
$$\theta_j^{(t+1)} \sim p(\theta_j \mid \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)}, \mathcal{D})$$
Gibbs sampling requires no proposal tuning and has acceptance probability 1 (every sample is accepted). However, it requires that the full conditional distributions $p(\theta_j \mid \theta_{-j}, \mathcal{D})$ are tractable---a condition satisfied by conjugate models and many exponential family models. For models where conditionals are not available in closed form, Gibbs sampling is not applicable.
Hamiltonian Monte Carlo (HMC) uses gradient information to make large, efficient moves through parameter space. It treats the parameter space as a physical system where $\theta$ represents position and an auxiliary variable $\mathbf{p}$ represents momentum. The Hamiltonian $H(\theta, \mathbf{p}) = -\log p(\theta \mid \mathcal{D}) + \frac{1}{2}\mathbf{p}^\top \mathbf{M}^{-1} \mathbf{p}$ governs the dynamics. The algorithm simulates the trajectory of a particle along the posterior surface, proposing distant points that nevertheless have high acceptance probability.
The intuition is that gradient information tells the sampler which direction the posterior is increasing, allowing it to "roll uphill" toward high-probability regions rather than randomly wandering. This dramatically reduces the autocorrelation between samples, yielding many more effective samples per unit of computation.
No-U-Turn Sampler (NUTS) is an adaptive variant of HMC that automatically tunes the trajectory length. Standard HMC requires the user to set the number of leapfrog steps---too few and the sampler explores slowly, too many and it wastes computation or loops back on itself. NUTS detects when the trajectory begins to turn back (the "U-turn" condition) and stops automatically. This eliminates the most difficult tuning parameter of HMC and makes it practical for general use. NUTS is the default sampler in both Stan and PyMC.
10.7 Variational Inference
MCMC provides asymptotically exact samples but can be slow for large datasets or complex models. Variational inference (VI) trades exactness for speed by casting inference as an optimization problem.
10.7.1 The Core Idea
Instead of sampling from the true posterior $p(\theta \mid \mathcal{D})$, we find the closest distribution $q(\theta)$ within a tractable family $\mathcal{Q}$:
$$ q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\theta) \| p(\theta \mid \mathcal{D})) $$
Since the KL divergence involves the unknown posterior, we instead maximize the Evidence Lower Bound (ELBO):
$$ \text{ELBO}(q) = \mathbb{E}_{q}[\log p(\mathcal{D}, \theta)] - \mathbb{E}_{q}[\log q(\theta)] $$
This works because $\log p(\mathcal{D}) = \text{ELBO}(q) + \text{KL}(q \| p)$, so maximizing the ELBO is equivalent to minimizing the KL divergence (since $\log p(\mathcal{D})$ is constant with respect to $q$).
10.7.2 Mean-Field Variational Inference
The most common choice for $\mathcal{Q}$ is the mean-field family, where $q(\theta)$ fully factorizes:
$$ q(\theta) = \prod_{j=1}^{d} q_j(\theta_j) $$
Each factor $q_j$ is optimized in turn while holding the others fixed, yielding a coordinate ascent algorithm. For exponential family models, the optimal $q_j^*$ has a known form:
$$ \log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(\mathcal{D}, \theta)] + \text{const} $$
where $q_{-j}$ denotes all factors except $q_j$.
10.7.3 Deriving the ELBO
The Evidence Lower Bound (ELBO) deserves a careful derivation because it underpins all variational methods, including those used in modern generative models like variational autoencoders (which we will study in Chapter 16).
We start with the log marginal likelihood (log evidence):
$$\log p(\mathcal{D}) = \log \int p(\mathcal{D}, \theta) \, d\theta$$
For any distribution $q(\theta)$, we can write:
$$\log p(\mathcal{D}) = \log \int \frac{p(\mathcal{D}, \theta)}{q(\theta)} q(\theta) \, d\theta$$
By Jensen's inequality (since $\log$ is concave):
$$\log p(\mathcal{D}) \geq \int q(\theta) \log \frac{p(\mathcal{D}, \theta)}{q(\theta)} \, d\theta = \text{ELBO}(q)$$
The gap between $\log p(\mathcal{D})$ and the ELBO is exactly the KL divergence:
$$\log p(\mathcal{D}) = \text{ELBO}(q) + \text{KL}(q(\theta) \| p(\theta \mid \mathcal{D}))$$
Since $\text{KL} \geq 0$, the ELBO is indeed a lower bound on the log evidence. When $q(\theta) = p(\theta \mid \mathcal{D})$, the KL divergence is zero and the bound is tight.
We can decompose the ELBO in a revealing way:
$$\text{ELBO}(q) = \underbrace{\mathbb{E}_q[\log p(\mathcal{D} \mid \theta)]}_{\text{Expected log-likelihood}} - \underbrace{\text{KL}(q(\theta) \| p(\theta))}_{\text{KL from prior}}$$
The first term encourages the variational distribution to concentrate on parameters that explain the data well. The second term penalizes the variational distribution for deviating from the prior. This mirrors the regularization interpretation of Bayesian inference: the ELBO naturally balances data fit against complexity, just as the regularized loss in ridge regression (as we saw in Section 10.3.4) balances the squared error against the weight penalty.
Worked Example: For a univariate Gaussian model with known variance $\sigma^2 = 1$, data $\mathcal{D} = \{x_1, \ldots, x_n\}$, prior $\mu \sim \mathcal{N}(0, \sigma_0^2)$, and variational distribution $q(\mu) = \mathcal{N}(m, s^2)$:
$$\text{ELBO}(m, s^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n \left[(x_i - m)^2 + s^2\right] - \frac{1}{2}\left[\frac{m^2 + s^2}{\sigma_0^2} - \log\frac{s^2}{\sigma_0^2} - 1\right]$$
Taking derivatives with respect to $m$ and $s^2$ and setting them to zero yields the optimal variational parameters, which in this conjugate case recover the exact posterior.
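A numerical version of this worked example maximizes the ELBO over $(m, \log s)$ with a generic optimizer and compares the result against the exact conjugate posterior; the data and prior variance below are simulated for illustration.
import numpy as np
from scipy import optimize
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 30)     # data with known sigma^2 = 1
n, sigma0_sq = len(x), 4.0       # prior mu ~ N(0, 4)
def neg_elbo(params: np.ndarray) -> float:
    m, log_s = params
    s2 = np.exp(2 * log_s)
    exp_loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - m) ** 2 + s2)
    kl = 0.5 * ((m**2 + s2) / sigma0_sq - np.log(s2 / sigma0_sq) - 1)
    return -(exp_loglik - kl)
res = optimize.minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_opt, s_opt = res.x[0], np.exp(res.x[1])
# Exact conjugate posterior for comparison
post_var = 1 / (1 / sigma0_sq + n)
post_mean = post_var * x.sum()
print(f"Variational: m={m_opt:.4f}, s={s_opt:.4f}")
print(f"Exact:       m={post_mean:.4f}, s={np.sqrt(post_var):.4f}")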
10.7.4 Stochastic Variational Inference (SVI)
For large datasets, even computing the ELBO over all data is expensive. Stochastic VI uses mini-batches and stochastic gradient ascent:
- Sample a mini-batch of data $\mathcal{B} \subset \mathcal{D}$ of size $B$.
- Compute a noisy estimate of the ELBO gradient using the mini-batch, scaled by $n / B$ to account for the subsampling.
- Take a gradient step to update the variational parameters.
The key insight is that the expected log-likelihood term in the ELBO decomposes as a sum over data points:
$$\mathbb{E}_q[\log p(\mathcal{D} \mid \theta)] = \sum_{i=1}^n \mathbb{E}_q[\log p(x_i \mid \theta)] \approx \frac{n}{B} \sum_{i \in \mathcal{B}} \mathbb{E}_q[\log p(x_i \mid \theta)]$$
This unbiased estimate enables Bayesian inference on datasets with millions of examples---a scale where MCMC is impractical.
Reparameterization trick: For gradient-based optimization with Gaussian variational distributions, the reparameterization trick expresses samples from $q(\theta) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ as $\theta = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This moves the randomness out of the parameters, enabling standard backpropagation through the ELBO. This same trick is central to variational autoencoders, which we will encounter in Chapter 16.
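A tiny NumPy illustration of the trick: estimating $\nabla_\mu \, \mathbb{E}_{\theta \sim \mathcal{N}(\mu, \sigma^2)}[\theta^2]$ (whose exact value is $2\mu$) by differentiating through the deterministic map $\theta = \mu + \sigma\epsilon$.
import numpy as np
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.normal(size=100_000)
theta = mu + sigma * eps          # reparameterized samples
# d/dmu of theta^2 is 2 * theta * (d theta / d mu) = 2 * theta * 1
grad_estimate = (2 * theta).mean()
print(f"Estimated gradient: {grad_estimate:.3f}  (exact: {2 * mu:.3f})")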
10.7.5 MCMC vs. Variational Inference
| Aspect | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximate (typically underestimates variance) |
| Scalability | Challenging for large datasets | Scales well with SGD |
| Convergence | Difficult to diagnose | Tracked via ELBO |
| Multimodality | Can explore multiple modes (slowly) | Typically captures one mode |
| Implementation | Requires careful tuning | Standard optimization tools |
| When to use | Final analysis, small-medium data | Exploration, large data, deep models |
In practice, many teams use VI for rapid prototyping and model selection, then switch to MCMC for final inference when accuracy is paramount.
10.7.6 Amortized Inference
A powerful extension of variational inference is amortized inference, where the variational parameters are predicted by a neural network rather than optimized per data point. Instead of finding optimal $q(\theta)$ for each observation, we train an inference network $q_\phi(\mathbf{z} \mid \mathbf{x})$ that maps any input $\mathbf{x}$ directly to approximate posterior parameters.
The advantage is that once the inference network is trained, computing the approximate posterior for a new observation requires only a single forward pass---no optimization loop. This is the key innovation behind variational autoencoders (Chapter 16) and has enabled Bayesian inference at unprecedented scale.
The cost of amortization is an additional source of approximation error (the amortization gap): the inference network may not perfectly predict the optimal variational parameters for every input. However, this cost is usually outweighed by the dramatic speedup at inference time.
10.7.7 Normalizing Flows
Mean-field variational inference restricts $q(\theta)$ to factored distributions, which cannot capture posterior correlations. Normalizing flows overcome this limitation by transforming a simple base distribution through a sequence of invertible, differentiable transformations:
$$q(\theta) = q_0(\mathbf{z}_0) \prod_{k=1}^{K} \left| \det \frac{\partial \mathbf{f}_k}{\partial \mathbf{z}_{k-1}} \right|^{-1}$$
where $\mathbf{z}_K = \mathbf{f}_K \circ \cdots \circ \mathbf{f}_1(\mathbf{z}_0)$ and $\mathbf{z}_0 \sim q_0$ (typically a standard Gaussian). By choosing transformations with efficient Jacobian computation (e.g., masked autoregressive flows, real NVP), normalizing flows can represent highly complex, multimodal posterior distributions while remaining tractable for optimization.
10.8 Gaussian Processes
A Gaussian process (GP) is a distribution over functions, where any finite collection of function values has a joint Gaussian distribution. GPs provide a principled, nonparametric approach to regression with built-in uncertainty quantification.
10.8.1 Definition
A GP is fully specified by a mean function $m(\mathbf{x})$ and a covariance (kernel) function $k(\mathbf{x}, \mathbf{x}')$:
$$ f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) $$
For any finite set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$:
$$ \mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top \sim \mathcal{N}(\mathbf{m}, \mathbf{K}) $$
where $m_i = m(\mathbf{x}_i)$ and $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
10.8.2 Common Kernel Functions
The kernel encodes assumptions about the function's properties:
Radial Basis Function (RBF) / Squared Exponential:
$$ k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right) $$
Produces infinitely smooth functions. $\ell$ is the length scale (how far apart inputs must be before outputs become uncorrelated) and $\sigma_f^2$ is the signal variance.
Matern Kernel:
$$ k_{\nu}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^\nu K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right) $$
where $r = \|\mathbf{x} - \mathbf{x}'\|$ and $K_\nu$ is a modified Bessel function. The smoothness parameter $\nu$ controls differentiability: $\nu = 1/2$ gives the Ornstein-Uhlenbeck process (rough), $\nu = 3/2$ is once-differentiable, $\nu = 5/2$ is twice-differentiable, and $\nu \to \infty$ recovers the RBF kernel.
Periodic Kernel:
$$ k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right) $$
Models periodic functions with period $p$.
Kernels can be combined by addition (modeling independent effects) or multiplication (modeling interactions), creating rich hypothesis spaces.
10.8.3 GP Regression
Given training data $(\mathbf{X}, \mathbf{y})$ with noise $y_i = f(\mathbf{x}_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the predictive distribution at test points $\mathbf{X}_*$ is:
$$ p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) $$
where:
$$ \boldsymbol{\mu}_* = \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} $$
$$ \boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_* $$
Here $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$ is the training kernel matrix, $\mathbf{K}_* = k(\mathbf{X}, \mathbf{X}_*)$ is the cross-kernel, and $\mathbf{K}_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel
# Synthetic 1-D data: a noisy sine function (illustrative)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(30, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.2, 30)
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
# Define kernel: RBF + noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
# Fit GP (kernel hyperparameters tuned by maximizing the log marginal likelihood)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X_train, y_train)
# Predict with uncertainty
y_pred, y_std = gp.predict(X_test, return_std=True)
10.8.4 Hyperparameter Optimization
GP kernels have hyperparameters (length scales, signal variance, noise variance) that are typically optimized by maximizing the log marginal likelihood:
$$ \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_y| - \frac{n}{2}\log(2\pi) $$
where $\mathbf{K}_y = \mathbf{K} + \sigma_n^2\mathbf{I}$ and $\boldsymbol{\psi}$ denotes all kernel hyperparameters. This objective naturally balances data fit (first term) against model complexity (second term), providing an automatic Occam's razor.
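A sketch of evaluating this objective directly with NumPy for an RBF kernel on simulated 1-D data; as in practice, a Cholesky factorization is used for numerical stability.
import numpy as np
def rbf_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)
def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    n = len(X)
    Ky = rbf_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # equals 0.5 * log|Ky|
            - 0.5 * n * np.log(2 * np.pi))
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 40))
y = np.sin(X) + rng.normal(0, 0.2, 40)
for ell in [0.1, 1.0, 10.0]:
    print(f"length_scale={ell:5.1f}: log p(y|X) = "
          f"{log_marginal_likelihood(X, y, ell, 1.0, 0.04):.2f}")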
10.8.5 Computational Considerations
The main bottleneck is inverting the $n \times n$ kernel matrix, which costs $\mathcal{O}(n^3)$ in time and $\mathcal{O}(n^2)$ in memory. For datasets beyond a few thousand points, exact GP inference becomes impractical. Approximations include:
- Sparse GPs (inducing points): Select $m \ll n$ inducing points and approximate the full GP. Reduces cost to $\mathcal{O}(nm^2)$.
- Random Fourier features: Approximate the kernel with random feature expansions, converting the GP to a Bayesian linear model.
- Structured kernel interpolation (SKI/KISS-GP): Exploit grid structure for fast matrix-vector multiplications.
10.8.6 Connection to Bayesian Optimization
Gaussian processes are the engine behind Bayesian optimization, a powerful method for optimizing expensive black-box functions. As we discussed in Chapter 8 (Section 8.7.4), hyperparameter tuning can be framed as optimizing a function $f(\boldsymbol{\lambda})$ (e.g., validation accuracy as a function of hyperparameters $\boldsymbol{\lambda}$) that is expensive to evaluate, has no closed-form expression, and provides no gradient information.
Bayesian optimization uses a GP surrogate model to approximate $f$ and an acquisition function to decide which configuration to evaluate next. The most common acquisition functions are:
Expected Improvement (EI): The expected amount by which a new evaluation will improve over the current best:
$$\text{EI}(\boldsymbol{\lambda}) = \mathbb{E}\left[\max(f(\boldsymbol{\lambda}) - f^+, 0)\right]$$
where $f^+$ is the best observed value so far. Under the GP posterior, this has a closed-form expression:
$$\text{EI}(\boldsymbol{\lambda}) = (m(\boldsymbol{\lambda}) - f^+ - \xi) \Phi(Z) + s(\boldsymbol{\lambda}) \phi(Z)$$
where $Z = \frac{m(\boldsymbol{\lambda}) - f^+ - \xi}{s(\boldsymbol{\lambda})}$, $m$ and $s$ are the GP predictive mean and standard deviation, and $\Phi$ and $\phi$ are the standard normal CDF and PDF. The parameter $\xi \geq 0$ controls the exploration-exploitation tradeoff.
Upper Confidence Bound (UCB): Selects the point with the highest optimistic estimate:
$$\text{UCB}(\boldsymbol{\lambda}) = m(\boldsymbol{\lambda}) + \kappa \cdot s(\boldsymbol{\lambda})$$
where $\kappa > 0$ controls exploration (higher $\kappa$ = more exploration). UCB is intuitive: it favors points where the predicted value is high ($m$ term) or the uncertainty is large ($s$ term).
The Bayesian optimization loop proceeds as follows:
- Evaluate $f$ at a few initial points (e.g., 5-10 random configurations).
- Fit a GP to all observed (configuration, performance) pairs.
- Maximize the acquisition function to select the next configuration $\boldsymbol{\lambda}^*$.
- Evaluate $f(\boldsymbol{\lambda}^*)$ and add the result to the dataset.
- Repeat from step 2 until the budget is exhausted.
# Conceptual Bayesian optimization with scikit-optimize
# (assumes training data X, y are already defined)
# from skopt import gp_minimize
# from skopt.space import Real, Integer
# from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.model_selection import cross_val_score
#
# space = [
# Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
# Integer(50, 500, name="n_estimators"),
# Integer(2, 10, name="max_depth"),
# ]
#
# def objective(params):
# lr, n_est, depth = params
# model = GradientBoostingClassifier(
# learning_rate=lr, n_estimators=n_est, max_depth=depth
# )
# score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
# return -score # Minimize negative score
#
# result = gp_minimize(objective, space, n_calls=50, random_state=42)
Bayesian optimization typically finds good hyperparameters in 20--50 evaluations, compared to hundreds for random search. This sample efficiency makes it the method of choice when each evaluation is expensive (e.g., training a large neural network for hours).
10.9 Bayesian Neural Networks
The deep learning revolution (which we will explore in Chapters 11--14) is built on point estimates---neural networks are typically trained to find a single set of weights via maximum likelihood or MAP estimation. Bayesian neural networks (BNNs) maintain a distribution over weights, providing uncertainty estimates for every prediction.
10.9.1 Why Bayesian Neural Networks?
Standard neural networks are overconfident: they produce high-confidence predictions even for inputs far from the training data. A classifier trained on cats and dogs may confidently predict "cat" when shown a car. BNNs address this by propagating weight uncertainty through the network to produce calibrated predictive distributions.
For a neural network with weights $\mathbf{w}$, the predictive distribution for a new input $\mathbf{x}_*$ is:
$$p(y_* \mid \mathbf{x}_*, \mathcal{D}) = \int p(y_* \mid \mathbf{x}_*, \mathbf{w}) \, p(\mathbf{w} \mid \mathcal{D}) \, d\mathbf{w}$$
This integral averages predictions over all plausible weight configurations, weighted by their posterior probability. In regions where the data constrains the weights tightly, the predictions will be confident. In regions far from the training data, the weights are uncertain, producing wide predictive distributions.
10.9.2 Approximate Inference for BNNs
Exact posterior inference for neural networks is intractable---the posterior over millions of weights has no closed form. Practical approaches include:
MC Dropout: Gal and Ghahramani (2016) showed that dropout (a standard regularization technique) at test time approximates variational inference. By running the network $T$ times with dropout enabled, we obtain $T$ different predictions whose spread estimates epistemic uncertainty:
$$\hat{y}_* = \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_*; \hat{\mathbf{w}}_t)$$
$$\text{Var}(y_*) \approx \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_*; \hat{\mathbf{w}}_t)^2 - \hat{y}_*^2$$
where $\hat{\mathbf{w}}_t$ are the weights with different dropout masks applied.
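A minimal NumPy sketch of this procedure for a small, randomly initialized two-layer network; the architecture, weights, and dropout rate are placeholders, and in practice one would use a trained network in a deep learning framework with dropout left active at test time.
import numpy as np
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (1, 32)), np.zeros(32)   # untrained toy weights
W2, b2 = rng.normal(0, 0.5, (32, 1)), np.zeros(1)
drop_rate = 0.2
def forward_with_dropout(x: np.ndarray) -> np.ndarray:
    """One stochastic forward pass with a fresh dropout mask on the hidden layer."""
    h = np.maximum(x @ W1 + b1, 0)                    # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate            # keep units with prob 1 - p
    h = h * mask / (1 - drop_rate)                    # inverted dropout scaling
    return h @ W2 + b2
x_star = np.array([[0.5]])
T = 100
preds = np.array([forward_with_dropout(x_star).item() for _ in range(T)])
print(f"Predictive mean: {preds.mean():.3f}")
print(f"Epistemic std (spread across dropout masks): {preds.std():.3f}")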
Bayes by Backprop: Blundell et al. (2015) proposed parameterizing $q(\mathbf{w}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and optimizing the ELBO using the reparameterization trick (Section 10.7.4). Each weight has a mean $\mu$ and variance $\sigma^2$ that are learned via backpropagation.
Deep ensembles: While not formally Bayesian, training $M$ networks independently with different random initializations produces a set of predictions whose disagreement captures epistemic uncertainty. Lakshminarayanan et al. (2017) showed that deep ensembles provide competitive uncertainty estimates and are simpler to implement than full BNN methods.
10.9.3 When to Use BNNs
BNNs are most valuable in safety-critical applications where knowing what the model does not know is as important as the prediction itself:
- Medical diagnosis: A model that says "I am uncertain about this case---refer to a specialist" is more valuable than one that makes a confident but wrong diagnosis.
- Autonomous driving: The vehicle should slow down or request human intervention when its perception model is uncertain.
- Active learning: Use predictive uncertainty to select the most informative examples for labeling, reducing annotation costs.
The computational overhead of BNNs (typically 2--10x the cost of point estimation, depending on the method) is the main barrier to wider adoption.
10.10 Probabilistic Programming
Probabilistic programming languages (PPLs) automate much of the Bayesian workflow. You specify the model, and the PPL handles inference.
10.10.1 The Abstraction
A probabilistic program defines a joint distribution over observed and latent variables. The PPL provides:
- Model specification: A concise syntax for defining priors, likelihoods, and dependencies.
- Automatic inference: MCMC, VI, or other algorithms with minimal user intervention.
- Diagnostics: Convergence checks, posterior summaries, and visualization.
10.10.2 The Python Ecosystem
PyMC (formerly PyMC3): A mature, user-friendly PPL built on Aesara/PyTensor. Uses NUTS by default.
# PyMC example (conceptual -- requires PyMC installation)
import pymc as pm
with pm.Model() as linear_model:
# Priors
alpha = pm.Normal("alpha", mu=0, sigma=10)
beta = pm.Normal("beta", mu=0, sigma=10)
sigma = pm.HalfNormal("sigma", sigma=1)
# Likelihood
mu = alpha + beta * X_data
y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_data)
# Inference
trace = pm.sample(2000, return_inferencedata=True)
NumPyro: A lightweight PPL built on JAX, offering GPU-accelerated inference and composable transformations. Excellent for research and high-performance applications.
Stan (via CmdStanPy/PyStan): The gold standard for statistical modeling. Implements state-of-the-art HMC/NUTS. Strong diagnostics and a rigorous community.
TensorFlow Probability: Integrates with TensorFlow for deep probabilistic models. Supports VI and MCMC.
10.10.3 A Practical PyMC Example: Hierarchical Model
To illustrate the power of probabilistic programming, consider a hierarchical model for estimating the conversion rates of multiple marketing campaigns simultaneously. Each campaign $j$ has its own conversion rate $\theta_j$, but we assume they share a common group-level distribution:
# Hierarchical model for marketing campaign conversion rates
import pymc as pm
import numpy as np
# Observed data: trials and successes for 8 campaigns
trials = np.array([100, 250, 50, 300, 80, 150, 200, 120])
successes = np.array([12, 35, 8, 52, 14, 28, 40, 20])
with pm.Model() as hierarchical_model:
# Hyperpriors: group-level parameters
alpha = pm.HalfNormal("alpha", sigma=5)
beta = pm.HalfNormal("beta", sigma=5)
# Campaign-level conversion rates (shared prior)
theta = pm.Beta("theta", alpha=alpha, beta=beta, shape=len(trials))
# Likelihood
obs = pm.Binomial("obs", n=trials, p=theta, observed=successes)
# Inference
trace = pm.sample(2000, tune=1000, chains=4, random_seed=42)
The hierarchical structure enables shrinkage: campaigns with few observations are pulled toward the group mean, while campaigns with many observations are estimated closer to their observed rate. This is especially valuable when some campaigns have very few data points. Compare this to the non-hierarchical alternative where each campaign's rate is estimated independently --- with only 50 trials (campaign 3), the MLE of $8/50 = 0.16$ is highly uncertain, but the hierarchical model borrows strength from the other campaigns to produce a more stable estimate.
10.10.4 When to Use a PPL
Use a PPL when:
- The model has complex structure (hierarchical, mixture, state-space)
- You need full posterior inference, not just point estimates
- The model changes frequently during development
- You want automatic gradient computation for HMC
Stick with closed-form solutions or scikit-learn when:
- Conjugate models suffice (Beta-Binomial, Normal-Normal)
- Speed is critical and the model is simple
- You only need point estimates and perhaps basic intervals
10.11 Bayesian vs. Frequentist Perspectives
This is not an ideological war. Both frameworks are tools, and understanding their differences helps you choose wisely.
10.11.1 Philosophical Differences
| Concept | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency of events | Degree of belief or plausibility |
| Parameters | Fixed but unknown constants | Random variables with distributions |
| Data | Random (could have been different) | Fixed (what we actually observed) |
| Inference | Estimators, p-values, confidence intervals | Posterior distributions, credible intervals |
| Uncertainty | From sampling variability | From incomplete knowledge |
10.11.2 Confidence Intervals vs. Credible Intervals
A 95% confidence interval means: if we repeated the experiment many times, 95% of the intervals we compute would contain the true parameter. It does not mean there is a 95% probability the parameter lies in this particular interval.
A 95% credible interval means: given the data and prior, there is a 95% posterior probability the parameter lies in this interval. This is often what practitioners actually want.
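The coverage interpretation of a confidence interval can be checked by simulation. A sketch, assuming a normal mean with known variance and illustrative parameter values:
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
true_mu, sigma, n = 3.0, 1.0, 20
z = stats.norm.ppf(0.975)
covered = 0
n_experiments = 10000
for _ in range(n_experiments):
    sample = rng.normal(true_mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)
    if abs(sample.mean() - true_mu) <= half_width:
        covered += 1
# Across repeated experiments, ~95% of the intervals contain the true mean
print(f"Empirical coverage: {covered / n_experiments:.3f}")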
10.11.3 When to Use Which
Favor frequentist methods when:
- Regulatory requirements demand them (clinical trials, FDA submissions)
- The model is simple and well-understood
- Prior information is genuinely absent or controversial
- Computational resources are severely limited
- Large sample sizes make the likelihood dominant anyway
Favor Bayesian methods when:
- Uncertainty quantification is critical
- Prior information is available and valuable
- Data is scarce (small sample sizes)
- The model is complex (hierarchical, latent variables)
- Sequential decision-making is needed
- You need to answer questions like "What is the probability that treatment A is better than treatment B?"
10.11.4 The Pragmatic View
As Andrew Gelman has argued, the Bayesian/frequentist distinction matters less than the quality of the model. A good Bayesian analysis with a bad model will give worse results than a good frequentist analysis with a well-specified model. Focus on:
- Does the model capture the important features of the data-generating process?
- Have you checked the model's assumptions against the data?
- Are your conclusions robust to reasonable perturbations?
In practice, most modern statisticians and AI engineers use both paradigms. The deep learning revolution, for instance, is largely frequentist in its point estimation and SGD optimization, but borrows Bayesian ideas for regularization (weight decay = Gaussian prior), dropout (approximate variational inference), and ensemble methods (approximate posterior predictive distributions).
10.12 Connection to Modern Generative Models
The probabilistic and Bayesian ideas in this chapter form the theoretical backbone of many cutting-edge AI systems. Understanding these connections helps you see how foundational concepts power state-of-the-art models.
10.12.1 Variational Autoencoders (VAEs)
Variational autoencoders, which we will study in Chapter 16, are a direct application of variational inference to deep generative models. A VAE defines a generative model:
$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$
where $\mathbf{z}$ is a latent variable, $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the prior, and $p(\mathbf{x} \mid \mathbf{z})$ is a decoder network. The posterior $p(\mathbf{z} \mid \mathbf{x})$ is intractable, so a recognition (encoder) network $q_\phi(\mathbf{z} \mid \mathbf{x})$ approximates it by maximizing the ELBO:
$$\text{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))$$
This is precisely the variational inference framework from Section 10.7, with the variational distribution parameterized by a neural network. The reparameterization trick (Section 10.7.4) enables end-to-end gradient-based training.
10.12.2 Diffusion Models
Diffusion models, the architecture behind modern image generators like DALL-E and Stable Diffusion, can be understood through the lens of hierarchical Bayesian models. They define a forward process that gradually adds Gaussian noise to data (a Markov chain), and learn the reverse process (a conditional Gaussian) that denoises step by step. The training objective is closely related to the ELBO, decomposed across time steps.
10.12.3 Bayesian Ideas in Large Language Models
Even large language models (LLMs), which are trained via maximum likelihood on massive datasets, borrow Bayesian intuitions:
- Temperature sampling (Section 10.7.3) mirrors temperature scaling in Bayesian posterior inference (a short sketch follows this list).
- Ensemble methods (training multiple models with different random seeds) approximate Bayesian model averaging.
- Calibration of LLM confidence scores is an active research area that applies the probability calibration ideas from Chapter 8.
- Retrieval-augmented generation can be viewed as conditioning the model on additional "evidence" --- analogous to updating a prior with new data.
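The temperature point is easy to make concrete. The sketch below applies a temperature to made-up next-token logits: dividing by $T < 1$ concentrates probability on the top token, while $T > 1$ spreads it out.
# Toy illustration of softmax temperature; the logits are invented
import numpy as np
def softmax_with_temperature(logits, T):
    z = (logits - logits.max()) / T           # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
logits = np.array([3.0, 1.5, 0.2, -1.0])      # hypothetical next-token scores
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))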
Understanding these connections helps you see that the Bayesian framework is not an isolated academic exercise but a living, evolving paradigm that continues to shape the frontier of AI.
10.13 Practical Bayesian Modeling with PyMC
To make the ideas in this chapter concrete, let us walk through a complete Bayesian modeling example using PyMC. We will model the relationship between advertising spend and sales using Bayesian linear regression, demonstrating the full workflow from prior specification to posterior analysis.
# Full Bayesian workflow with PyMC (requires PyMC installation)
import pymc as pm
import numpy as np
import arviz as az
# Generate synthetic advertising data
np.random.seed(42)
n = 50
ad_spend = np.random.uniform(10, 100, n)
true_intercept = 20.0
true_slope = 0.5
noise = np.random.normal(0, 5, n)
sales = true_intercept + true_slope * ad_spend + noise
with pm.Model() as advertising_model:
    # Step 1: Define priors
    # Weakly informative: we expect positive but uncertain relationship
    intercept = pm.Normal("intercept", mu=0, sigma=50)
    slope = pm.Normal("slope", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=10)
    # Step 2: Define likelihood
    mu = intercept + slope * ad_spend
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=sales)
    # Step 3: Prior predictive check
    prior_pred = pm.sample_prior_predictive(samples=200)
    # Step 4: Fit the model using NUTS
    trace = pm.sample(
        draws=2000,
        tune=1000,
        chains=4,
        return_inferencedata=True,
        random_seed=42,
    )
    # Step 5: Posterior predictive check
    post_pred = pm.sample_posterior_predictive(trace)
# Step 6: Diagnose convergence
print(az.summary(trace, var_names=["intercept", "slope", "sigma"]))
# Check R-hat values (should be < 1.01)
# Check ESS (should be > 400)
# Step 7: Interpret results
slope_samples = trace.posterior["slope"].values.flatten()
print(f"P(slope > 0) = {(slope_samples > 0).mean():.4f}")
print(f"Posterior mean slope: {slope_samples.mean():.3f}")
print(f"95% credible interval: "
      f"[{np.percentile(slope_samples, 2.5):.3f}, "
      f"{np.percentile(slope_samples, 97.5):.3f}]")
This example illustrates several key points:
- Prior specification: We use weakly informative priors that allow a wide range of values but prevent pathological behavior.
- Prior predictive checks: Before fitting, we verify that data simulated from the prior looks plausible.
- NUTS sampling: PyMC automatically selects the NUTS sampler, which adapts its step size during tuning.
- Convergence diagnostics: We check $\hat{R}$ (should be close to 1.0) and effective sample size (should be at least 400).
- Direct probability statements: Unlike frequentist confidence intervals, we can make statements like "there is a 99.8% posterior probability that the slope is positive."
10.14 Putting It All Together: A Bayesian Workflow
Andrew Gelman and colleagues have articulated a Bayesian workflow that goes well beyond just "run Bayes' theorem." Here is a summary suitable for AI engineering:
1. Formulate the model: Define the joint distribution $p(\theta, \mathcal{D})$. Draw a directed acyclic graph (DAG) showing dependencies.
2. Prior predictive checks: Sample from $p(\theta)$ and then from $p(\mathcal{D} \mid \theta)$. Does the simulated data look plausible? If not, adjust priors.
3. Fit the model: Use conjugate updates, MCMC, or VI as appropriate.
4. Diagnose computation: Check trace plots, $\hat{R}$, ESS, and divergences (for HMC).
5. Posterior predictive checks: Generate data from the fitted model and compare it to the actual data. Systematic discrepancies indicate model misspecification.
6. Evaluate and compare models: Use the marginal likelihood, WAIC, or LOO-CV to compare candidate models (see the code sketch at the end of this section).
7. Sensitivity analysis: Vary the priors and check robustness.
8. Communicate: Report posterior summaries (means, credible intervals), visualize posterior distributions, and clearly state assumptions.
This workflow is iterative. Posterior predictive checks often reveal problems that send you back to step 1 with a better model.
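Step 6 of the workflow can be made concrete with ArviZ. The sketch below refits the advertising model from Section 10.13 alongside a quadratic alternative and compares them with LOO-CV; the quadratic term and its prior are illustrative assumptions, and `idata_kwargs={"log_likelihood": True}` asks PyMC to store the pointwise log-likelihoods that `az.compare` needs.
# Sketch: comparing two candidate models with LOO-CV via ArviZ
import pymc as pm
import numpy as np
import arviz as az
np.random.seed(42)
n = 50
ad_spend = np.random.uniform(10, 100, n)
sales = 20.0 + 0.5 * ad_spend + np.random.normal(0, 5, n)
def fit(include_quadratic):
    with pm.Model():
        intercept = pm.Normal("intercept", mu=0, sigma=50)
        slope = pm.Normal("slope", mu=0, sigma=5)
        mu = intercept + slope * ad_spend
        if include_quadratic:
            quad = pm.Normal("quad", mu=0, sigma=1)   # illustrative extra term
            mu = mu + quad * ad_spend**2
        sigma = pm.HalfNormal("sigma", sigma=10)
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=sales)
        # Pointwise log-likelihoods are needed for LOO-CV
        return pm.sample(1000, tune=1000, chains=4, random_seed=42,
                         idata_kwargs={"log_likelihood": True})
linear_idata = fit(include_quadratic=False)
quadratic_idata = fit(include_quadratic=True)
# Higher elpd_loo means better estimated out-of-sample predictive fit
print(az.compare({"linear": linear_idata, "quadratic": quadratic_idata}))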
10.15 Practical Guidelines for Bayesian AI Engineering
Before concluding, here are practical guidelines distilled from years of applied Bayesian modeling:
Start simple, add complexity gradually. Begin with the simplest model that captures the essential structure of your problem. Add hierarchical structure, non-linear components, or mixture models only when posterior predictive checks reveal systematic discrepancies.
Always check your priors. Run prior predictive checks: sample parameters from the prior, simulate data, and verify that the simulated data covers plausible scenarios. If your prior allows predictions of negative house prices or humans taller than 10 meters, the prior is wrong.
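As a concrete example, here is a prior predictive check done by hand with numpy for the advertising priors from Section 10.13; the plausibility thresholds (negative sales, sales above 1,000 units) are illustrative assumptions about what would count as absurd.
# Prior predictive check "by hand" for the Section 10.13 advertising priors
import numpy as np
rng = np.random.default_rng(0)
n_sims = 1000
ad_spend = np.linspace(10, 100, 20)
intercept = rng.normal(0, 50, n_sims)
slope = rng.normal(0, 5, n_sims)
sigma = np.abs(rng.normal(0, 10, n_sims))     # half-normal prior on the noise scale
sim_sales = (intercept[:, None] + slope[:, None] * ad_spend
             + rng.normal(0, sigma[:, None], (n_sims, ad_spend.size)))
print(f"Fraction of negative simulated sales: {(sim_sales < 0).mean():.2f}")
print(f"Fraction above 1,000 units: {(sim_sales > 1000).mean():.2f}")
# A weakly informative prior tolerates some spread, but if most simulated
# datasets are absurd, tighten the priors before touching the real data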
Use informative priors when you have information. The "non-informative prior" is a myth in most practical settings. You almost always know something: coefficients are probably single-digit, variances are positive, probabilities are between 0 and 1. Encode this knowledge.
Run multiple chains and check convergence. For MCMC, always run at least 4 chains from different starting points. Check $\hat{R} < 1.01$ and ESS $> 400$ for all parameters. Divergences in HMC are warnings that the sampler is struggling---never ignore them.
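Assuming the `trace` InferenceData from the Section 10.13 example is still in scope, these checks take only a few lines with ArviZ; the thresholds are the rules of thumb quoted above.
# Convergence checks on the Section 10.13 InferenceData (assumes `trace` exists)
import arviz as az
rhat = az.rhat(trace)
ess = az.ess(trace)
for var in ["intercept", "slope", "sigma"]:
    r, e = float(rhat[var]), float(ess[var])
    status = "OK" if (r < 1.01 and e > 400) else "INVESTIGATE"
    print(f"{var}: R-hat={r:.3f}, ESS={e:.0f} -> {status}")
# NUTS records divergent transitions; any nonzero count is worth investigating
print("Divergences:", int(trace.sample_stats["diverging"].sum()))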
Validate with posterior predictive checks. Generate data from the fitted model and compare to the actual data. Use both global summaries (mean, variance, quantiles) and domain-specific statistics (e.g., the number of zero observations, the correlation between features).
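A minimal check along these lines, assuming `post_pred` and `sales` from Section 10.13 are in scope, compares a few summary statistics of the observed data to the same statistics computed on each replicated dataset.
# Posterior predictive check using summary statistics (assumes `post_pred`
# and `sales` from Section 10.13 are in scope)
import numpy as np
y_rep = post_pred.posterior_predictive["y_obs"].values.reshape(-1, sales.size)
for name, stat in [("mean", np.mean), ("std", np.std), ("max", np.max)]:
    rep_stats = stat(y_rep, axis=1)
    p_val = (rep_stats >= stat(sales)).mean()  # posterior predictive p-value
    print(f"{name}: observed={stat(sales):.2f}, "
          f"replicated mean={rep_stats.mean():.2f}, p={p_val:.2f}")
# p-values near 0 or 1 flag features of the data the model systematically misses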
Consider computational cost early. MCMC is impractical for models with millions of parameters or millions of data points. Plan your inference strategy---MCMC, VI, or hybrid---based on your computational budget and accuracy requirements.
Communicate uncertainty honestly. Report full posterior distributions or credible intervals, not just point estimates. Decision-makers benefit from knowing the range of plausible outcomes, not just the most likely one.
10.16 Summary
This chapter has equipped you with a powerful toolkit for reasoning under uncertainty. The Bayesian paradigm is not merely an alternative to frequentist methods---it is a complete framework for learning from data that naturally handles uncertainty, incorporates prior knowledge, and provides coherent answers to decision-making questions.
Here are the key ideas:
- Bayesian thinking provides a coherent framework for updating beliefs with data, with the posterior as the central object of inference.
- Naive Bayes demonstrates that even crude probabilistic models can be remarkably effective classifiers, especially in high-dimensional settings.
- Bayesian linear regression extends ordinary regression with full uncertainty quantification, connecting regularization to prior distributions.
- Prior selection is both art and science, guided by domain knowledge, sensitivity analysis, and prior predictive checks.
- Conjugate priors enable exact posterior computation for a useful family of models.
- MCMC provides asymptotically exact posterior samples for arbitrary models, with Metropolis-Hastings as the foundational algorithm and HMC/NUTS as the modern workhorses.
- Variational inference offers a scalable alternative to MCMC, trading exactness for speed through optimization.
- Gaussian processes define distributions over functions, providing elegant nonparametric regression with uncertainty.
- Probabilistic programming automates the Bayesian workflow, letting you focus on modeling rather than inference mechanics.
In the next chapter, we will explore how ensemble methods combine multiple models to achieve greater accuracy and robustness -- a theme that connects naturally to Bayesian model averaging, where we weight predictions by posterior model probabilities.
The probabilistic toolkit you have built in this chapter---from conjugate analysis through MCMC and variational inference to Gaussian processes and Bayesian neural networks---will serve as the foundation for understanding many advanced topics in the remainder of this book. Variational inference powers the generative models of Chapter 16. Bayesian optimization enables efficient hyperparameter tuning (Chapter 8). Uncertainty quantification is essential for deploying responsible AI systems (Chapter 20). And the probabilistic programming paradigm continues to grow in importance as models become more complex and the need for principled uncertainty quantification becomes more pressing.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1), 110-120.
- Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3), 103-130.
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
- Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593-1623.