In This Chapter
- Learning Objectives
- 20.1 The Wrong Debate
- 20.2 Bayes' Theorem as Belief Updating
- 20.3 The Mechanics: Prior, Likelihood, Posterior
- 20.4 Conjugate Priors: When the Math Works Out
- 20.5 The MAP-MLE-Regularization Triangle
- 20.6 Prior Selection: A Practical Framework
- 20.7 Bayesian Updating and Sequential Inference
- 20.8 Credible Intervals vs. Confidence Intervals
- 20.9 Bayesian Model Comparison
- 20.10 When Bayesian Methods Add Value
- 20.11 The StreamRec Application: Bayesian User Preferences
- 20.12 The Pharma Application: Bayesian Treatment Effect Reasoning
- 20.13 Looking Ahead: When Conjugacy Fails
- Summary
- Chapter 20 Notation Reference
Chapter 20: Bayesian Thinking — Priors, Posteriors, and Why Frequentist vs. Bayesian Is the Wrong Debate
"The subjectivist states his judgments, whereas the objectivist sweeps them under the carpet by calling assumptions 'knowledge' and 'facts' whenever he can get away with it." — I.J. Good, Good Thinking: The Foundations of Probability and Its Applications (1983)
Learning Objectives
By the end of this chapter, you will be able to:
- Apply Bayes' theorem to update beliefs with data and interpret posterior distributions
- Select appropriate priors (uninformative, weakly informative, informative) and explain their influence on the posterior
- Derive conjugate posterior distributions for common likelihood-prior pairs
- Explain the connection between MAP estimation, MLE, and regularization
- Articulate when Bayesian methods add value over frequentist alternatives and when they don't
20.1 The Wrong Debate
Every statistics course you have ever taken has either avoided the frequentist-Bayesian debate or treated it as a philosophical argument about the nature of probability. Frequentists insist that probability is long-run frequency. Bayesians insist that probability is degree of belief. Textbooks take sides, students pick teams, and meanwhile the actual work of data science continues using whichever approach produces the most useful answer for the problem at hand.
This chapter refuses to pick a team. Here is the position we will defend: frequentist and Bayesian methods are different tools for different problems. They agree far more often than they disagree. When they do disagree, the disagreement is almost always about something practical — how to incorporate prior knowledge, how to handle small samples, how to report uncertainty — and the right answer depends on the context, not on philosophical commitments.
A frequentist confidence interval and a Bayesian credible interval often cover similar regions. A maximum likelihood estimate and a posterior mean under a flat prior are the same number. The differences emerge in the margins — and the margins are exactly where real problems live: small datasets where prior knowledge matters, sequential decisions where you must update beliefs in real time, hierarchical structures where partial pooling improves every group's estimate.
By the end of this chapter, you will be comfortable using Bayesian methods when they add value, frequentist methods when they suffice, and recognizing the difference. The first step is understanding what Bayesian inference actually is.
20.2 Bayes' Theorem as Belief Updating
The Theorem
Bayes' theorem is a consequence of the product rule of probability. For a parameter $\theta$ and observed data $D$:
$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)}$$
Each term has a name and a role:
| Symbol | Name | Role |
|---|---|---|
| $p(\theta)$ | Prior | What you believe about $\theta$ before seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable the data are if $\theta$ were true |
| $p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta$ | Marginal likelihood (evidence) | Total probability of the data, averaging over all $\theta$ |
| $p(\theta \mid D)$ | Posterior | What you believe about $\theta$ after seeing data |
The marginal likelihood $p(D)$ does not depend on $\theta$. It is a normalizing constant that ensures the posterior integrates to 1. For most purposes, we can write:
$$p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$$
This proportionality — posterior is proportional to likelihood times prior — is the working form of Bayes' theorem. It says: take what you believed before (the prior), multiply by how well those beliefs explain the data (the likelihood), and normalize to get what you believe now (the posterior).
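This proportionality is also how posteriors are computed numerically when no closed form exists: evaluate likelihood times prior on a grid and normalize. A minimal sketch, assuming a Binomial likelihood (7 successes in 10 trials) and a uniform prior, checked against the Beta(8, 4) closed form derived later in this chapter:

```python
import numpy as np
from scipy import stats

# Grid approximation: evaluate likelihood * prior, then normalize numerically.
# Assumed setup: Binomial likelihood (7 successes in 10 trials), Beta(1, 1) prior.
theta = np.linspace(1e-4, 1 - 1e-4, 10_000)
prior = stats.beta.pdf(theta, 1, 1)
likelihood = stats.binom.pmf(7, 10, theta)

unnormalized = likelihood * prior
# Riemann-sum normalization so the grid posterior integrates to 1
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

# The closed-form conjugate answer is Beta(8, 4); the grid should match it.
exact = stats.beta.pdf(theta, 8, 4)
print(np.max(np.abs(posterior - exact)))
```

The same three-line recipe works for any one-dimensional parameter; conjugacy only spares you the normalization step.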
A Concrete Example: Does This Coin Cheat?
Suppose you have a coin and want to estimate $\theta$, the probability of heads. You flip the coin 10 times and observe 7 heads.
Frequentist approach. The MLE is $\hat{\theta}_{\text{MLE}} = 7/10 = 0.70$. A 95% confidence interval (using the Wald method) is approximately $0.70 \pm 1.96 \sqrt{0.70 \cdot 0.30 / 10} = [0.42, 0.98]$.
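For reference, the Wald computation takes two lines (a quick sketch; 1.96 is the normal 97.5% quantile):

```python
import numpy as np

# Wald 95% confidence interval for a binomial proportion: 7 heads in 10 flips.
k, n = 7, 10
p_hat = k / n                          # MLE = 0.70
se = np.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"MLE = {p_hat:.2f}, 95% Wald CI = [{lower:.2f}, {upper:.2f}]")
# → MLE = 0.70, 95% Wald CI = [0.42, 0.98]
```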
Bayesian approach. You need a prior. If you have no strong opinion, you might use $\theta \sim \text{Beta}(1, 1)$, which is uniform on $[0, 1]$. The likelihood for $k = 7$ heads in $n = 10$ flips is Binomial: $p(D \mid \theta) = \binom{10}{7} \theta^7 (1 - \theta)^3$. Applying Bayes' theorem (we will derive this fully in Section 20.4):
$$p(\theta \mid D) = \text{Beta}(\theta \mid 1 + 7, 1 + 3) = \text{Beta}(\theta \mid 8, 4)$$
The posterior mean is $8/(8+4) = 0.667$, and the 95% credible interval is $[0.39, 0.90]$.
Notice two things. First, the posterior mean (0.667) is pulled slightly toward 0.5 compared to the MLE (0.70): the uniform prior contributes one "pseudo-observation" of each outcome, shrinking the estimate toward the prior mean. Second, the credible interval is narrower than the confidence interval, partly because the pseudo-observations add a little information, and partly because the normal approximation behind the Wald interval is crude at $n = 10$.
Now suppose you are a casino inspector, and this coin comes from a reputable manufacturer. You know from testing thousands of coins that the manufacturing process produces coins with $\theta$ values tightly clustered around 0.50, with standard deviation about 0.014. Your prior should encode this knowledge:
$$\theta \sim \text{Beta}(625, 625)$$
This Beta distribution has mean 0.50 and standard deviation approximately 0.014. The posterior after observing 7/10 heads is:
$$p(\theta \mid D) = \text{Beta}(\theta \mid 625 + 7, 625 + 3) = \text{Beta}(\theta \mid 632, 628)$$
The posterior mean is $632 / (632 + 628) = 0.5016$. The data barely moved the posterior because the prior was so concentrated — and that is the right answer if you have genuinely tested thousands of coins. Ten flips should not override the evidence from thousands of prior tests.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from typing import Tuple
def plot_prior_likelihood_posterior(
alpha_prior: float,
beta_prior: float,
n_heads: int,
n_tails: int,
title: str = ""
) -> None:
"""Visualize the prior, (scaled) likelihood, and posterior for Beta-Binomial.
Args:
alpha_prior: Alpha parameter of the Beta prior.
beta_prior: Beta parameter of the Beta prior.
n_heads: Number of observed heads.
n_tails: Number of observed tails.
title: Plot title.
"""
theta = np.linspace(0.001, 0.999, 500)
# Prior
prior = stats.beta.pdf(theta, alpha_prior, beta_prior)
# Posterior (conjugate update)
alpha_post = alpha_prior + n_heads
beta_post = beta_prior + n_tails
posterior = stats.beta.pdf(theta, alpha_post, beta_post)
# Likelihood (scaled to match posterior height for visualization)
log_likelihood = n_heads * np.log(theta) + n_tails * np.log(1 - theta)
likelihood = np.exp(log_likelihood - log_likelihood.max())
likelihood *= posterior.max() / likelihood.max()
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(theta, prior, "--", label=f"Prior: Beta({alpha_prior}, {beta_prior})", linewidth=2)
ax.plot(theta, likelihood, ":", label=f"Likelihood ({n_heads}H, {n_tails}T)", linewidth=2)
ax.plot(theta, posterior, "-", label=f"Posterior: Beta({alpha_post}, {beta_post})", linewidth=2)
ax.set_xlabel(r"$\theta$ (probability of heads)")
ax.set_ylabel("Density")
ax.set_title(title or f"Beta-Binomial update: {n_heads}H, {n_tails}T")
ax.legend()
ax.set_xlim(0, 1)
plt.tight_layout()
plt.show()
# Weak prior: Beta(1, 1) = Uniform
plot_prior_likelihood_posterior(1, 1, 7, 3, "Weak prior: Beta(1,1)")
# Strong prior: casino inspector
plot_prior_likelihood_posterior(625, 625, 7, 3, "Strong prior: Beta(625, 625)")
The two plots tell the entire story of Bayesian inference. With a weak prior, the posterior is dominated by the data. With a strong prior based on genuine knowledge, the posterior respects both the prior and the data, weighting each by its precision. Neither plot is "wrong." They answer different questions given different states of knowledge.
Common Misconception: "Priors are subjective, so Bayesian analysis is subjective." All statistical analysis involves choices — the model, the features, the significance threshold, the loss function. A frequentist who chooses to run a one-sided test at $\alpha = 0.10$ has made a subjective decision that affects the conclusion just as much as a prior does. Bayesian inference makes its assumptions explicit through the prior; frequentist inference buries them in the model specification and testing procedure.
20.3 The Mechanics: Prior, Likelihood, Posterior
Prior Distributions
The prior $p(\theta)$ encodes what you know about the parameter before observing data. Selecting the prior is the most discussed aspect of Bayesian inference, and the source of most objections from skeptics. We will address prior selection systematically in Section 20.6. For now, the key taxonomy:
| Prior type | Description | Example | When to use |
|---|---|---|---|
| Uninformative | Attempts to "let the data speak" by being maximally diffuse | $\text{Beta}(1, 1)$, $\mathcal{N}(0, 10^6)$ | When you truly lack prior knowledge and want results close to MLE |
| Weakly informative | Rules out implausible values but remains broad | $\mathcal{N}(0, 10)$ for a standardized coefficient | Default choice in most applied work; provides mild regularization |
| Informative | Encodes specific domain knowledge | $\text{Beta}(625, 625)$ for a manufactured coin | When you have genuine prior evidence (meta-analyses, physical constraints, historical data) |
Likelihood
The likelihood $p(D \mid \theta)$ is the probability of the observed data under a specific parameter value. It is the same function that appears in maximum likelihood estimation — the difference is what you do with it. MLE maximizes the likelihood over $\theta$. Bayesian inference multiplies the likelihood by the prior and normalizes.
For $n$ independent observations $D = \{x_1, x_2, \ldots, x_n\}$:
$$p(D \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
And the log-likelihood (which we will use repeatedly):
$$\log p(D \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
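The log form matters computationally, not just algebraically: a product of thousands of per-observation probabilities underflows double precision, while the sum of logs stays perfectly representable. A small sketch with assumed Bernoulli data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=5000)  # 5000 Bernoulli(0.3) observations
theta = 0.3

# Direct product of per-observation likelihoods underflows to exactly 0.0.
direct = np.prod(stats.bernoulli.pmf(x, theta))

# Sum of log-likelihoods: a finite, representable negative number.
log_lik = np.sum(stats.bernoulli.logpmf(x, theta))

print(direct)   # 0.0
print(log_lik)  # roughly -3000 (depends on the draw)
```

This is why every practical implementation, Bayesian or frequentist, works on the log scale.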
Posterior Distributions
The posterior $p(\theta \mid D)$ is a distribution, not a point estimate. This is the fundamental difference from frequentist inference: the output is a complete probability distribution over the parameter space, encoding both the best guess and the uncertainty about that guess.
From the posterior you can extract:
- Posterior mean: $\mathbb{E}[\theta \mid D] = \int \theta \, p(\theta \mid D) \, d\theta$ — the expected value of $\theta$
- Posterior mode (MAP): $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid D)$ — the most probable value
- Credible interval: an interval $[a, b]$ such that $\int_a^b p(\theta \mid D) \, d\theta = 1 - \alpha$
- Highest Posterior Density Interval (HPDI): the narrowest interval containing $1 - \alpha$ probability mass
from scipy.optimize import minimize_scalar
def summarize_posterior(
alpha: float, beta_param: float, credible_level: float = 0.95
) -> dict:
"""Compute summary statistics for a Beta posterior.
Args:
alpha: Alpha parameter of the Beta posterior.
beta_param: Beta parameter of the Beta posterior.
credible_level: Probability mass for credible intervals.
Returns:
Dictionary with posterior mean, mode, equal-tailed CI, and HPDI.
"""
dist = stats.beta(alpha, beta_param)
tail = (1 - credible_level) / 2
# Mean and mode
mean = alpha / (alpha + beta_param)
mode = (alpha - 1) / (alpha + beta_param - 2) if alpha > 1 and beta_param > 1 else np.nan
# Equal-tailed credible interval
et_lower = dist.ppf(tail)
et_upper = dist.ppf(1 - tail)
# HPDI: find the narrowest interval containing credible_level mass
# For unimodal distributions, HPDI is shorter than the equal-tailed interval
def interval_width(lower_tail: float) -> float:
return dist.ppf(lower_tail + credible_level) - dist.ppf(lower_tail)
result = minimize_scalar(
interval_width,
bounds=(0, 1 - credible_level),
method="bounded"
)
hpdi_lower = dist.ppf(result.x)
hpdi_upper = dist.ppf(result.x + credible_level)
return {
"mean": mean,
"mode": mode,
"equal_tailed_ci": (et_lower, et_upper),
"hpdi": (hpdi_lower, hpdi_upper),
}
# Posterior from weak prior + 7/10 heads: Beta(8, 4)
summary_weak = summarize_posterior(8, 4)
print("Weak prior posterior (Beta(8, 4)):")
print(f" Mean: {summary_weak['mean']:.4f}")
print(f" Mode: {summary_weak['mode']:.4f}")
print(f" 95% ET CI: [{summary_weak['equal_tailed_ci'][0]:.4f}, "
f"{summary_weak['equal_tailed_ci'][1]:.4f}]")
print(f" 95% HPDI: [{summary_weak['hpdi'][0]:.4f}, {summary_weak['hpdi'][1]:.4f}]")
# Posterior from strong prior + 7/10 heads: Beta(632, 628)
summary_strong = summarize_posterior(632, 628)
print("\nStrong prior posterior (Beta(632, 628)):")
print(f" Mean: {summary_strong['mean']:.4f}")
print(f" Mode: {summary_strong['mode']:.4f}")
print(f" 95% ET CI: [{summary_strong['equal_tailed_ci'][0]:.4f}, "
f"{summary_strong['equal_tailed_ci'][1]:.4f}]")
print(f" 95% HPDI: [{summary_strong['hpdi'][0]:.4f}, {summary_strong['hpdi'][1]:.4f}]")
Weak prior posterior (Beta(8, 4)):
Mean: 0.6667
Mode: 0.7000
95% ET CI: [0.3899, 0.8979]
95% HPDI: [0.3994, 0.9043]
Strong prior posterior (Beta(632, 628)):
Mean: 0.5016
Mode: 0.5016
95% ET CI: [0.4738, 0.5293]
95% HPDI: [0.4738, 0.5293]
Prior Predictive and Posterior Predictive Distributions
Two distributions that are critical for model checking but often overlooked:
Prior predictive distribution. Before seeing any data, what kind of data does your model-plus-prior combination predict?
$$p(\tilde{x}) = \int p(\tilde{x} \mid \theta) \, p(\theta) \, d\theta$$
If the prior predictive assigns high probability to data that is physically impossible or empirically absurd, your prior is poorly chosen. This check catches bad priors before you waste computation on inference.
Posterior predictive distribution. After observing data $D$, what does the model predict for new observations?
$$p(\tilde{x} \mid D) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid D) \, d\theta$$
The posterior predictive integrates over posterior uncertainty in $\theta$. This is the Bayesian way to make predictions: instead of plugging in a point estimate $\hat{\theta}$, you average over all plausible parameter values, weighted by their posterior probability. The result is predictions that account for parameter uncertainty — wider prediction intervals when you are less certain about $\theta$.
def prior_vs_posterior_predictive(
alpha_prior: float,
beta_prior: float,
n_heads: int,
n_tails: int,
n_future: int = 20,
n_samples: int = 50000,
seed: int = 42,
) -> Tuple[np.ndarray, np.ndarray]:
"""Compare prior and posterior predictive distributions.
Simulates the number of heads in n_future future coin flips under
the prior and posterior predictive distributions.
Args:
alpha_prior: Alpha parameter of the Beta prior.
beta_prior: Beta parameter of the Beta prior.
n_heads: Observed heads.
n_tails: Observed tails.
n_future: Number of future flips to predict.
n_samples: Number of Monte Carlo samples.
seed: Random seed.
Returns:
Tuple of (prior_predictive_samples, posterior_predictive_samples).
"""
rng = np.random.RandomState(seed)
# Prior predictive: sample theta from prior, then sample data
theta_prior = rng.beta(alpha_prior, beta_prior, n_samples)
prior_pred = rng.binomial(n_future, theta_prior)
# Posterior predictive: sample theta from posterior, then sample data
alpha_post = alpha_prior + n_heads
beta_post = beta_prior + n_tails
theta_post = rng.beta(alpha_post, beta_post, n_samples)
posterior_pred = rng.binomial(n_future, theta_post)
return prior_pred, posterior_pred
prior_pred, posterior_pred = prior_vs_posterior_predictive(1, 1, 7, 3)
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
bins = np.arange(-0.5, 21.5, 1)
axes[0].hist(prior_pred, bins=bins, density=True, alpha=0.7, color="steelblue")
axes[0].set_title("Prior Predictive: Next 20 flips")
axes[0].set_xlabel("Number of heads")
axes[0].set_ylabel("Probability")
axes[1].hist(posterior_pred, bins=bins, density=True, alpha=0.7, color="darkorange")
axes[1].set_title("Posterior Predictive: Next 20 flips")
axes[1].set_xlabel("Number of heads")
plt.tight_layout()
plt.show()
The prior predictive under $\text{Beta}(1, 1)$ is nearly uniform — any number of heads from 0 to 20 is roughly equally likely, reflecting total ignorance. The posterior predictive after observing 7/10 heads concentrates around 12-15 heads in the next 20 flips, reflecting what we have learned. If the posterior predictive does not resemble the actual patterns in your data, your model is wrong.
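For this conjugate model the Monte Carlo histograms can be checked exactly: integrating the Binomial likelihood over a Beta posterior yields the Beta-Binomial distribution, which `scipy` implements directly. A sketch for the posterior predictive above:

```python
import numpy as np
from scipy import stats

# Posterior after 7H/3T under a Beta(1, 1) prior is Beta(8, 4); the predictive
# for the number of heads in 20 future flips is Beta-Binomial(n=20, a=8, b=4).
pred = stats.betabinom(20, 8, 4)

print(f"Predictive mean: {pred.mean():.2f}")  # 20 * 8/12 = 13.33
mode = max(range(21), key=pred.pmf)
print(f"Most probable head count: {mode}")
```

Exact predictive distributions like this are one of the payoffs of conjugacy: no sampling noise, no Monte Carlo budget.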
20.4 Conjugate Priors: When the Math Works Out
A prior is conjugate to a likelihood if the posterior belongs to the same distributional family as the prior. Conjugacy is a mathematical convenience: it gives closed-form posteriors that can be computed with pencil and paper, without numerical integration or sampling.
Beta-Binomial Conjugacy
Likelihood: $D \sim \text{Binomial}(n, \theta)$, so $p(D \mid \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$
Prior: $\theta \sim \text{Beta}(\alpha, \beta)$, so $p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$
Derivation of the posterior:
$$p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$$ $$= \binom{n}{k} \theta^k (1 - \theta)^{n-k} \cdot \frac{1}{B(\alpha, \beta)} \theta^{\alpha-1} (1 - \theta)^{\beta-1}$$
Drop constants that do not depend on $\theta$:
$$\propto \theta^{k + \alpha - 1} (1 - \theta)^{n - k + \beta - 1}$$
This is the kernel of a Beta distribution. Therefore:
$$\boxed{\theta \mid D \sim \text{Beta}(\alpha + k, \, \beta + n - k)}$$
The update rule is remarkably intuitive: $\alpha$ counts "prior successes" and $\beta$ counts "prior failures." After observing $k$ successes in $n$ trials, add $k$ to $\alpha$ and $n - k$ to $\beta$. The prior parameters $\alpha$ and $\beta$ act as pseudo-counts — equivalent to having already observed $\alpha - 1$ successes and $\beta - 1$ failures before the experiment began.
The posterior mean is:
$$\mathbb{E}[\theta \mid D] = \frac{\alpha + k}{\alpha + \beta + n}$$
This can be rewritten as a weighted average of the prior mean and the MLE:
$$\mathbb{E}[\theta \mid D] = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{weight on prior}} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{weight on data}} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}$$
As $n \to \infty$, the weight on the data approaches 1, and the posterior mean converges to the MLE regardless of the prior. The data overwhelm the prior. This is one sense in which "the prior does not matter" for large datasets — but for small datasets, the prior matters a great deal.
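The convergence is easy to watch numerically. Holding the empirical frequency fixed at $k/n = 0.7$ and growing $n$ under an assumed Beta(10, 10) prior (prior mean 0.5):

```python
# Posterior mean as a weighted compromise between prior mean and MLE:
# Beta(10, 10) prior, data frequency fixed at 0.7, sample size growing.
a, b = 10.0, 10.0
means = []
for n in [0, 10, 100, 1000, 10_000]:
    k = 0.7 * n
    post_mean = (a + k) / (a + b + n)
    means.append(post_mean)
    print(f"n = {n:>6}: posterior mean = {post_mean:.4f}")
# Marches from 0.5 (pure prior) toward the MLE 0.7 as data accumulate.
```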
Normal-Normal Conjugacy
Likelihood: $x_1, \ldots, x_n \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known
Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$
Derivation of the posterior:
$$p(\mu \mid D) \propto p(D \mid \mu) \, p(\mu)$$
$$\propto \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right) \cdot \exp\left(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right)$$
Combine the exponents. After completing the square in $\mu$ (collecting all terms involving $\mu$ and $\mu^2$):
$$p(\mu \mid D) \propto \exp\left(-\frac{1}{2\sigma_n^2}(\mu - \mu_n)^2\right)$$
where
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
$$\mu_n = \sigma_n^2 \left(\frac{n \bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)$$
Therefore:
$$\boxed{\mu \mid D \sim \mathcal{N}(\mu_n, \sigma_n^2)}$$
The posterior precision (inverse variance) is the sum of the data precision and the prior precision. The posterior mean is a precision-weighted average of the prior mean and the sample mean. Again, as $n \to \infty$, the data precision dominates, and the posterior concentrates around $\bar{x}$.
def normal_normal_update(
prior_mu: float,
prior_sigma: float,
likelihood_sigma: float,
data: np.ndarray,
) -> Tuple[float, float]:
"""Compute Normal-Normal conjugate posterior.
Args:
prior_mu: Prior mean.
prior_sigma: Prior standard deviation.
likelihood_sigma: Known standard deviation of the data.
data: Observed data points.
Returns:
Tuple of (posterior_mean, posterior_std).
"""
n = len(data)
xbar = data.mean()
prior_precision = 1.0 / prior_sigma**2
data_precision = n / likelihood_sigma**2
posterior_precision = prior_precision + data_precision
posterior_sigma = 1.0 / np.sqrt(posterior_precision)
posterior_mu = (prior_precision * prior_mu + data_precision * xbar) / posterior_precision
return posterior_mu, posterior_sigma
# Example: estimating average session duration (minutes)
# Prior: average user watches about 25 min, uncertainty of 10 min
# Data: 15 users observed, known population sigma = 12 min
rng = np.random.RandomState(42)
observed_sessions = rng.normal(loc=30, scale=12, size=15)
post_mu, post_sigma = normal_normal_update(
prior_mu=25.0, prior_sigma=10.0, likelihood_sigma=12.0, data=observed_sessions
)
mle = observed_sessions.mean()
print(f"Prior: mu = 25.00, sigma = 10.00")
print(f"MLE: {mle:.2f}")
print(f"Posterior: mu = {post_mu:.2f}, sigma = {post_sigma:.2f}")
print(f"95% CI: [{post_mu - 1.96*post_sigma:.2f}, {post_mu + 1.96*post_sigma:.2f}]")
Prior: mu = 25.00, sigma = 10.00
MLE: 30.12
Posterior: mu = 29.68, sigma = 2.96
95% CI: [23.87, 35.48]
Summary of Common Conjugate Pairs
| Likelihood | Conjugate Prior | Posterior | Interpretation |
|---|---|---|---|
| Binomial($n, \theta$) | Beta($\alpha, \beta$) | Beta($\alpha + k, \beta + n - k$) | Add observed successes/failures to pseudo-counts |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum x_i, \beta + n$) | Add observed counts and sample size |
| Normal($\mu$, known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_n, \sigma_n^2$) | Precision-weighted average of prior and data |
| Normal(known $\mu$, $\sigma^2$) | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma($\alpha + n/2, \beta + \text{SS}/2$) | Add degrees of freedom and sum of squares |
| Exponential($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + n, \beta + \sum x_i$) | Add observation count and total waiting time |
| Multinomial($\boldsymbol{\theta}$) | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{k}$) | Add observed category counts to pseudo-counts |
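Any row of the table can be sanity-checked with a grid computation. As an illustration with assumed toy numbers, the Poisson-Gamma row (note the rate parameterization: a Gamma rate $\beta$ corresponds to `scale=1/beta` in `scipy`):

```python
import numpy as np
from scipy import stats

# Poisson-Gamma conjugacy check: prior Gamma(alpha=2, rate beta=1), observed
# counts x. The table says posterior = Gamma(alpha + sum(x), beta + n).
alpha, beta = 2.0, 1.0
x = np.array([3, 5, 4, 6, 2])
alpha_post, beta_post = alpha + x.sum(), beta + len(x)  # Gamma(22, 6)

lam = np.linspace(1e-3, 15, 20_000)
prior = stats.gamma.pdf(lam, a=alpha, scale=1 / beta)
likelihood = np.prod(stats.poisson.pmf(x[:, None], lam), axis=0)
grid_post = prior * likelihood
grid_post /= grid_post.sum() * (lam[1] - lam[0])  # normalize numerically

exact = stats.gamma.pdf(lam, a=alpha_post, scale=1 / beta_post)
print(np.max(np.abs(grid_post - exact)))  # agrees up to grid error
```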
Mathematical Foundation: Conjugacy exists because the likelihood and prior belong to the same exponential family. The exponential family form is $p(x \mid \theta) = h(x) \exp(\eta(\theta)^\top T(x) - A(\theta))$, where $T(x)$ is the sufficient statistic. The conjugate prior has the form $p(\theta) \propto \exp(\eta(\theta)^\top \boldsymbol{\chi} - \nu A(\theta))$, where $\boldsymbol{\chi}$ and $\nu$ are hyperparameters. After observing $n$ data points, the posterior updates $\boldsymbol{\chi} \to \boldsymbol{\chi} + \sum T(x_i)$ and $\nu \to \nu + n$. Every conjugate pair in the table above is a special case of this update.
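As a worked instance of this general update, consider the Bernoulli likelihood. Writing it in exponential family form:

$$p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} = \exp\!\left(x \log\frac{\theta}{1 - \theta} + \log(1 - \theta)\right)$$

identifies $\eta(\theta) = \log\frac{\theta}{1 - \theta}$, $T(x) = x$, and $A(\theta) = -\log(1 - \theta)$. The conjugate prior is then

$$p(\theta) \propto \exp\big(\chi \, \eta(\theta) - \nu A(\theta)\big) = \theta^{\chi} (1 - \theta)^{\nu - \chi},$$

the kernel of $\text{Beta}(\chi + 1, \nu - \chi + 1)$. The generic update $\chi \to \chi + \sum_i x_i = \chi + k$ and $\nu \to \nu + n$ reproduces exactly the Beta-Binomial rule $\text{Beta}(\alpha + k, \beta + n - k)$ with $\alpha = \chi + 1$ and $\beta = \nu - \chi + 1$.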
Why Conjugacy Is Not Enough
Conjugacy gives closed-form posteriors for a limited set of models. Most real-world models are not conjugate: logistic regression, neural networks, hierarchical models with non-conjugate priors, mixture models. For these, you need computational methods — Markov chain Monte Carlo (MCMC) or variational inference. Chapter 21 covers these methods in depth.
But conjugate models are not merely pedagogical toys. They serve as building blocks for more complex models (the Dirichlet-Multinomial is the foundation of topic models), they provide exact posteriors for testing approximate inference algorithms, and they are computationally efficient for real-time applications where MCMC would be too slow. StreamRec's per-user preference estimation, which we build in this chapter's progressive project, is a conjugate model that updates in microseconds.
20.5 The MAP-MLE-Regularization Triangle
One of the most important insights in applied machine learning is the connection between three ideas that are usually taught in different courses: maximum likelihood estimation (from statistics), regularization (from machine learning), and maximum a posteriori estimation (from Bayesian statistics). They are the same thing viewed from different angles.
MAP Estimation
The MAP estimate maximizes the posterior:
$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(\theta \mid D) = \arg\max_\theta \, p(D \mid \theta) \, p(\theta)$$
Taking the log:
$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) + \log p(\theta)\right]$$
This is maximizing the log-likelihood plus a penalty term that comes from the prior. The prior acts as a regularizer.
Gaussian Prior = L2 Regularization
Suppose the prior is $\theta_j \overset{\text{iid}}{\sim} \mathcal{N}(0, \tau^2)$ for each parameter $\theta_j$. Then:
$$\log p(\theta) = \sum_j \log p(\theta_j) = \sum_j \left(-\frac{1}{2}\log(2\pi\tau^2) - \frac{\theta_j^2}{2\tau^2}\right)$$
Drop the constant:
$$\log p(\theta) \propto -\frac{1}{2\tau^2} \sum_j \theta_j^2 = -\frac{1}{2\tau^2} \|\theta\|_2^2$$
Therefore the MAP objective is:
$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) - \frac{1}{2\tau^2}\|\theta\|_2^2\right]$$
Or equivalently, minimizing the negative:
$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[-\log p(D \mid \theta) + \frac{\lambda}{2}\|\theta\|_2^2\right]$$
where $\lambda = 1/\tau^2$. This is exactly L2-regularized loss minimization (ridge regression, weight decay). The regularization strength $\lambda$ is the inverse of the prior variance: a tighter prior (smaller $\tau^2$) means stronger regularization (larger $\lambda$).
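The equivalence is exact for linear regression, where both the ridge solution and the Gaussian-prior MAP estimate have the closed form $(X^\top X + \lambda I)^{-1} X^\top y$. A sketch with assumed toy data (unit noise variance, `fit_intercept=False` so the two objectives match exactly):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

lam = 2.0  # penalty strength = inverse prior variance (unit noise variance)

# MAP under an iid N(0, 1/lam) prior on the coefficients, in closed form:
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Ridge regression minimizing ||y - Xw||^2 + lam * ||w||^2:
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_map, w_ridge))  # True
```

The same correspondence holds for logistic regression, just without a closed form; the code later in this section demonstrates it empirically.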
Laplace Prior = L1 Regularization
Suppose the prior is $\theta_j \overset{\text{iid}}{\sim} \text{Laplace}(0, b)$ for each parameter. The Laplace density is $p(\theta_j) = \frac{1}{2b}\exp\left(-\frac{|\theta_j|}{b}\right)$. Then:
$$\log p(\theta) \propto -\frac{1}{b}\sum_j |\theta_j| = -\frac{1}{b}\|\theta\|_1$$
The MAP objective becomes:
$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[-\log p(D \mid \theta) + \lambda \|\theta\|_1\right]$$
where $\lambda = 1/b$. This is L1-regularized loss minimization (lasso). The Laplace distribution has heavier tails and a sharper peak at zero compared to the Gaussian, which is why L1 regularization produces sparse solutions — parameters are pulled toward exactly zero.
The Triangle
| Method | Objective | Prior | Bayesian output |
|---|---|---|---|
| MLE | $\arg\max_\theta \log p(D \mid \theta)$ | Flat (uniform) prior | Point estimate only |
| MAP + Gaussian prior | $\arg\max_\theta [\log p(D \mid \theta) - \frac{\lambda}{2}\|\theta\|_2^2]$ | $\mathcal{N}(0, 1/\lambda)$ | Point estimate (= ridge regression) |
| MAP + Laplace prior | $\arg\max_\theta [\log p(D \mid \theta) - \lambda\|\theta\|_1]$ | $\text{Laplace}(0, 1/\lambda)$ | Point estimate (= lasso) |
| Full Bayesian | $p(\theta \mid D) \propto p(D \mid \theta) p(\theta)$ | Any | Full posterior distribution |
Research Insight: MAP estimation is sometimes criticized as "not really Bayesian" because it discards the posterior distribution in favor of a point estimate. This criticism has merit: the MAP estimate loses uncertainty information and can be sensitive to parameterization (the MAP estimate of $\theta$ is not the same as the MAP estimate of $\log\theta$, unlike the posterior mean). However, MAP provides the critical conceptual bridge between Bayesian and frequentist thinking. Understanding that "regularization is a prior" changes how you think about both.
from typing import Callable, Optional
def map_mle_comparison(
n_samples: int = 50,
true_theta: float = 0.3,
n_features: int = 20,
seed: int = 42,
) -> dict:
"""Compare MLE, MAP-L2 (ridge), and MAP-L1 (lasso) for logistic regression.
Demonstrates that MAP with a Gaussian prior = L2 regularization (ridge)
and MAP with a Laplace prior = L1 regularization (lasso).
Args:
n_samples: Number of training examples.
true_theta: Fraction of features with nonzero true coefficients.
n_features: Number of features.
seed: Random seed.
Returns:
Dictionary with coefficient vectors from each method.
"""
from sklearn.linear_model import LogisticRegression
rng = np.random.RandomState(seed)
# True sparse coefficients
n_nonzero = int(true_theta * n_features)
true_coef = np.zeros(n_features)
true_coef[:n_nonzero] = rng.randn(n_nonzero) * 2
# Generate data
X = rng.randn(n_samples, n_features)
logits = X @ true_coef
y = (rng.rand(n_samples) < 1 / (1 + np.exp(-logits))).astype(int)
    # MLE (no regularization: penalty=None)
lr_mle = LogisticRegression(penalty=None, max_iter=5000, solver="lbfgs")
lr_mle.fit(X, y)
# MAP with Gaussian prior (L2 = ridge)
lr_l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=5000, solver="lbfgs")
lr_l2.fit(X, y)
# MAP with Laplace prior (L1 = lasso)
lr_l1 = LogisticRegression(penalty="l1", C=1.0, max_iter=5000, solver="liblinear")
lr_l1.fit(X, y)
return {
"true": true_coef,
"mle": lr_mle.coef_.flatten(),
"map_l2": lr_l2.coef_.flatten(),
"map_l1": lr_l1.coef_.flatten(),
}
results = map_mle_comparison()
fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
titles = ["True", "MLE (flat prior)", "MAP-L2 (Gaussian prior)", "MAP-L1 (Laplace prior)"]
keys = ["true", "mle", "map_l2", "map_l1"]
for ax, title, key in zip(axes, titles, keys):
coef = results[key]
colors = ["steelblue" if abs(c) > 0.01 else "lightgray" for c in coef]
ax.bar(range(len(coef)), coef, color=colors)
ax.set_title(title)
ax.set_xlabel("Feature index")
ax.axhline(0, color="black", linewidth=0.5)
n_nonzero = np.sum(np.abs(coef) > 0.01)
ax.set_ylabel("Coefficient" if key == "true" else "")
ax.annotate(f"Nonzero: {n_nonzero}", xy=(0.95, 0.95), xycoords="axes fraction",
ha="right", va="top", fontsize=9)
plt.tight_layout()
plt.show()
The plot reveals the triangle in action. MLE (flat prior) overfits: it produces large coefficients for noise features because there is no penalty for complexity. MAP with a Gaussian prior (ridge) shrinks all coefficients toward zero but keeps them nonzero. MAP with a Laplace prior (lasso) drives many coefficients exactly to zero, recovering the sparse structure. The "right" prior depends on your belief about sparsity.
20.6 Prior Selection: A Practical Framework
Prior selection is not a philosophical exercise. It is an engineering decision with measurable consequences. Here is a practical framework.
Step 1: Determine What You Know
Ask yourself three questions:
- What is the parameter's physical range? (e.g., probabilities are in $[0, 1]$; variances are positive; temperatures have physical limits)
- What is the plausible magnitude? (e.g., regression coefficients on standardized features are unlikely to exceed 5 in absolute value)
- Do you have domain knowledge or prior studies? (e.g., meta-analyses of treatment effects, historical baselines, physical constraints)
Step 2: Choose the Prior Type
Uninformative priors attempt to encode ignorance. The most common:
- Flat prior: $p(\theta) \propto 1$ over the parameter space. Not invariant under reparameterization — a flat prior on $\theta$ is not flat on $\log\theta$.
- Jeffreys prior: $p(\theta) \propto \sqrt{\det I(\theta)}$ where $I(\theta)$ is the Fisher information matrix. Invariant under reparameterization, but can be improper (does not integrate to 1) and can lead to improper posteriors in multiparameter models.
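For a Bernoulli likelihood, the Fisher information is $I(\theta) = 1/(\theta(1-\theta))$, so the Jeffreys prior is $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, i.e. $\text{Beta}(1/2, 1/2)$, which happens to be proper and conjugate. Comparing the resulting posteriors for the chapter's 7-heads-in-10 data:

```python
from scipy import stats

# Posteriors for 7 heads in 10 flips under flat vs Jeffreys Beta(1/2, 1/2) priors.
k, n = 7, 10
flat = stats.beta(1 + k, 1 + n - k)          # Beta(8, 4)
jeffreys = stats.beta(0.5 + k, 0.5 + n - k)  # Beta(7.5, 3.5)

print(f"Flat prior posterior mean:     {flat.mean():.4f}")  # 0.6667
print(f"Jeffreys prior posterior mean: {jeffreys.mean():.4f}")  # 0.6818
```

The two "uninformative" choices give different answers, which is the point of the misconception box below: no prior is influence-free.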
Common Misconception: "Uninformative means no prior influence." There is no such thing as a truly uninformative prior. Every prior has some influence on the posterior, especially with small samples. A "flat" prior on $[0, 1]$ says that $\theta = 0.99$ is exactly as plausible as $\theta = 0.50$ — which is rarely true. "Uninformative" is a goal, not an achievement.
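The non-invariance of the flat prior is easy to demonstrate by simulation. A minimal sketch (the variable names here are ours, for illustration): draw $\theta$ uniformly on $(0, 1)$ and inspect the implied distribution of $\log\theta$ — it is anything but flat.

```python
import numpy as np

rng = np.random.RandomState(0)

# A "flat" prior on theta over (0, 1)
theta = rng.uniform(0.0, 1.0, size=200_000)

# The implied prior on phi = log(theta) has density e^phi on (-inf, 0):
# mass piles up near 0 and thins out toward -infinity.
phi = np.log(theta)

in_first_unit = np.mean((phi > -1.0) & (phi <= 0.0))    # P = 1 - e^-1 ~ 0.632
in_second_unit = np.mean((phi > -2.0) & (phi <= -1.0))  # P = e^-1 - e^-2 ~ 0.233

print(f"P(-1 < log(theta) <= 0)  ~ {in_first_unit:.3f}")
print(f"P(-2 < log(theta) <= -1) ~ {in_second_unit:.3f}")
```

Being "ignorant" about $\theta$ makes you opinionated about $\log\theta$: the first unit interval below zero carries nearly three times the mass of the second.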
Weakly informative priors are the default recommendation for most applied work (Gelman et al., 2008). They encode minimal but real constraints:
- Regression coefficients on standardized features: $\mathcal{N}(0, 2.5)$ or $\text{Student-}t(3, 0, 2.5)$
- Standard deviations: $\text{Half-Cauchy}(0, 2.5)$ or $\text{Exponential}(1)$
- Correlations: $\text{LKJ}(\eta = 2)$ (slightly favoring smaller correlations)
The goal is to rule out absurd values while remaining broad enough that the data dominate the inference. If your regression coefficient on a standardized feature is 50, something is wrong with your model, not your prior.
Informative priors encode specific knowledge. They are appropriate when:
- You have results from previous studies (meta-analytic priors)
- Physical constraints provide bounds (a drug cannot have a negative dosage)
- Regulatory or business context provides baselines (the historical conversion rate is 3% with a standard deviation of 0.5%)
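As a concrete example of the last point, the historical baseline (a 3% conversion rate with a 0.5% standard deviation) can be converted into a Beta prior by moment matching. This is a sketch using our own helper function, not a fixed recipe:

```python
def beta_from_mean_sd(mean: float, sd: float) -> tuple:
    """Moment-match a Beta(alpha, beta) prior to a target mean and standard deviation.

    Uses mean = a / (a + b) and var = a * b / ((a + b)^2 * (a + b + 1)).
    """
    var = sd**2
    total = mean * (1 - mean) / var - 1  # alpha + beta
    return mean * total, (1 - mean) * total

alpha0, beta0 = beta_from_mean_sd(0.03, 0.005)
print(f"Beta({alpha0:.1f}, {beta0:.1f})")  # prints Beta(34.9, 1128.1)
```

The resulting prior carries roughly 1,163 pseudo-observations, and it is worth stating that explicitly: it will dominate any experiment much smaller than that.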
Step 3: Prior Predictive Checks
After choosing a prior, simulate data from the prior predictive distribution (Section 20.3). Do the simulated data look plausible? If a prior on a conversion rate generates predictions of 95% conversion, the prior is too diffuse. If a prior on temperature change generates negative absolute temperatures, the prior violates physical constraints.
def prior_predictive_check_regression(
prior_scale: float,
n_features: int = 5,
n_simulations: int = 1000,
seed: int = 42,
) -> np.ndarray:
"""Simulate predicted outcomes under a regression model with Normal prior.
Args:
prior_scale: Standard deviation of the Normal(0, prior_scale) prior on coefficients.
n_features: Number of features.
n_simulations: Number of prior samples.
seed: Random seed.
Returns:
Array of predicted outcome ranges (max - min per simulation).
"""
rng = np.random.RandomState(seed)
# Standardized features: N(0, 1)
X = rng.randn(100, n_features)
outcome_ranges = np.empty(n_simulations)
for i in range(n_simulations):
beta = rng.randn(n_features) * prior_scale
intercept = rng.randn() * prior_scale
y = X @ beta + intercept
outcome_ranges[i] = y.max() - y.min()
return outcome_ranges
# Compare prior scales
fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)
for ax, scale in zip(axes, [1.0, 10.0, 100.0]):
ranges = prior_predictive_check_regression(prior_scale=scale)
ax.hist(ranges, bins=40, alpha=0.7, color="steelblue")
ax.set_title(f"Prior scale = {scale}")
ax.set_xlabel("Outcome range (max - min)")
ax.axvline(ranges.mean(), color="red", linestyle="--", label=f"Mean: {ranges.mean():.1f}")
ax.legend()
axes[0].set_ylabel("Frequency")
plt.suptitle("Prior predictive check: How extreme are predicted outcomes?", y=1.02)
plt.tight_layout()
plt.show()
A prior scale of 1.0 on standardized features produces outcome ranges of about 5-10 units — reasonable. A prior scale of 100.0 produces ranges of 500-1000 units — implying that some people have outcomes hundreds of standard deviations from others. The prior predictive check catches this.
Step 4: Prior Sensitivity Analysis
After fitting the model, check whether the posterior is sensitive to reasonable changes in the prior. If changing from $\mathcal{N}(0, 1)$ to $\mathcal{N}(0, 10)$ dramatically changes the posterior, you do not have enough data to overwhelm the prior, and you should report results under multiple priors.
def prior_sensitivity_beta_binomial(
n_heads: int,
n_tails: int,
priors: dict,
) -> None:
"""Visualize posterior sensitivity to different priors.
Args:
n_heads: Observed successes.
n_tails: Observed failures.
priors: Dictionary mapping label to (alpha, beta) tuples.
"""
theta = np.linspace(0.001, 0.999, 500)
fig, ax = plt.subplots(figsize=(8, 5))
for label, (alpha, beta_param) in priors.items():
alpha_post = alpha + n_heads
beta_post = beta_param + n_tails
posterior = stats.beta.pdf(theta, alpha_post, beta_post)
post_mean = alpha_post / (alpha_post + beta_post)
ax.plot(theta, posterior, label=f"{label} → mean={post_mean:.3f}", linewidth=2)
ax.set_xlabel(r"$\theta$")
ax.set_ylabel("Posterior density")
ax.set_title(f"Prior sensitivity: {n_heads} heads, {n_tails} tails")
ax.legend()
plt.tight_layout()
plt.show()
# Small data: prior matters
prior_sensitivity_beta_binomial(3, 2, {
"Uniform Beta(1,1)": (1, 1),
"Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
"Weakly informative Beta(2,2)": (2, 2),
"Informative Beta(10,10)": (10, 10),
})
# Large data: prior washes out
prior_sensitivity_beta_binomial(70, 30, {
"Uniform Beta(1,1)": (1, 1),
"Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
"Weakly informative Beta(2,2)": (2, 2),
"Informative Beta(10,10)": (10, 10),
})
With 5 observations (3 heads, 2 tails), the choice of prior noticeably shifts the posterior mean, from 0.58 under the Jeffreys prior to 0.52 under the informative Beta(10,10). With 100 observations (70 heads, 30 tails), the posteriors under all four priors are nearly identical. This is the Bernstein–von Mises theorem (the Bayesian central limit theorem) in action: as data accumulate, the posterior concentrates around the MLE regardless of any fixed, non-degenerate prior.
Production Reality: In industry, "what prior should I use?" is rarely the bottleneck question. If you have enough data for the posterior to be prior-insensitive, any weakly informative prior is fine. If you have so little data that the prior dominates, you probably do not have enough data to train any model — and the right response is often to collect more data, not to fight about priors.
20.7 Bayesian Updating and Sequential Inference
One of the most powerful features of Bayesian inference is that updating is iterative. Today's posterior becomes tomorrow's prior.
Suppose you observe data in batches $D_1, D_2, \ldots, D_T$. You can either process all data at once:
$$p(\theta \mid D_1, D_2, \ldots, D_T) \propto p(D_1, D_2, \ldots, D_T \mid \theta) \, p(\theta)$$
Or update sequentially:
$$p(\theta \mid D_1) \propto p(D_1 \mid \theta) \, p(\theta)$$ $$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta) \, p(\theta \mid D_1)$$ $$\vdots$$ $$p(\theta \mid D_1, \ldots, D_T) \propto p(D_T \mid \theta) \, p(\theta \mid D_1, \ldots, D_{T-1})$$
For conjugate models, both approaches give the same posterior (assuming observations are conditionally independent given $\theta$). Sequential updating is computationally efficient: you only need the current posterior parameters and the new data, not the entire history.
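The batch/sequential equivalence is worth verifying numerically. A sketch with the Beta-Bernoulli model (the toy data here is our own):

```python
import numpy as np

rng = np.random.RandomState(7)
data = rng.binomial(1, 0.3, size=300)  # 300 Bernoulli(0.3) observations

# Batch: one update with all the data at once
alpha_batch = 1.0 + data.sum()
beta_batch = 1.0 + (len(data) - data.sum())

# Sequential: fold in one observation at a time
alpha_seq, beta_seq = 1.0, 1.0
for x in data:
    alpha_seq += x
    beta_seq += 1 - x

assert (alpha_seq, beta_seq) == (alpha_batch, beta_batch)
print(f"Both give Beta({alpha_batch:.0f}, {beta_batch:.0f})")
```

Because the update only adds counts, the order of the observations does not matter either; shuffling `data` before the loop yields the same posterior.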
This is why Bayesian methods are natural for online and streaming systems. A recommendation engine that maintains a Beta posterior per user-category pair can update in constant time and memory per interaction.
from dataclasses import dataclass, field
from typing import List, Tuple
@dataclass
class BetaBayesianTracker:
"""Tracks a Bernoulli rate parameter using Beta-Binomial conjugacy.
Supports sequential updating and maintains the full posterior history
for visualization.
Attributes:
alpha: Current alpha parameter of the Beta posterior.
beta_param: Current beta parameter of the Beta posterior.
history: List of (alpha, beta_param) tuples after each update.
"""
alpha: float = 1.0
beta_param: float = 1.0
history: List[Tuple[float, float]] = field(default_factory=list)
def __post_init__(self) -> None:
self.history.append((self.alpha, self.beta_param))
@property
def mean(self) -> float:
"""Posterior mean."""
return self.alpha / (self.alpha + self.beta_param)
@property
def variance(self) -> float:
"""Posterior variance."""
total = self.alpha + self.beta_param
return (self.alpha * self.beta_param) / (total**2 * (total + 1))
@property
def n_observations(self) -> int:
"""Effective number of observations (excluding prior pseudo-counts)."""
a0, b0 = self.history[0]
return int((self.alpha - a0) + (self.beta_param - b0))
def update(self, successes: int, failures: int) -> None:
"""Update posterior with new observations.
Args:
successes: Number of new successes.
failures: Number of new failures.
"""
self.alpha += successes
self.beta_param += failures
self.history.append((self.alpha, self.beta_param))
def credible_interval(self, level: float = 0.95) -> Tuple[float, float]:
"""Compute equal-tailed credible interval.
Args:
level: Credible level (e.g., 0.95 for 95%).
Returns:
(lower, upper) bounds.
"""
tail = (1 - level) / 2
dist = stats.beta(self.alpha, self.beta_param)
return (dist.ppf(tail), dist.ppf(1 - tail))
def plot_history(self) -> None:
"""Plot posterior evolution over time."""
means = [a / (a + b) for a, b in self.history]
cis_low = []
cis_high = []
for a, b in self.history:
dist = stats.beta(a, b)
cis_low.append(dist.ppf(0.025))
cis_high.append(dist.ppf(0.975))
steps = range(len(self.history))
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(steps, means, "o-", color="steelblue", label="Posterior mean")
ax.fill_between(steps, cis_low, cis_high, alpha=0.2, color="steelblue",
label="95% credible interval")
ax.set_xlabel("Update step")
ax.set_ylabel(r"$\theta$")
ax.set_title("Sequential Bayesian updating")
ax.legend()
plt.tight_layout()
plt.show()
# Simulate sequential updates for a user's click rate
rng = np.random.RandomState(42)
true_rate = 0.35
tracker = BetaBayesianTracker(alpha=1.0, beta_param=1.0) # Uniform prior
for batch in range(20):
batch_size = rng.randint(5, 20)
clicks = rng.binomial(batch_size, true_rate)
tracker.update(successes=clicks, failures=batch_size - clicks)
print(f"After {tracker.n_observations} observations:")
print(f" True rate: {true_rate:.3f}")
print(f" Posterior mean: {tracker.mean:.3f}")
print(f" 95% CI: [{tracker.credible_interval()[0]:.3f}, "
f"{tracker.credible_interval()[1]:.3f}]")
tracker.plot_history()
After 236 observations:
True rate: 0.350
Posterior mean: 0.345
95% CI: [0.287, 0.405]
The credible interval tightens with each batch of data, reflecting the accumulation of evidence. After 236 observations, the posterior mean is within 0.005 of the true rate, and the 95% credible interval is about $\pm 0.06$ around the mean.
20.8 Credible Intervals vs. Confidence Intervals
The two interval types are frequently confused. They answer different questions and have different interpretations.
Frequentist confidence interval (CI): "If I repeated this experiment infinitely many times and computed a 95% CI each time, 95% of those intervals would contain the true parameter." The specific interval $[0.42, 0.98]$ either contains $\theta$ or it does not — the probability is 0 or 1. The "95%" refers to the procedure, not to this interval.
Bayesian credible interval: "Given the data I observed, there is a 95% probability that $\theta$ lies in $[0.39, 0.90]$." This is a probability statement about $\theta$ given the data — exactly what most practitioners want.
The difference matters in practice when:
- Communicating to non-statisticians. "There's a 95% chance the conversion rate is between 3% and 7%" (Bayesian interpretation) is easier to understand and act on than "If we repeated this experiment many times, 95% of the resulting intervals would contain the true rate."
- Small samples. Frequentist CIs can include impossible values (e.g., negative probabilities from the Wald interval). Bayesian credible intervals respect the prior's support automatically.
- Decision-making. "What is the probability that the treatment effect is positive?" has a direct Bayesian answer: $P(\theta > 0 \mid D) = \int_0^\infty p(\theta \mid D) \, d\theta$. The frequentist framework does not assign probabilities to parameter values.
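The third point, the direct probability statement, is a one-liner once you have a conjugate posterior. A sketch using the 3-successes-in-10-trials example (uniform prior, so the posterior is Beta(4, 8)):

```python
from scipy import stats

# Posterior after 3 successes / 7 failures with a Beta(1, 1) prior
posterior = stats.beta(4, 8)

# P(theta > 0.5 | D): the survival function of the posterior
p_above_half = posterior.sf(0.5)
print(f"P(theta > 0.5 | D) = {p_above_half:.3f}")  # prints 0.113
```

No repeated-sampling thought experiment is needed: the posterior itself answers the decision question directly.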
def compare_intervals(
alpha_post: float,
beta_post: float,
n: int,
k: int,
) -> None:
"""Compare frequentist CI and Bayesian credible interval for a proportion.
Args:
alpha_post: Posterior alpha.
beta_post: Posterior beta.
n: Number of trials.
k: Number of successes.
"""
# Frequentist: Wald interval
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald_ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
# Frequentist: Wilson interval (better for small n)
z = 1.96
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
margin = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
wilson_ci = (center - margin, center + margin)
# Bayesian: equal-tailed credible interval
dist = stats.beta(alpha_post, beta_post)
bayes_ci = (dist.ppf(0.025), dist.ppf(0.975))
# Bayesian: HPDI
from scipy.optimize import minimize_scalar
def width(lower_tail: float) -> float:
return dist.ppf(lower_tail + 0.95) - dist.ppf(lower_tail)
opt = minimize_scalar(width, bounds=(0, 0.05), method="bounded")
hpdi = (dist.ppf(opt.x), dist.ppf(opt.x + 0.95))
print(f"Data: {k}/{n} successes")
print(f" Wald CI (95%): [{wald_ci[0]:.4f}, {wald_ci[1]:.4f}]")
print(f" Wilson CI (95%): [{wilson_ci[0]:.4f}, {wilson_ci[1]:.4f}]")
print(f" Bayesian ET (95%): [{bayes_ci[0]:.4f}, {bayes_ci[1]:.4f}]")
print(f" Bayesian HPDI: [{hpdi[0]:.4f}, {hpdi[1]:.4f}]")
# Small sample: differences are visible
compare_intervals(alpha_post=4, beta_post=8, n=10, k=3)
print()
# Large sample: intervals converge
compare_intervals(alpha_post=301, beta_post=701, n=1000, k=300)
Data: 3/10 successes
Wald CI (95%): [0.0161, 0.5839]
Wilson CI (95%): [0.1077, 0.6033]
Bayesian ET (95%): [0.1195, 0.5765]
Bayesian HPDI: [0.1068, 0.5608]
Data: 300/1000 successes
Wald CI (95%): [0.2716, 0.3284]
Wilson CI (95%): [0.2719, 0.3289]
Bayesian ET (95%): [0.2716, 0.3285]
Bayesian HPDI: [0.2716, 0.3285]
With 10 observations, the Wald CI extends to 0.016 — nearly zero — while the Bayesian interval's lower bound is 0.12, because the Beta posterior appropriately handles the uncertainty of small-sample proportions. With 1000 observations, all four intervals agree to within rounding error. The Bayesian-frequentist distinction vanishes with sufficient data.
20.9 Bayesian Model Comparison
Beyond parameter estimation, Bayesian inference provides a principled framework for comparing models.
Bayes Factors
Given two models $M_1$ and $M_2$ and observed data $D$, the Bayes factor is:
$$\text{BF}_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}$$
where $p(D \mid M_k) = \int p(D \mid \theta_k, M_k) \, p(\theta_k \mid M_k) \, d\theta_k$ is the marginal likelihood (evidence) under model $M_k$.
The Bayes factor has a natural interpretation as an evidence ratio: $\text{BF}_{12} = 10$ means the data are 10 times more probable under $M_1$ than $M_2$.
| $\text{BF}_{12}$ | Evidence strength |
|---|---|
| 1 to 3 | Barely worth mentioning |
| 3 to 10 | Moderate |
| 10 to 30 | Strong |
| 30 to 100 | Very strong |
| > 100 | Decisive |
(Table adapted from Kass and Raftery, 1995)
The Automatic Occam's Razor
The marginal likelihood has a built-in complexity penalty. Consider a simple model $M_1$ that predicts data in a narrow range, and a complex model $M_2$ that can predict data in a wide range. If the observed data falls in the narrow range:
- $M_1$ assigns high probability to the observed data (concentrated prediction)
- $M_2$ assigns moderate probability (its probability is spread over a larger space)
$M_1$ wins the Bayes factor comparison, even though $M_2$ can "explain" the data. The marginal likelihood automatically penalizes models that spread their predictions too thinly — this is the Bayesian Occam's razor.
def compute_bayes_factor_beta_binomial(
k: int,
n: int,
model1_prior: Tuple[float, float],
model2_prior: Tuple[float, float],
) -> float:
"""Compute Bayes factor comparing two Beta-Binomial models.
The marginal likelihood for the Beta-Binomial model is:
p(D | M) = C(n,k) * B(alpha + k, beta + n - k) / B(alpha, beta)
where B is the Beta function.
Args:
k: Number of successes.
n: Number of trials.
model1_prior: (alpha, beta) for model 1's Beta prior.
model2_prior: (alpha, beta) for model 2's Beta prior.
Returns:
Bayes factor BF_12 (evidence for model 1 over model 2).
"""
    from scipy.special import betaln  # C(n, k) appears in both marginals and cancels in the ratio

    def log_marginal(alpha: float, beta_param: float) -> float:
        """Log marginal likelihood for Beta-Binomial, up to the C(n, k) constant."""
        return (
            betaln(alpha + k, beta_param + n - k)
            - betaln(alpha, beta_param)
        )
log_ml1 = log_marginal(*model1_prior)
log_ml2 = log_marginal(*model2_prior)
return np.exp(log_ml1 - log_ml2)
# Compare: is the coin fair (M1: tight prior around 0.5)
# vs. biased (M2: broad prior)?
bf = compute_bayes_factor_beta_binomial(
k=53, n=100,
model1_prior=(50, 50), # M1: strongly believes fair
model2_prior=(1, 1), # M2: any bias is possible
)
print(f"BF (fair vs. any): {bf:.2f}")
print(f"Interpretation: {'Evidence for fair coin' if bf > 1 else 'Evidence against fair coin'}")
bf2 = compute_bayes_factor_beta_binomial(
k=73, n=100,
model1_prior=(50, 50),
model2_prior=(1, 1),
)
print(f"\nBF (fair vs. any) with 73/100 heads: {bf2:.4f}")
print(f"Interpretation: {'Evidence for fair coin' if bf2 > 1 else 'Strong evidence against fair coin'}")
BF (fair vs. any): 5.43
Interpretation: Evidence for fair coin
BF (fair vs. any) with 73/100 heads: 0.0003
Interpretation: Strong evidence against fair coin
With 53/100 heads, the Bayes factor is 5.43: moderate evidence for the fair-coin model. With 73/100 heads, the Bayes factor plummets to 0.0003 (equivalently, $\text{BF}_{21} \approx 3300$): decisive evidence that the coin is not fair.
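A Bayes factor converts to a posterior model probability once you fix the prior odds. Under equal prior odds (a 50/50 split between the two models), the conversion is a short sketch; the 5.43 and 0.0003 values are the chapter's results above:

```python
def posterior_model_prob(bf_12: float, prior_odds: float = 1.0) -> float:
    """P(M1 | D) from the Bayes factor BF_12 and prior odds P(M1) / P(M2)."""
    posterior_odds = bf_12 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

print(f"{posterior_model_prob(5.43):.3f}")    # prints 0.844
print(f"{posterior_model_prob(0.0003):.6f}")  # prints 0.000300
```

This makes the asymmetry vivid: "moderate evidence for fair" leaves a 16% chance the biased model is right, while the 73/100 result leaves essentially none.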
Production Reality: Bayes factors are theoretically elegant but computationally challenging for complex models. Computing the marginal likelihood requires integrating over the entire parameter space — a high-dimensional integral that is intractable for most real-world models. In practice, approximations (bridge sampling, WAIC, LOO-CV via PSIS-LOO) are used instead. Chapter 21 covers these practical model comparison tools.
20.10 When Bayesian Methods Add Value
We promised a balanced answer to "when should I use Bayesian methods?" Here it is, structured as a decision framework.
Use Bayesian Methods When:
-
You have genuine prior knowledge and limited data. A pharmaceutical company testing a new formulation of an existing drug has decades of prior data on the compound. Encoding this as an informative prior can yield useful posteriors from a 50-patient pilot study that would be uninformative under a frequentist analysis.
-
You need calibrated uncertainty on parameters (not just predictions). "The treatment effect is between 0.5 and 2.3 with 95% probability" is a direct Bayesian output. Frequentist confidence intervals require careful interpretation and do not answer this question directly.
-
The problem has hierarchical structure. Estimating click rates for 10,000 content categories, each with different sample sizes? Bayesian hierarchical models provide partial pooling: categories with few observations borrow strength from the overall distribution. This is the "James-Stein estimator on steroids."
-
You need sequential updating. Bayesian inference is naturally sequential: today's posterior is tomorrow's prior. This is ideal for online systems (recommendation engines, ad targeting, dynamic pricing) where data arrives continuously and decisions must be made in real time.
-
You need to incorporate constraints. Physical constraints (positive quantities), domain constraints (monotonic dose-response), or regulatory constraints (the prior must reflect existing evidence) are naturally encoded as prior distributions.
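Point 3, partial pooling, can be sketched in a few lines of empirical Bayes: shrink each category's observed rate toward a shared prior, with the amount of shrinkage controlled by that category's sample size. The counts below are invented for illustration:

```python
# Shared prior fit from the population (hypothetical values)
alpha0, beta0 = 8.0, 12.0  # population engagement ~40%

# (successes, trials) per category: sample sizes vary wildly
counts = {"cat_a": (3, 4), "cat_b": (45, 100), "cat_c": (400, 1000)}

for name, (k, n) in counts.items():
    raw = k / n
    pooled = (alpha0 + k) / (alpha0 + beta0 + n)  # Beta posterior mean
    print(f"{name}: raw={raw:.3f}, partially pooled={pooled:.3f}")
```

The 4-trial category is pulled most of the way from its raw 0.75 back toward the population rate of 0.40, while the 1,000-trial category barely moves.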
Use Frequentist Methods When:
-
You have abundant data and the prior washes out. With 10 million observations, the likelihood dominates and any reasonable prior contributes negligibly to the posterior. The prior adds modeling and computational overhead without changing the answer.
-
You need only prediction, not parameter inference. If the goal is $\hat{y}$, not $p(\theta \mid D)$, frequentist methods (including deep learning) are typically faster and scale better.
-
Computational cost matters and the model is complex. MCMC for a hierarchical model with 100K parameters can take hours. A frequentist alternative (mixed-effects models, regularized MLE) may give a sufficiently close answer in seconds.
-
Regulatory or organizational norms require frequentist reporting. Clinical trials report p-values and confidence intervals because the FDA requires them. You may conduct a Bayesian analysis alongside, but the primary analysis must follow the pre-registered frequentist protocol.
def bayesian_value_assessment(
n_data: int,
has_prior_knowledge: bool,
needs_uncertainty: bool,
hierarchical_structure: bool,
sequential_updates: bool,
computational_budget_hours: float,
model_complexity: str = "moderate",
) -> str:
"""Simple decision framework for Bayesian vs. frequentist.
Args:
n_data: Number of observations.
has_prior_knowledge: Whether domain expertise exists.
needs_uncertainty: Whether calibrated parameter uncertainty is required.
hierarchical_structure: Whether data has grouped/hierarchical structure.
sequential_updates: Whether real-time updating is needed.
computational_budget_hours: Available compute time.
model_complexity: "simple", "moderate", or "complex".
Returns:
Recommendation string.
"""
score = 0
# Data size: Bayesian value decreases with more data
if n_data < 100:
score += 3
elif n_data < 1000:
score += 2
elif n_data < 10000:
score += 1
# else: no Bayesian advantage from data size
# Prior knowledge
if has_prior_knowledge:
score += 2
# Uncertainty requirements
if needs_uncertainty:
score += 2
# Hierarchical structure
if hierarchical_structure:
score += 2
# Sequential updates
if sequential_updates:
score += 1
# Computational feasibility penalty
complexity_cost = {"simple": 0, "moderate": -1, "complex": -2}[model_complexity]
if computational_budget_hours < 0.1:
complexity_cost -= 1
score += complexity_cost
if score >= 5:
return "STRONG BAYESIAN CASE: The problem characteristics favor Bayesian methods."
elif score >= 3:
return "MODERATE BAYESIAN CASE: Bayesian methods may add value; run both and compare."
else:
return "WEAK BAYESIAN CASE: Frequentist or ML methods are likely sufficient."
# StreamRec cold-start users: strong Bayesian case
print("StreamRec cold-start:")
print(bayesian_value_assessment(
n_data=5, has_prior_knowledge=True, needs_uncertainty=True,
hierarchical_structure=True, sequential_updates=True,
computational_budget_hours=0.001, model_complexity="simple"
))
# Large-scale image classification: weak Bayesian case
print("\nImage classification:")
print(bayesian_value_assessment(
n_data=1000000, has_prior_knowledge=False, needs_uncertainty=False,
hierarchical_structure=False, sequential_updates=False,
computational_budget_hours=24.0, model_complexity="complex"
))
StreamRec cold-start:
STRONG BAYESIAN CASE: The problem characteristics favor Bayesian methods.
Image classification:
WEAK BAYESIAN CASE: Frequentist or ML methods are likely sufficient.
20.11 The StreamRec Application: Bayesian User Preferences
We now apply the full Bayesian framework to the progressive project: estimating user preferences for StreamRec's content categories.
The Model
StreamRec has $C$ content categories. For each user $u$ and category $c$, we model the probability that user $u$ engages with (clicks/completes) a recommended item from category $c$ as $\theta_{u,c}$.
Prior: $\theta_{u,c} \sim \text{Beta}(\alpha_c, \beta_c)$ where $(\alpha_c, \beta_c)$ are category-specific hyperparameters estimated from the population (the empirical Bayes or hierarchical Bayesian approach).
Likelihood: Each recommendation of a category-$c$ item to user $u$ is a Bernoulli trial, so after $n_{u,c}$ recommendations with $k_{u,c}$ engagements, $k_{u,c} \mid \theta_{u,c} \sim \text{Binomial}(n_{u,c}, \theta_{u,c})$.
Posterior: $\theta_{u,c} \mid D \sim \text{Beta}(\alpha_c + k_{u,c}, \beta_c + n_{u,c} - k_{u,c})$
This model handles the cold-start problem naturally. A new user with $n_{u,c} = 0$ observations has the population prior as their posterior — reasonable default behavior. An established user with $n_{u,c} = 500$ has a posterior dominated by their personal data. The transition from "population average" to "personalized" happens automatically as data accumulates.
from dataclasses import dataclass, field
from typing import Dict, Tuple
@dataclass
class UserPreferenceModel:
"""Bayesian user preference model using Beta-Binomial conjugacy.
Maintains a Beta posterior for each (user, category) pair,
enabling real-time updating and uncertainty-aware recommendations.
Attributes:
category_priors: Dictionary mapping category name to (alpha, beta) prior.
user_posteriors: Nested dict: user_posteriors[user_id][category] = (alpha, beta).
"""
category_priors: Dict[str, Tuple[float, float]]
user_posteriors: Dict[str, Dict[str, Tuple[float, float]]] = field(
default_factory=dict
)
def get_posterior(self, user_id: str, category: str) -> Tuple[float, float]:
"""Get the current posterior for a (user, category) pair.
Args:
user_id: User identifier.
category: Content category.
Returns:
(alpha, beta) of the Beta posterior.
"""
if user_id in self.user_posteriors and category in self.user_posteriors[user_id]:
return self.user_posteriors[user_id][category]
return self.category_priors[category]
def update(self, user_id: str, category: str, engaged: bool) -> None:
"""Update posterior after a single interaction.
Args:
user_id: User identifier.
category: Content category.
engaged: Whether the user engaged with the item.
"""
if user_id not in self.user_posteriors:
self.user_posteriors[user_id] = {}
alpha, beta_param = self.get_posterior(user_id, category)
if engaged:
alpha += 1
else:
beta_param += 1
self.user_posteriors[user_id][category] = (alpha, beta_param)
def predict_mean(self, user_id: str, category: str) -> float:
"""Return the posterior mean (expected engagement probability).
Args:
user_id: User identifier.
category: Content category.
Returns:
Posterior mean engagement probability.
"""
alpha, beta_param = self.get_posterior(user_id, category)
return alpha / (alpha + beta_param)
def predict_uncertainty(self, user_id: str, category: str) -> float:
"""Return the posterior standard deviation (uncertainty).
Args:
user_id: User identifier.
category: Content category.
Returns:
Posterior standard deviation.
"""
alpha, beta_param = self.get_posterior(user_id, category)
total = alpha + beta_param
return np.sqrt(alpha * beta_param / (total**2 * (total + 1)))
def sample_thompson(self, user_id: str, categories: list) -> str:
"""Thompson sampling: select category by sampling from posteriors.
This naturally balances exploration (uncertain categories) with
exploitation (high-mean categories).
Args:
user_id: User identifier.
categories: List of candidate categories.
Returns:
Selected category name.
"""
best_category = None
best_sample = -1.0
for cat in categories:
alpha, beta_param = self.get_posterior(user_id, cat)
sample = np.random.beta(alpha, beta_param)
if sample > best_sample:
best_sample = sample
best_category = cat
return best_category
def rank_categories(self, user_id: str, categories: list) -> list:
"""Rank categories by posterior mean (greedy exploitation).
Args:
user_id: User identifier.
categories: List of candidate categories.
Returns:
List of (category, mean, uncertainty) tuples, sorted by mean descending.
"""
ranked = []
for cat in categories:
mean = self.predict_mean(user_id, cat)
unc = self.predict_uncertainty(user_id, cat)
ranked.append((cat, mean, unc))
ranked.sort(key=lambda x: x[1], reverse=True)
return ranked
# Build the model with population-level priors
# Prior calibration: estimate from overall category engagement rates
category_priors = {
"drama": (8.0, 12.0), # Population engagement ~40%
"comedy": (10.0, 10.0), # Population engagement ~50%
"documentary": (6.0, 14.0), # Population engagement ~30%
"action": (9.0, 11.0), # Population engagement ~45%
"sci_fi": (5.0, 15.0), # Population engagement ~25%
"horror": (4.0, 16.0), # Population engagement ~20%
}
model = UserPreferenceModel(category_priors=category_priors)
categories = list(category_priors.keys())
# Simulate a new user vs. an established user
rng = np.random.RandomState(42)
# New user: no history
print("=== New user (cold start) ===")
for cat, mean, unc in model.rank_categories("new_user", categories):
print(f" {cat:>15s}: mean={mean:.3f}, uncertainty={unc:.3f}")
# Established user: 200 interactions
# This user secretly loves sci-fi and hates drama
true_prefs = {
"drama": 0.15, "comedy": 0.50, "documentary": 0.40,
"action": 0.45, "sci_fi": 0.80, "horror": 0.10,
}
for _ in range(200):
cat = rng.choice(categories)
engaged = rng.rand() < true_prefs[cat]
model.update("user_42", cat, engaged)
print("\n=== Established user (200 interactions) ===")
for cat, mean, unc in model.rank_categories("user_42", categories):
print(f" {cat:>15s}: mean={mean:.3f}, uncertainty={unc:.3f}")
true = true_prefs[cat]
print(f" (true={true:.3f})")
=== New user (cold start) ===
comedy: mean=0.500, uncertainty=0.109
action: mean=0.450, uncertainty=0.108
drama: mean=0.400, uncertainty=0.107
documentary: mean=0.300, uncertainty=0.100
sci_fi: mean=0.250, uncertainty=0.094
horror: mean=0.200, uncertainty=0.087
=== Established user (200 interactions) ===
sci_fi: mean=0.720, uncertainty=0.058
(true=0.800)
comedy: mean=0.490, uncertainty=0.063
(true=0.500)
action: mean=0.432, uncertainty=0.062
(true=0.450)
documentary: mean=0.346, uncertainty=0.064
(true=0.400)
drama: mean=0.196, uncertainty=0.051
(true=0.150)
horror: mean=0.159, uncertainty=0.053
(true=0.100)
Three key observations:
- Cold-start users get the population average — a sensible default with high uncertainty.
- Established users have posteriors that reflect their personal behavior. Sci-fi jumped from rank 5 (population prior: 25%) to rank 1 (posterior: 72%), matching this user's true preference of 80%.
- Uncertainty is much higher for the new user (~0.10) than the established user (~0.06). This uncertainty can drive exploration: Thompson sampling will explore more categories for uncertain users and exploit known preferences for established users.
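The exploration effect in the last point can be quantified. Consider two hypothetical categories with similar means but very different certainty; Thompson sampling picks the uncertain one a substantial fraction of the time, which is exactly the exploration you want:

```python
import numpy as np

rng = np.random.RandomState(0)
n_draws = 100_000

# Established category: mean 0.60, tight posterior
established = rng.beta(60, 40, size=n_draws)
# Barely observed category: mean 0.50, wide posterior
uncertain = rng.beta(2, 2, size=n_draws)

# Fraction of rounds in which Thompson sampling picks the uncertain category
frac_explore = np.mean(uncertain > established)
print(f"Uncertain category chosen {frac_explore:.1%} of the time")  # roughly a third
```

A greedy rule would pick the established category every round and never learn the uncertain category's true rate; Thompson sampling keeps probing it until its posterior tightens.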
20.12 The Pharma Application: Bayesian Treatment Effect Reasoning
MediCore Pharmaceuticals is evaluating Drug X for reducing blood pressure in patients with mild hypertension. They have two sources of information:
- Clinical knowledge (prior): Three previous studies of similar ACE inhibitors suggest a blood pressure reduction of 8-12 mmHg, with a pooled estimate of 10 mmHg and standard error of 2 mmHg.
- Observational data (likelihood): An observational study of 150 patients who received Drug X shows a mean reduction of 7.2 mmHg with a standard error of 1.5 mmHg.
Using Normal-Normal conjugacy:
```python
# MediCore: Bayesian treatment effect estimation
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

prior_mu = 10.0    # Prior mean from meta-analysis (mmHg reduction)
prior_sigma = 2.0  # Prior uncertainty
data_mean = 7.2    # Observational study mean
data_se = 1.5      # Standard error (sigma / sqrt(n))

# Normal-Normal conjugate update
prior_precision = 1 / prior_sigma**2
data_precision = 1 / data_se**2
post_precision = prior_precision + data_precision
post_sigma = 1 / np.sqrt(post_precision)
post_mu = (prior_precision * prior_mu + data_precision * data_mean) / post_precision

print("MediCore Drug X — Bayesian Treatment Effect Analysis")
print("=" * 55)
print(f"Prior (meta-analysis): {prior_mu:.1f} ± {prior_sigma:.1f} mmHg")
print(f"Data (observational): {data_mean:.1f} ± {data_se:.1f} mmHg")
print(f"Posterior: {post_mu:.2f} ± {post_sigma:.2f} mmHg")
print(f"95% credible interval: [{post_mu - 1.96*post_sigma:.2f}, "
      f"{post_mu + 1.96*post_sigma:.2f}]")

# Key clinical question: P(effect > 5 mmHg | data)?
prob_clinically_significant = 1 - stats.norm.cdf(5.0, loc=post_mu, scale=post_sigma)
print(f"\nP(reduction > 5 mmHg | data) = {prob_clinically_significant:.3f}")

# What if we had no prior? (flat prior = MLE)
print(f"\nFrequentist estimate: {data_mean:.1f} ± {data_se:.1f} mmHg")
print(f"Frequentist 95% CI: [{data_mean - 1.96*data_se:.2f}, "
      f"{data_mean + 1.96*data_se:.2f}]")
prob_freq = 1 - stats.norm.cdf(5.0, loc=data_mean, scale=data_se)
print(f"P-value for H0: effect <= 5: {1 - prob_freq:.4f} (frequentist)")

# Visualize prior, likelihood, and posterior
theta = np.linspace(0, 20, 500)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(theta, stats.norm.pdf(theta, prior_mu, prior_sigma), "--",
        label=f"Prior: N({prior_mu}, {prior_sigma}²)", linewidth=2)
ax.plot(theta, stats.norm.pdf(theta, data_mean, data_se), ":",
        label=f"Likelihood: N({data_mean}, {data_se}²)", linewidth=2)
ax.plot(theta, stats.norm.pdf(theta, post_mu, post_sigma), "-",
        label=f"Posterior: N({post_mu:.1f}, {post_sigma:.2f}²)", linewidth=2)
ax.axvline(5.0, color="gray", linestyle="-.", alpha=0.5, label="Clinical threshold (5 mmHg)")
ax.fill_between(theta, stats.norm.pdf(theta, post_mu, post_sigma),
                where=theta > 5.0, alpha=0.2, color="green")
ax.set_xlabel("Blood pressure reduction (mmHg)")
ax.set_ylabel("Density")
ax.set_title("MediCore Drug X: Bayesian treatment effect estimation")
ax.legend()
plt.tight_layout()
plt.show()
```
```
MediCore Drug X — Bayesian Treatment Effect Analysis
=======================================================
Prior (meta-analysis): 10.0 ± 2.0 mmHg
Data (observational): 7.2 ± 1.5 mmHg
Posterior: 8.21 ± 1.20 mmHg
95% credible interval: [5.86, 10.56]

P(reduction > 5 mmHg | data) = 0.996

Frequentist estimate: 7.2 ± 1.5 mmHg
Frequentist 95% CI: [4.26, 10.14]
P-value for H0: effect <= 5: 0.0712 (frequentist)
```
This illustrates a case where the Bayesian and frequentist analyses lead to different practical conclusions. The frequentist 95% CI includes values below 5 mmHg (the clinical significance threshold), and the one-sided p-value of 0.07 fails to reject at $\alpha = 0.05$. The Bayesian analysis, incorporating the meta-analytic prior, produces a tighter credible interval entirely above 5 mmHg, and the posterior probability of clinical significance is 99.6%.
Neither analysis is "right" in isolation. The frequentist analysis is conservative: it refuses to use prior knowledge, which means it requires more data. The Bayesian analysis is informative: it incorporates prior knowledge, which means its conclusion is only as trustworthy as the prior. If the previous studies used a different drug mechanism or different patient population, the prior may be misleading. This is why prior sensitivity analysis (Section 20.6) matters.
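That sensitivity can be checked directly. The sketch below reruns the Normal-Normal update under two alternative priors; the "skeptical" and "weakly informative" choices are hypothetical values for illustration, not estimates from the MediCore studies:

```python
import numpy as np
from scipy import stats

# Prior sensitivity check (sketch): the alternative priors below are
# hypothetical choices, not values taken from the MediCore studies.
data_mean, data_se = 7.2, 1.5
data_precision = 1 / data_se**2

priors = {
    "Informative (meta-analysis)":       (10.0, 2.0),
    "Skeptical (centered at no effect)":  (0.0, 5.0),
    "Weakly informative":                 (5.0, 10.0),
}

for name, (mu0, sigma0) in priors.items():
    prior_precision = 1 / sigma0**2
    post_precision = prior_precision + data_precision
    post_sigma = 1 / np.sqrt(post_precision)
    post_mu = (prior_precision * mu0 + data_precision * data_mean) / post_precision
    p_sig = 1 - stats.norm.cdf(5.0, loc=post_mu, scale=post_sigma)
    print(f"{name:36s} posterior {post_mu:5.2f} ± {post_sigma:.2f}, "
          f"P(effect > 5 mmHg) = {p_sig:.3f}")
```

If the clinical conclusion survives even a skeptical prior, it is robust; here the skeptical prior still puts most of the posterior mass above 5 mmHg, but the probability of clinical significance drops well below near-certainty, which is exactly the kind of dependence a prior sensitivity analysis is meant to expose.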
20.13 Looking Ahead: When Conjugacy Fails
This chapter covered the cases where Bayesian inference has closed-form solutions. Most real-world models are not so cooperative. Logistic regression, hierarchical models with many groups, Gaussian mixture models, neural networks — none of these have conjugate posteriors.
Chapter 21 picks up exactly where this chapter ends: how to compute posteriors when the math does not work out. The tools are:
- Markov chain Monte Carlo (MCMC): Sample from the posterior by constructing a Markov chain whose stationary distribution is the posterior. Modern algorithms (Hamiltonian Monte Carlo, NUTS) are efficient enough for models with thousands of parameters.
- Variational inference: Approximate the posterior with a simpler distribution by minimizing the KL divergence. Faster than MCMC but introduces approximation error.
- Hierarchical models: The killer application of Bayesian inference — estimating parameters for many groups simultaneously, with partial pooling through shared priors.
The mathematical and conceptual foundation from this chapter — priors, likelihoods, posteriors, conjugacy, the MAP connection, sequential updating — carries through unchanged. The only difference is that the integral $\int p(D \mid \theta) p(\theta) \, d\theta$ must be approximated rather than computed in closed form.
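As a preview of those tools, here is a minimal random-walk Metropolis sampler targeting the MediCore posterior from Section 20.12. This is an illustrative sketch (fixed proposal scale, simple burn-in), not production MCMC; because the model is conjugate, the samples can be checked against the exact answer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def log_post(theta):
    # log prior N(10, 2^2) + log likelihood of the observed mean 7.2 (se 1.5),
    # up to an additive constant (the normalizing constant cancels in the ratio)
    return (stats.norm.logpdf(theta, loc=10.0, scale=2.0)
            + stats.norm.logpdf(7.2, loc=theta, scale=1.5))

theta, samples = 10.0, []
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 1.0)          # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                              # accept; otherwise keep theta
    samples.append(theta)
draws = np.array(samples[2_000:])                     # discard burn-in

# The exact conjugate answer, for comparison
exact_prec = 1 / 2.0**2 + 1 / 1.5**2
exact_mu = (10.0 / 2.0**2 + 7.2 / 1.5**2) / exact_prec
exact_sd = exact_prec**-0.5
print(f"MCMC:  {draws.mean():.2f} ± {draws.std():.2f}")
print(f"Exact: {exact_mu:.2f} ± {exact_sd:.2f}")
```

The sampler recovers the conjugate posterior to within Monte Carlo error, but unlike conjugacy it keeps working when `log_post` is replaced by any function you can evaluate, which is the whole point of Chapter 21.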
Advanced Sidebar: The Bernstein-von Mises theorem provides conditions under which the Bayesian posterior converges to a Normal distribution centered at the MLE with variance equal to the inverse Fisher information, as $n \to \infty$. This is the "Bayesian CLT" — it guarantees that, for well-specified regular parametric models, the frequentist and Bayesian approaches agree asymptotically. The conditions include: correct model specification, compact parameter space, non-degenerate Fisher information, and a prior that is positive and continuous at the true parameter value. When these conditions fail (misspecified models, non-regular problems, high-dimensional settings), the posterior can diverge from the frequentist result, and understanding why requires the tools of Chapter 34 (Uncertainty Quantification).
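The Bernstein-von Mises convergence can be seen numerically. The sketch below (an assumed Beta(2, 2) prior and a fixed observed proportion of 0.3, chosen here for reproducibility) compares the exact Beta posterior with its asymptotic Normal approximation as $n$ grows:

```python
import numpy as np
from scipy import stats

# Bernstein-von Mises, numerically: for Bernoulli data with a Beta(2, 2)
# prior, the exact Beta posterior approaches N(p_hat, p_hat*(1-p_hat)/n).
true_p = 0.3
for n in [20, 200, 2000]:
    k = round(true_p * n)                          # observed successes
    post = stats.beta(2 + k, 2 + n - k)            # exact conjugate posterior
    p_hat = k / n                                  # MLE
    approx_sd = np.sqrt(p_hat * (1 - p_hat) / n)   # sqrt(1 / (n * Fisher info))
    print(f"n={n:5d}: posterior mean {post.mean():.3f} vs MLE {p_hat:.3f}; "
          f"posterior sd {post.std():.4f} vs asymptotic sd {approx_sd:.4f}")
```

At $n = 20$ the prior visibly pulls the posterior mean away from the MLE and widens the spread relative to the asymptotic formula; by $n = 2000$ the two agree to three decimal places, which is the theorem doing its work.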
Summary
Bayesian inference is not an alternative to frequentist statistics — it is a complementary framework that answers different questions. Bayes' theorem provides the optimal way to update beliefs with evidence: posterior $\propto$ likelihood $\times$ prior. For conjugate models (Beta-Binomial, Normal-Normal, and the exponential family more broadly), the posterior has a closed form that updates by simple arithmetic on sufficient statistics.
The MAP-MLE-regularization triangle reveals that the Bayesian framework subsumes familiar tools: MLE is MAP with a flat prior, L2 regularization is MAP with a Gaussian prior, and L1 regularization is MAP with a Laplace prior. Understanding this connection changes how you think about both regularization and prior selection.
The practical framework for prior selection — uninformative, weakly informative, and informative priors, validated by prior predictive checks and prior sensitivity analysis — removes the mysticism from prior choice. Prior selection is an engineering decision, not a philosophical commitment.
Bayesian inference adds the most value when data is scarce, prior knowledge is genuine, uncertainty quantification matters, the problem has hierarchical structure, or decisions must be made sequentially. It adds the least value when data is abundant, only predictions are needed, computational budgets are tight, and priors offer no advantage.
The StreamRec application demonstrates the practical payoff: a per-user preference model that handles cold start through high prior uncertainty, converges to personal preferences as data accumulates, and supports exploration through Thompson sampling — all in microseconds per update, all with a closed-form conjugate model.
The tools in this chapter are exact but limited to conjugate models. Chapter 21 removes the conjugacy requirement by introducing MCMC and variational inference, making the full power of Bayesian inference available for arbitrary models.
Chapter 20 Notation Reference
| Symbol | Meaning |
|---|---|
| $\theta$ | Parameter(s) of interest |
| $D$ | Observed data |
| $p(\theta)$ | Prior distribution |
| $p(D \mid \theta)$ | Likelihood |
| $p(\theta \mid D)$ | Posterior distribution |
| $p(D)$ | Marginal likelihood (evidence) |
| $\text{Beta}(\alpha, \beta)$ | Beta distribution with parameters $\alpha, \beta$ |
| $\mathcal{N}(\mu, \sigma^2)$ | Normal distribution with mean $\mu$, variance $\sigma^2$ |
| $\hat{\theta}_{\text{MLE}}$ | Maximum likelihood estimate |
| $\hat{\theta}_{\text{MAP}}$ | Maximum a posteriori estimate |
| $\text{BF}_{12}$ | Bayes factor comparing model 1 to model 2 |
| $\text{HPDI}$ | Highest posterior density interval |