Chapter 10: Exercises

These exercises progress from foundational probability manipulation to implementing Bayesian algorithms from scratch. Solutions for the programming exercises are available in code/exercise-solutions.py.


Foundational Exercises

Exercise 10.1: Bayesian Update by Hand

A factory produces widgets that are either defective (D) or non-defective (ND). Historical data suggests $P(D) = 0.02$. A quality-control test has a true positive rate of $P(+ \mid D) = 0.95$ and a false positive rate of $P(+ \mid ND) = 0.03$.

(a) A randomly selected widget tests positive. What is the posterior probability that it is defective?

(b) Suppose we test the widget again (independently), and it tests positive a second time. Using the posterior from part (a) as the new prior, compute the updated probability of defect.

(c) How many consecutive positive tests would be needed for the posterior probability of defect to exceed 0.99?
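A small helper can be used to check your hand computations. The sketch below assumes the repeated tests are conditionally independent given the widget's true state, as the exercise states:

```python
# Sketch: sequential Bayesian updating for Exercise 10.1.
# Assumes repeated positive tests are conditionally independent given D / ND.

def bayes_update(prior_d, p_pos_given_d=0.95, p_pos_given_nd=0.03):
    """One Bayesian update of P(D) after observing a positive test."""
    num = prior_d * p_pos_given_d
    den = num + (1 - prior_d) * p_pos_given_nd
    return num / den

p = 0.02                      # prior P(D)
tests = 0
while p <= 0.99:              # part (c): count tests until the posterior exceeds 0.99
    p = bayes_update(p)
    tests += 1
    print(f"after test {tests}: P(D | data) = {p:.4f}")
```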


Exercise 10.2: Beta-Binomial Conjugacy

You are estimating the click-through rate (CTR) of an online ad. Your prior belief is that the CTR is around 5% with moderate uncertainty, so you choose a Beta(5, 95) prior.

(a) What are the prior mean and prior standard deviation?

(b) After showing the ad 200 times, you observe 18 clicks. Compute the posterior distribution and its mean and standard deviation.

(c) Compute the 95% credible interval for the CTR.

(d) A colleague argues for a Beta(1, 1) prior instead. How does this change the posterior mean? At what sample size do the two priors yield approximately the same posterior?
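To verify your answers numerically, the conjugate update can be sketched with scipy.stats (the Beta(5, 95) prior and the 18-of-200 data are from the exercise):

```python
# Sketch for Exercise 10.2: Beta-Binomial conjugate update for the CTR.
from scipy import stats

a0, b0 = 5, 95                     # Beta(5, 95) prior
prior = stats.beta(a0, b0)
print("prior mean:", prior.mean(), "prior sd:", prior.std())

clicks, shows = 18, 200            # observed data
post = stats.beta(a0 + clicks, b0 + shows - clicks)   # Beta(23, 277)
print("posterior mean:", post.mean(), "posterior sd:", post.std())

lo, hi = post.interval(0.95)       # equal-tailed 95% credible interval
print(f"95% credible interval: ({lo:.4f}, {hi:.4f})")
```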


Exercise 10.3: MAP vs. MLE

For the coin-flip model with $h$ heads in $n$ flips and a Beta($\alpha$, $\beta$) prior:

(a) Derive the Maximum Likelihood Estimate (MLE) for $\theta$.

(b) Derive the Maximum A Posteriori (MAP) estimate for $\theta$.

(c) Show that the MAP estimate reduces to the MLE when $\alpha = \beta = 1$.

(d) For what values of $\alpha$ and $\beta$ does the MAP estimate equal the posterior mean?


Exercise 10.4: Conjugate Prior Derivation

Starting from the Poisson likelihood $p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$ and the Gamma prior $p(\lambda) = \text{Gamma}(\alpha, \beta)$:

(a) Derive the posterior distribution $p(\lambda \mid x_1, \ldots, x_n)$.

(b) Show that the posterior is also a Gamma distribution and identify the updated parameters.

(c) A call center receives calls with an unknown rate $\lambda$ per hour. With a Gamma(2, 1) prior, you observe 15 calls in 3 hours. What is the posterior distribution? What is the posterior mean for the hourly rate?
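Part (c) can be checked numerically. The sketch below treats the 3 hours as three unit intervals whose counts sum to 15, so the conjugate update is Gamma($\alpha + \sum x_i$, $\beta + n$); note that scipy parameterizes the Gamma by shape and scale, not shape and rate:

```python
# Sketch for Exercise 10.4(c): Gamma-Poisson conjugate update.
from scipy import stats

alpha0, beta0 = 2, 1              # Gamma(2, 1) prior (shape, rate)
total_calls, hours = 15, 3

alpha_n = alpha0 + total_calls    # shape: alpha + sum of counts
beta_n = beta0 + hours            # rate: beta + number of intervals
post = stats.gamma(a=alpha_n, scale=1 / beta_n)   # scipy uses scale = 1/rate
print(f"posterior: Gamma({alpha_n}, {beta_n}), mean = {post.mean():.3f}")
```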


Exercise 10.5: Normal-Normal Conjugacy

A sensor measures temperature with known measurement noise $\sigma^2 = 4$. Your prior for the true temperature is $\mathcal{N}(20, 25)$ (in Celsius).

(a) After a single measurement of $x = 23$, compute the posterior distribution for the true temperature.

(b) After 10 measurements with sample mean $\bar{x} = 22.5$, compute the posterior.

(c) How many measurements would be needed for the posterior standard deviation to be less than 0.5?
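The standard precision-weighted update can be packaged as a small function to check parts (a) and (b); the numbers below are the exercise's priors and data:

```python
# Sketch for Exercise 10.5: Normal-Normal update with known noise variance.

def normal_update(mu0, var0, xbar, noise_var, n):
    """Posterior for the mean given n observations with sample mean xbar."""
    prec = 1 / var0 + n / noise_var          # precisions add
    var_n = 1 / prec
    mu_n = var_n * (mu0 / var0 + n * xbar / noise_var)
    return mu_n, var_n

# (a) single measurement x = 23
mu1, var1 = normal_update(mu0=20, var0=25, xbar=23, noise_var=4, n=1)
# (b) ten measurements with sample mean 22.5
mu10, var10 = normal_update(mu0=20, var0=25, xbar=22.5, noise_var=4, n=10)
print(f"(a) N({mu1:.3f}, {var1:.3f}),  (b) N({mu10:.3f}, {var10:.3f})")
```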


Intermediate Exercises

Exercise 10.6: Naive Bayes from Scratch

Implement a Gaussian Naive Bayes classifier from scratch (without using scikit-learn's implementation).

(a) Write a function that estimates class priors and per-class feature means and variances from training data.

(b) Write a prediction function that computes log posteriors for each class and returns the predicted class.

(c) Test your implementation on the Iris dataset and compare accuracy with sklearn.naive_bayes.GaussianNB.

(d) Implement Laplace smoothing for the categorical case and test on a simple text classification problem.


Exercise 10.7: Bayesian Linear Regression Implementation

Using the equations from Section 10.3:

(a) Generate synthetic data from $y = 2x + 1 + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.5^2)$ with 20 data points.

(b) Implement Bayesian linear regression with a $\mathcal{N}(\mathbf{0}, 5^2 \mathbf{I})$ prior and known noise variance $\sigma^2 = 0.25$.

(c) Plot the posterior predictive mean and 95% credible band. Also plot samples from the posterior over weights to show plausible regression lines.

(d) Repeat with 5 data points and 200 data points. How does the width of the credible band change?
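A starting point for the posterior computation, assuming the zero-mean prior and noise variance given in the exercise (the data range and random seed here are arbitrary choices):

```python
# Starter sketch for Exercise 10.7: closed-form Bayesian linear regression
# with prior w ~ N(0, 5^2 I) and known noise variance sigma^2 = 0.25.
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(-3, 3, n)
y = 2 * x + 1 + rng.normal(0, 0.5, n)      # true weights: intercept 1, slope 2

X = np.column_stack([np.ones(n), x])       # design matrix with bias column
noise_var, prior_var = 0.25, 25.0

# Posterior over w: S_N = (S_0^-1 + X^T X / sigma^2)^-1,  m_N = S_N X^T y / sigma^2
S_N = np.linalg.inv(np.eye(2) / prior_var + X.T @ X / noise_var)
m_N = S_N @ (X.T @ y) / noise_var
print("posterior mean weights:", m_N)      # should land near [1, 2]
```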


Exercise 10.8: Prior Predictive Checks

For a simple linear regression model $y = \beta_0 + \beta_1 x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$:

(a) Place priors $\beta_0 \sim \mathcal{N}(0, 100)$, $\beta_1 \sim \mathcal{N}(0, 100)$, $\sigma \sim \text{HalfNormal}(50)$. Generate 100 prior predictive datasets for $x \in [-5, 5]$ and plot them. Do the predictions look reasonable?

(b) Now use $\beta_0 \sim \mathcal{N}(0, 5)$, $\beta_1 \sim \mathcal{N}(0, 2)$, $\sigma \sim \text{HalfNormal}(1)$. Repeat the prior predictive check. How does the range of plausible outcomes change?

(c) Explain why prior predictive checks are important even when using "non-informative" priors.


Exercise 10.9: Metropolis-Hastings Sampler

(a) Implement the Metropolis-Hastings algorithm to sample from a mixture of two Gaussians: $p(x) = 0.3 \cdot \mathcal{N}(-3, 1) + 0.7 \cdot \mathcal{N}(3, 0.5^2)$.

(b) Run the sampler with proposal standard deviations of 0.1, 1.0, and 10.0. Produce trace plots and histograms for each. Which setting mixes best?

(c) Compute the acceptance rate for each proposal scale. What is the theoretical optimal acceptance rate for a 1D target?

(d) Implement a simple adaptive scheme that adjusts the proposal standard deviation during burn-in to target a 44% acceptance rate (optimal for 1D).
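A minimal random-walk sampler for part (a) might look like the following; the proposal scale, chain length, and seed here are illustrative choices, not prescribed by the exercise:

```python
# Starter sketch for Exercise 10.9: random-walk Metropolis-Hastings
# targeting the mixture 0.3*N(-3, 1) + 0.7*N(3, 0.5^2).
import numpy as np

def log_target(x):
    comp1 = 0.3 * np.exp(-0.5 * (x + 3) ** 2) / np.sqrt(2 * np.pi)
    comp2 = 0.7 * np.exp(-0.5 * ((x - 3) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
    return np.log(comp1 + comp2)

def metropolis(n_steps, proposal_sd, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x, accepted = x0, 0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + rng.normal(0, proposal_sd)
        # accept with probability min(1, p(prop)/p(x)), computed in log space
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x, accepted = prop, accepted + 1
        samples[i] = x
    return samples, accepted / n_steps

samples, acc_rate = metropolis(50_000, proposal_sd=2.5)
print("acceptance rate:", acc_rate, "sample mean:", samples.mean())
```

The true mean of the target is $0.3 \cdot (-3) + 0.7 \cdot 3 = 1.2$, which the sample mean should approach if the chain mixes between the two modes.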


Exercise 10.10: MCMC Diagnostics

Using the samples from Exercise 10.9:

(a) Implement and plot the autocorrelation function for each chain. Estimate the effective sample size.

(b) Run 4 independent chains from different starting points. Compute the Gelman-Rubin $\hat{R}$ statistic.

(c) Determine the minimum burn-in period by examining when the running mean stabilizes.

(d) What is the relationship between effective sample size and the integrated autocorrelation time?
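For part (a), a simple ESS estimate divides the chain length by the integrated autocorrelation time $\tau = 1 + 2\sum_k \rho_k$. The sketch below uses a crude truncation rule (stop at the first negative autocorrelation); more careful rules exist:

```python
# Starter sketch for Exercise 10.10(a): autocorrelation and a simple
# effective-sample-size estimate.
import numpy as np

def autocorr(x, max_lag):
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

def effective_sample_size(x, max_lag=200):
    rho = autocorr(x, max_lag)
    tau = 1.0                        # integrated autocorrelation time
    for k in range(1, len(rho)):
        if rho[k] < 0:               # crude truncation at the first negative lag
            break
        tau += 2 * rho[k]
    return len(x) / tau

rng = np.random.default_rng(1)
iid = rng.normal(size=10_000)
print("ESS for iid samples:", effective_sample_size(iid))   # close to 10,000
```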


Exercise 10.11: Variational Inference for a Gaussian

Consider the problem of estimating the mean $\mu$ of a Gaussian with known variance $\sigma^2 = 1$, given data $x_1, \ldots, x_n$.

(a) Write down the ELBO for a variational family $q(\mu) = \mathcal{N}(m, s^2)$.

(b) Derive the optimal variational parameters $m^*$ and $s^{*2}$ by maximizing the ELBO analytically.

(c) Compare the variational posterior with the exact posterior (which you derived in Exercise 10.5). Are they the same? Why or why not?


Exercise 10.12: Gaussian Process Regression

(a) Implement GP regression from scratch using the RBF kernel. Generate data from $f(x) = \sin(x)$ with noise $\sigma_n^2 = 0.1$ for 10 training points in $[-5, 5]$.

(b) Plot the GP posterior mean and 95% credible band. How does uncertainty vary across the input space?

(c) Vary the length scale $\ell$ in $\{0.1, 0.5, 1.0, 3.0\}$ and observe how the fit changes. What happens with very small or very large length scales?

(d) Implement log marginal likelihood computation and use scipy.optimize.minimize to optimize the kernel hyperparameters.
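A starting point for parts (a) and (b): the standard GP posterior equations with a Cholesky-based solve. The unit signal variance and length scale here are illustrative defaults:

```python
# Starter sketch for Exercise 10.12: exact GP regression with an RBF kernel.
import numpy as np

def rbf(X1, X2, length_scale=1.0, signal_var=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-5, 5, 10)
y_train = np.sin(X_train) + rng.normal(0, np.sqrt(0.1), 10)
X_test = np.linspace(-5, 5, 100)

noise_var = 0.1
K = rbf(X_train, X_train) + noise_var * np.eye(10)   # noisy training covariance
K_s = rbf(X_train, X_test)
K_ss = rbf(X_test, X_test)

L = np.linalg.cholesky(K)                            # stable solve via Cholesky
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
mean = K_s.T @ alpha                                 # posterior mean
v = np.linalg.solve(L, K_s)
var = np.diag(K_ss) - np.sum(v ** 2, axis=0)         # posterior variance
print("posterior mean/var computed at", len(X_test), "test points")
```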


Exercise 10.13: Kernel Composition

(a) Create a dataset that combines a linear trend with periodic oscillations: $f(x) = 0.5x + 2\sin(x)$.

(b) Fit a GP with a single RBF kernel. Does it capture both the trend and the periodicity?

(c) Now fit with a composite kernel: Linear + Periodic. Show that this better captures the data structure.

(d) Explain intuitively why summing kernels corresponds to summing independent function components.


Advanced Exercises

Exercise 10.14: Bayesian Model Comparison

Consider two models for coin flip data ($h$ heads in $n$ flips):

  • $M_1$: Fair coin, $\theta = 0.5$
  • $M_2$: Unknown bias, $\theta \sim \text{Beta}(1, 1)$

(a) Derive the marginal likelihood $p(\mathcal{D} \mid M_1)$ for model $M_1$.

(b) Derive the marginal likelihood $p(\mathcal{D} \mid M_2)$ for model $M_2$ (hint: use the Beta-Binomial integral).

(c) Compute the Bayes factor $\text{BF}_{12} = p(\mathcal{D} \mid M_1) / p(\mathcal{D} \mid M_2)$ for $n = 20, h = 10$ and for $n = 20, h = 15$.

(d) Interpret the Bayes factors using the Jeffreys scale. What does each result suggest about the fairness of the coin?
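Part (c) can be checked with a few lines, treating the data as the count $h$ out of $n$ (so the binomial coefficient appears in both marginal likelihoods and the Beta-Binomial integral for $M_2$ reduces to $1/(n+1)$):

```python
# Sketch for Exercise 10.14(c): marginal likelihoods and Bayes factors.
from math import comb

def marg_lik_fair(n, h):
    # M1: theta fixed at 0.5
    return comb(n, h) * 0.5 ** n

def marg_lik_beta(n, h):
    # M2 with Beta(1, 1) prior: the Beta-Binomial integral gives 1/(n+1)
    return 1 / (n + 1)

for h in (10, 15):
    bf12 = marg_lik_fair(20, h) / marg_lik_beta(20, h)
    print(f"n=20, h={h}: BF_12 = {bf12:.3f}")
```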


Exercise 10.15: Bayesian Logistic Regression with MCMC

(a) Generate a 2D binary classification dataset using sklearn.datasets.make_classification.

(b) Define a logistic regression model with Gaussian priors on the weights: $w_j \sim \mathcal{N}(0, 10)$.

(c) Implement the Metropolis-Hastings algorithm to sample from the posterior over weights.

(d) For each posterior sample, compute the decision boundary. Plot several sampled decision boundaries overlaid on the data. How does the set of boundaries reflect classification uncertainty?


Exercise 10.16: Empirical Bayes

In the normal means problem, we observe $x_i \sim \mathcal{N}(\theta_i, 1)$ for $i = 1, \ldots, n$, where $\theta_i \sim \mathcal{N}(\mu, \tau^2)$.

(a) Derive the marginal distribution $p(x_i \mid \mu, \tau^2)$.

(b) Write down the log-likelihood for the hyperparameters $(\mu, \tau^2)$ and derive the maximum marginal likelihood estimates.

(c) Given the estimated hyperparameters, derive the posterior $p(\theta_i \mid x_i, \hat{\mu}, \hat{\tau}^2)$ and show it shrinks $x_i$ toward $\hat{\mu}$.

(d) Implement this on simulated data with $n = 50$ and compare the shrinkage estimates with the raw observations. Compute the MSE improvement.
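A sketch of part (d) using simple moment-style hyperparameter estimates (the exercise asks for maximum marginal likelihood; the plug-in estimates below agree closely for this model and keep the code short; the simulation settings are arbitrary):

```python
# Starter sketch for Exercise 10.16(d): empirical Bayes shrinkage in the
# normal means problem.
import numpy as np

rng = np.random.default_rng(42)
n = 500
mu_true, tau_true = 1.0, 0.5
theta = rng.normal(mu_true, tau_true, n)       # latent means
x = rng.normal(theta, 1.0)                     # observations with unit noise

# Marginally x_i ~ N(mu, tau^2 + 1), so estimate hyperparameters from x
mu_hat = x.mean()
tau2_hat = max(x.var(ddof=1) - 1.0, 0.0)

# The posterior mean shrinks each x_i toward mu_hat
shrink = tau2_hat / (tau2_hat + 1.0)
theta_hat = mu_hat + shrink * (x - mu_hat)

mse_raw = np.mean((x - theta) ** 2)
mse_eb = np.mean((theta_hat - theta) ** 2)
print(f"MSE raw: {mse_raw:.3f}, MSE shrunk: {mse_eb:.3f}")
```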


Exercise 10.17: Sparse GP Approximation

(a) Explain why exact GP regression scales as $\mathcal{O}(n^3)$. What are the bottleneck operations?

(b) Implement a simple inducing-point approximation (Nyström method) using $m = 20$ inducing points for a dataset of $n = 500$ points.

(c) Compare the predictive distribution of the sparse GP with the exact GP. Plot both.

(d) Time both methods and report the speedup.


Exercise 10.18: Thompson Sampling

Thompson Sampling is a Bayesian approach to the multi-armed bandit problem (a sequential decision-making problem where an agent repeatedly chooses between actions with uncertain rewards).

(a) Explain Thompson Sampling for a Bernoulli bandit with $K$ arms: sample from each arm's posterior, play the arm with the highest sample, and update.

(b) Implement Thompson Sampling for a 3-arm bandit with true rates $(0.3, 0.5, 0.7)$ using Beta-Bernoulli conjugacy.

(c) Run for 1000 rounds and plot the cumulative regret.

(d) Compare with an epsilon-greedy strategy ($\epsilon = 0.1$). Which achieves lower regret?
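A compact implementation of part (b); the Beta(1, 1) priors and the seed are illustrative choices:

```python
# Starter sketch for Exercise 10.18(b): Beta-Bernoulli Thompson Sampling.
import numpy as np

rng = np.random.default_rng(7)
true_rates = np.array([0.3, 0.5, 0.7])
K = len(true_rates)
alpha = np.ones(K)                 # Beta(1, 1) prior for each arm
beta = np.ones(K)
pulls = np.zeros(K, dtype=int)

for t in range(1000):
    samples = rng.beta(alpha, beta)        # one posterior draw per arm
    arm = int(np.argmax(samples))          # play the best-looking arm
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward                   # conjugate update
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)             # the 0.7 arm should dominate
```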


Exercise 10.19: Bayesian Neural Network (Conceptual)

(a) Explain how placing priors on neural network weights turns a standard neural network into a Bayesian neural network (BNN).

(b) What is the "Bayes by Backprop" algorithm (Blundell et al., 2015)? Describe the variational family and the objective.

(c) Why is the mean-field assumption particularly limiting for BNNs? What correlations does it miss?

(d) Describe MC Dropout (Gal and Ghahramani, 2016) as an approximate variational inference method. How does it produce uncertainty estimates at test time?


Exercise 10.20: Posterior Predictive Checks

(a) Fit a Poisson regression model to count data: $y_i \sim \text{Poisson}(\exp(\beta_0 + \beta_1 x_i))$. Use a simple dataset where $x$ is a single feature.

(b) Generate 500 datasets from the posterior predictive distribution by: sampling weights from the posterior, then sampling data from the likelihood.

(c) Compare the distribution of the maximum value, the mean, and the variance in the replicated datasets with the observed data. Are there systematic discrepancies?

(d) If the Poisson model is misspecified (e.g., the true data has overdispersion), what patterns would you expect in the posterior predictive checks?


Exercise 10.21: Information-Theoretic Prior Analysis

(a) Define the Jeffreys prior for the Bernoulli model $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$. Show that it is $\text{Beta}(1/2, 1/2)$.

(b) Define the Jeffreys prior for the normal distribution with known variance. Show that it is an improper uniform prior on $\mu$.

(c) Explain the invariance property of Jeffreys priors under reparameterization. Why is this desirable?

(d) Give an example where the Jeffreys prior leads to an improper posterior. Why is this problematic?


Exercise 10.22: Bayesian Optimization (Application)

Bayesian optimization uses GPs to optimize expensive black-box functions.

(a) Explain the expected improvement (EI) acquisition function. Why does it balance exploration and exploitation?

(b) Implement a simple 1D Bayesian optimization loop using a GP surrogate model and the EI acquisition function.

(c) Apply it to minimize $f(x) = -(x \sin(x) + x \cos(2x))$ on $[-5, 5]$. Start with 3 random evaluations and run for 15 iterations.

(d) Plot the GP surrogate, the EI function, and the next query point at each iteration.


Exercise 10.23: Gibbs Sampling

(a) Derive the full conditional distributions for a bivariate normal distribution $\mathcal{N}\!\left(\begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right)$.

(b) Implement a Gibbs sampler for the bivariate normal.

(c) Show that the Gibbs sampler mixes slowly when $\rho$ is close to 1 or -1. Plot trace plots for $\rho = 0.0, 0.9, 0.99$.

(d) Explain why HMC does not suffer from this slow mixing. What information does HMC use that Gibbs does not?


Exercise 10.24: Conjugate Prior Practice

For each of the following scenarios, identify the appropriate likelihood, select a conjugate prior, and compute the posterior:

(a) A website records the number of visitors per hour. In 8 hours, you observe: 42, 55, 38, 61, 49, 53, 44, 57.

(b) A medical test is administered to 200 patients. 35 test positive. Estimate the positive rate.

(c) You measure the heights of 15 students (in cm): 168, 172, 165, 175, 170, 177, 163, 180, 171, 169, 174, 167, 178, 173, 166. Assume known variance $\sigma^2 = 25$.

(d) An experiment measures waiting times (in minutes) between events: 2.3, 4.1, 1.7, 3.5, 5.2, 2.8, 3.9, 1.5. Assume an exponential model.


Exercise 10.25: Comparison of Inference Methods

For the Bayesian logistic regression model from Exercise 10.15:

(a) Compute the posterior using the Laplace approximation (Gaussian centered at the MAP estimate with covariance given by the inverse Hessian).

(b) Implement mean-field variational inference with Gaussian variational distributions for each weight.

(c) Compare the posterior approximations from (a) and (b) with the MCMC posterior from Exercise 10.15 (which serves as ground truth). Plot marginal distributions for each weight.

(d) Discuss the trade-offs: accuracy, computation time, ease of implementation.


Exercise 10.26: Sequential Bayesian Learning

(a) Implement sequential Bayesian updates for the Beta-Binomial model, processing one observation at a time.

(b) Show that the result after processing all data sequentially is identical to processing the data in one batch.

(c) Plot the posterior evolution after each observation for a dataset of 50 coin flips with $\theta = 0.7$.

(d) Implement a simple change-point detection scheme: if the posterior predictive probability of the next observation drops below a threshold, reset the prior. Test with data where the true $\theta$ changes from 0.3 to 0.7 at observation 25.
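Parts (a) and (b) amount to a few lines; because the Beta-Binomial update only accumulates counts, the order of processing cannot matter:

```python
# Sketch for Exercise 10.26(a, b): sequential vs. batch Beta-Binomial updates.
import numpy as np

rng = np.random.default_rng(3)
flips = rng.random(50) < 0.7             # 50 coin flips with theta = 0.7

# Sequential: update after every observation
a, b = 1.0, 1.0                          # Beta(1, 1) prior
for flip in flips:
    a += flip
    b += 1 - flip

# Batch: one update with the total counts
a_batch = 1.0 + flips.sum()
b_batch = 1.0 + (len(flips) - flips.sum())

print((a, b), (a_batch, b_batch))        # identical posteriors
```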


Exercise 10.27: Gaussian Process Classification

(a) Explain why GP classification is more complex than GP regression (hint: the likelihood is non-Gaussian).

(b) Use sklearn.gaussian_process.GaussianProcessClassifier to classify a 2D dataset generated with make_moons.

(c) Plot the predictive probability surface. How does the GP express uncertainty at the decision boundary versus far from training data?

(d) Compare the GP classifier's uncertainty with that of a logistic regression model and a random forest.


Exercise 10.28: Hamiltonian Monte Carlo (Conceptual)

(a) Explain the physical analogy behind HMC: the parameter as a "position" and the auxiliary momentum variable.

(b) Why does HMC produce proposals that are far from the current state yet have high acceptance probability?

(c) What are the two tuning parameters of HMC? What happens when each is misspecified?

(d) How does NUTS (No-U-Turn Sampler) eliminate one of these tuning parameters?


Exercise 10.29: Bayesian Decision Theory

A medical test costs $50 to administer. A disease has prevalence 1%. The test has sensitivity 90% and specificity 95%. Treatment costs $5,000 but cures the disease. Untreated disease costs $100,000 in long-term care.

(a) Set up the decision problem with a loss function.

(b) Using Bayesian decision theory (minimize expected posterior loss), determine whether to treat a patient who tests positive.

(c) What if the patient tests negative? Should we consider treatment?

(d) At what prevalence rate does the optimal decision change for a positive test result?


Exercise 10.30: Dirichlet-Multinomial Model

You are analyzing the composition of a text corpus across 4 topics. Your Dirichlet prior is Dir(2, 2, 2, 2).

(a) Generate samples from the prior and plot the 3-simplex.

(b) After observing document topic assignments [Topic 1: 30, Topic 2: 50, Topic 3: 15, Topic 4: 5], compute the posterior.

(c) Compute the posterior mean and 95% credible interval for each topic proportion.

(d) What is the posterior probability that Topic 2 is the most prevalent topic?
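Parts (b) and (d) can be checked by direct conjugate updating plus Monte Carlo (the sample size and seed below are arbitrary):

```python
# Sketch for Exercise 10.30: Dirichlet-Multinomial update and a Monte Carlo
# estimate of P(Topic 2 is most prevalent).
import numpy as np

prior = np.array([2.0, 2.0, 2.0, 2.0])
counts = np.array([30, 50, 15, 5])
posterior = prior + counts               # Dir(32, 52, 17, 7)
print("posterior mean:", posterior / posterior.sum())

rng = np.random.default_rng(0)
draws = rng.dirichlet(posterior, size=20_000)
p_topic2_max = np.mean(np.argmax(draws, axis=1) == 1)
print("P(Topic 2 most prevalent) ~", p_topic2_max)
```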


Exercise 10.31: Predictive Distribution

(a) For the Beta-Binomial model with posterior Beta($\alpha$, $\beta$), derive the posterior predictive distribution for a single new observation $\tilde{x} \in \{0, 1\}$.

(b) For the Normal-Normal model with posterior $\mathcal{N}(\mu_n, \sigma_n^2)$ and known data variance $\sigma^2$, derive the posterior predictive for a new observation.

(c) Show that the predictive distribution has greater variance than the likelihood alone. Explain why in terms of epistemic uncertainty.


Exercise 10.32: Robust Bayesian Methods

(a) Explain why a Gaussian likelihood is sensitive to outliers. What happens to the posterior mean when one data point is very far from the rest?

(b) Replace the Gaussian likelihood with a Student-t likelihood (heavier tails). Using MCMC, show that the posterior is more robust to outliers.

(c) Generate a dataset with 50 points from $\mathcal{N}(0, 1)$ plus 3 outliers at $x = 10$. Compare the posterior for $\mu$ under Gaussian vs. Student-t likelihoods.


Exercise 10.33: Multi-Output Gaussian Processes

(a) Explain how coregionalization extends GPs to multiple correlated outputs.

(b) Generate synthetic data for two correlated outputs: $f_1(x) = \sin(x)$ and $f_2(x) = \cos(x)$.

(c) Using scikit-learn, fit independent GPs to each output. Then, describe (conceptually) how a multi-output GP could share information between outputs.

(d) When would multi-output GPs be advantageous over independent GPs?


Exercise 10.34: Approximate Bayesian Computation (Conceptual)

(a) Explain when standard Bayesian inference is infeasible (hint: intractable likelihood).

(b) Describe the basic ABC rejection algorithm. What is the role of the tolerance $\epsilon$ and summary statistics?

(c) Give an example of a model where the likelihood is intractable but data can be simulated.

(d) What are the limitations of ABC? How does the curse of dimensionality affect it?


Exercise 10.35: Calibration of Probabilistic Predictions

(a) Define what it means for a probabilistic classifier to be "well-calibrated."

(b) Train a Gaussian Naive Bayes and a Bayesian logistic regression on a dataset of your choice. Plot calibration curves for both models.

(c) Compute the Expected Calibration Error (ECE) for each model.

(d) Apply Platt scaling or isotonic regression to improve calibration. Report the change in ECE.