Chapter 20: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Foundational Computations

Exercise 20.1 (*)

A website runs an A/B test. In the control group, 45 out of 300 visitors convert. In the treatment group, 62 out of 310 visitors convert.

(a) Using a Beta(1, 1) prior for both groups, compute the posterior distributions for the conversion rates $\theta_{\text{control}}$ and $\theta_{\text{treatment}}$.

(b) Using Monte Carlo sampling (draw 100,000 samples from each posterior), estimate $P(\theta_{\text{treatment}} > \theta_{\text{control}} \mid D)$.

(c) Compute the posterior distribution of the lift: $\delta = \theta_{\text{treatment}} - \theta_{\text{control}}$. Report the posterior mean and 95% HPDI of $\delta$.
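Parts (b) and (c) reduce to a few lines of sampling. A minimal starting-point sketch, assuming NumPy with a fixed seed (the seed and variable names are ours, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Beta(1, 1) prior + Binomial data -> Beta(1 + successes, 1 + failures)
control = rng.beta(1 + 45, 1 + 300 - 45, n_draws)
treatment = rng.beta(1 + 62, 1 + 310 - 62, n_draws)

prob_treatment_better = (treatment > control).mean()   # part (b)
delta = treatment - control                            # part (c)
```

The 95% HPDI for $\delta$ can then be found as the shortest interval containing 95% of the sorted `delta` samples.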


Exercise 20.2 (*)

A manufacturing process produces widgets. Historical quality control records indicate that approximately 2% of widgets are defective, with a standard deviation of 0.5%. A new batch of 50 widgets contains 3 defective items.

(a) Encode the historical knowledge as a Beta prior. (Hint: solve for $\alpha, \beta$ such that the mean is 0.02 and the variance matches.)

(b) Compute the posterior distribution.

(c) What is the posterior probability that the defect rate exceeds 5%?

(d) Compare your posterior mean with the MLE. Why do they differ?
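Part (a) is a method-of-moments calculation that the rest of the exercise builds on. A sketch, assuming SciPy (variable names are illustrative):

```python
from scipy import stats

m, s = 0.02, 0.005                   # historical mean and SD of the defect rate
nu = m * (1 - m) / s**2 - 1          # "prior sample size" alpha + beta
alpha0, beta0 = m * nu, (1 - m) * nu

alpha_n, beta_n = alpha0 + 3, beta0 + 50 - 3        # 3 defects in 50 widgets
post_mean = alpha_n / (alpha_n + beta_n)
p_exceeds_5pct = stats.beta.sf(0.05, alpha_n, beta_n)  # P(theta > 0.05 | D)
mle = 3 / 50
```

The posterior mean sits between the prior mean (0.02) and the MLE (0.06), weighted by the prior's large effective sample size.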


Exercise 20.3 (*)

For the Normal-Normal conjugate model (known $\sigma^2$), verify the following by direct substitution:

(a) When $\sigma_0 \to \infty$ (flat prior), the posterior mean equals $\bar{x}$ (the MLE).

(b) When $n \to \infty$ (infinite data), the posterior mean converges to $\bar{x}$ regardless of the prior.

(c) The posterior variance is always less than both the prior variance $\sigma_0^2$ and the likelihood variance $\sigma^2/n$. (Hint: use the precision form.)


Exercise 20.4 (*)

Compute the posterior for the Poisson-Gamma conjugate model.

Setup: Observations $x_1, \ldots, x_n \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$, prior $\lambda \sim \text{Gamma}(\alpha, \beta)$.

(a) Write the likelihood $p(x_1, \ldots, x_n \mid \lambda)$.

(b) Multiply by the Gamma prior and identify the posterior as another Gamma distribution. State the posterior parameters in terms of $\alpha, \beta, n$, and $\sum x_i$.

(c) A call center receives an average of 12 calls per hour based on 6 months of data ($\alpha = 360, \beta = 30$). During a product launch, 18 calls arrive in the first hour. What is the posterior mean for $\lambda$?
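Part (c) is a one-line conjugate update once parts (a) and (b) are done; a sketch:

```python
alpha0, beta0 = 360, 30            # Gamma prior: 360 calls over 30 hours, mean 12
calls, hours = 18, 1               # launch-hour observation

alpha_n, beta_n = alpha0 + calls, beta0 + hours
posterior_mean = alpha_n / beta_n  # (360 + 18) / (30 + 1)
```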


Exercise 20.5 (*)

Consider a Dirichlet-Multinomial model for topic classification. A document can belong to one of 4 topics. The prior is $\text{Dirichlet}(2, 2, 2, 2)$.

(a) What is the prior expected probability for each topic?

(b) After classifying 100 documents — 40 topic A, 25 topic B, 20 topic C, 15 topic D — what is the posterior?

(c) Compute the posterior expected probability for each topic and the 95% marginal credible intervals (each marginal is a Beta distribution — derive this from the Dirichlet).
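A sketch of part (c), assuming SciPy; the key fact is that the marginal of a Dirichlet along coordinate $j$ is Beta$(\alpha_j, \sum_i \alpha_i - \alpha_j)$:

```python
from scipy import stats

alpha_prior = [2, 2, 2, 2]
counts = [40, 25, 20, 15]

alpha_post = [a + k for a, k in zip(alpha_prior, counts)]  # Dirichlet(42, 27, 22, 17)
total = sum(alpha_post)

means = [a / total for a in alpha_post]
# Each marginal is Beta(a_j, total - a_j)
intervals = [stats.beta.interval(0.95, a, total - a) for a in alpha_post]
```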


Conceptual Understanding

Exercise 20.6 (**)

The MAP estimate is not invariant under reparameterization, but the posterior mean is. Demonstrate this concretely.

(a) For the Beta(8, 4) posterior from the coin example, compute the MAP estimate of $\theta$ and the posterior mean of $\theta$.

(b) Reparameterize: let $\phi = \log\frac{\theta}{1 - \theta}$ (the log-odds). Use the change-of-variables formula to derive $p(\phi \mid D)$. Compute the MAP estimate of $\phi$.

(c) Verify that the MAP of $\phi$ is NOT equal to $\log\frac{\hat{\theta}_{\text{MAP}}}{1 - \hat{\theta}_{\text{MAP}}}$. (This is why MAP is "not really Bayesian.")

(d) Verify that $\mathbb{E}[\phi \mid D]$ (computed by numerical integration or sampling) IS consistent with the posterior mean interpretation under reparameterization.
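Parts (b) and (c) can be checked numerically on a grid; a sketch, assuming NumPy (the grid range and resolution are our arbitrary choices):

```python
import numpy as np

a, b = 8, 4
theta_map = (a - 1) / (a + b - 2)                       # 0.7
phi_of_theta_map = np.log(theta_map / (1 - theta_map))  # log(7/3)

# Change of variables: theta = sigmoid(phi), |dtheta/dphi| = theta(1 - theta),
# so p(phi | D) is proportional to theta^a (1 - theta)^b (exponents shift by one)
phi = np.linspace(-5, 5, 200_001)
theta = 1 / (1 + np.exp(-phi))
log_dens = a * np.log(theta) + b * np.log1p(-theta)
phi_map = phi[np.argmax(log_dens)]                      # log 2, not log(7/3)
```

The Jacobian shifts the mode: the MAP in $\phi$-space corresponds to $\theta = 2/3$, not $\theta = 0.7$.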


Exercise 20.7 (**)

The MAP-MLE-regularization triangle in practice.

Using the logistic regression formulation from Section 20.5:

(a) Generate a dataset with 30 samples, 15 features, and 5 truly nonzero coefficients (the rest are zero). Fit three models: MLE (no regularization), ridge (L2), and lasso (L1).

(b) Plot the coefficient paths as a function of the regularization strength $\lambda$ (use 50 values of $\lambda$ from $10^{-3}$ to $10^{3}$ on a log scale).

(c) For each $\lambda$, compute the equivalent prior variance $\tau^2 = 1/\lambda$ for the Gaussian prior and $b = 1/\lambda$ for the Laplace prior. Label the coefficient paths with both the regularization strength and the prior scale.

(d) At what value of $\lambda$ does the lasso first produce a completely sparse solution (all coefficients zero)?


Exercise 20.8 (**)

Sequential updating equivalence.

(a) Prove algebraically that for the Beta-Binomial model, updating sequentially (one observation at a time) produces the same posterior as batch updating (all observations at once). Start with Beta($\alpha, \beta$) prior, observe $x_1 \in \{0, 1\}$, update, then observe $x_2 \in \{0, 1\}$, update. Show the result equals Beta($\alpha + x_1 + x_2, \beta + 2 - x_1 - x_2$).

(b) Implement this in Python: generate 100 Bernoulli observations. Process them one at a time (sequential) and all at once (batch). Verify that the posterior parameters are identical (up to floating-point precision).

(c) Does sequential updating equivalence hold for the Normal-Normal model? Prove it or provide a counterexample.
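A sketch of part (b), assuming NumPy and a fixed seed:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.integers(0, 2, size=100)     # 100 Bernoulli observations

# Sequential: update after every single observation
a_seq, b_seq = 1.0, 1.0
for xi in x:
    a_seq += xi
    b_seq += 1 - xi

# Batch: one update with the totals
a_batch = 1.0 + x.sum()
b_batch = 1.0 + len(x) - x.sum()
```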


Exercise 20.9 (**)

Jeffreys prior for the Bernoulli model.

(a) Compute the Fisher information $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta)\right]$ for a single Bernoulli observation.

(b) Show that the Jeffreys prior is $p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$, which is Beta(1/2, 1/2).

(c) Plot Beta(1/2, 1/2) alongside Beta(1, 1) (uniform). How does the Jeffreys prior differ? What values of $\theta$ does it emphasize?

(d) With $k = 0$ successes in $n = 5$ trials, compare the posteriors under Beta(1, 1) and Beta(1/2, 1/2). Which gives a posterior mean closer to 0? Why might this matter in practice?
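The posterior-mean comparison in part (d) needs no simulation; a sketch:

```python
k, n = 0, 5   # zero successes in five trials

# Uniform prior: posterior Beta(1 + k, 1 + n - k) = Beta(1, 6)
mean_uniform = (1 + k) / ((1 + k) + (1 + n - k))        # 1/7
# Jeffreys prior: posterior Beta(0.5 + k, 0.5 + n - k) = Beta(0.5, 5.5)
mean_jeffreys = (0.5 + k) / ((0.5 + k) + (0.5 + n - k)) # 1/12
```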


Exercise 20.10 (**)

Credible intervals vs. confidence intervals: a simulation study.

(a) Set the true parameter $\theta = 0.3$. Simulate 10,000 experiments, each with $n = 20$ Bernoulli trials. For each experiment, compute:

  • The 95% Wald confidence interval
  • The 95% Bayesian credible interval (equal-tailed, under a Beta(1, 1) prior)

(b) Compute the empirical coverage of each interval type: what fraction of the 10,000 intervals contain $\theta = 0.3$?

(c) Compute the average width of each interval type.

(d) Explain why the coverage and width differ (or agree). Under what conditions would they diverge more?
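A sketch of parts (a) and (b), assuming NumPy and SciPy with a fixed seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 20, 10_000
k = rng.binomial(n, theta, reps)
p_hat = k / n

# Wald interval: p_hat +/- 1.96 sqrt(p_hat(1 - p_hat)/n)
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
wald_cover = (p_hat - half <= theta) & (theta <= p_hat + half)

# Equal-tailed 95% credible interval under Beta(1, 1)
lo = stats.beta.ppf(0.025, 1 + k, 1 + n - k)
hi = stats.beta.ppf(0.975, 1 + k, 1 + n - k)
bayes_cover = (lo <= theta) & (theta <= hi)
```

Note the Wald interval collapses to a point when $k = 0$ or $k = n$, one source of its undercoverage at small $n$.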


Implementation Challenges

Exercise 20.11 (**)

Bayesian A/B testing dashboard.

Build a function that takes two groups' conversion data and produces a complete Bayesian A/B test report:

  • Posterior distributions for both groups (plot)
  • Posterior for the difference $\delta = \theta_B - \theta_A$ (plot)
  • Posterior for the relative lift $(\theta_B - \theta_A) / \theta_A$ (plot)
  • $P(\theta_B > \theta_A \mid D)$
  • Expected loss of choosing B when A is better: $\mathbb{E}[\max(\theta_A - \theta_B, 0) \mid D]$
  • Expected loss of choosing A when B is better: $\mathbb{E}[\max(\theta_B - \theta_A, 0) \mid D]$

Test with: Group A has 120 conversions out of 1,000 visitors; Group B has 145 conversions out of 1,050 visitors. Use Beta(1, 1) priors.
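A sketch of the non-plotting quantities, assuming NumPy and a fixed seed; the expected-loss estimates follow directly from the posterior samples:

```python
import numpy as np

rng = np.random.default_rng(7)
n_draws = 200_000

theta_a = rng.beta(1 + 120, 1 + 1000 - 120, n_draws)
theta_b = rng.beta(1 + 145, 1 + 1050 - 145, n_draws)

p_b_better = (theta_b > theta_a).mean()
lift = (theta_b - theta_a) / theta_a                      # relative lift samples
loss_if_choose_b = np.maximum(theta_a - theta_b, 0).mean()
loss_if_choose_a = np.maximum(theta_b - theta_a, 0).mean()
```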


Exercise 20.12 (**)

Multi-armed bandit with Beta-Bernoulli arms.

Implement a Thompson sampling agent for a 5-armed bandit problem. The arms' unknown true success rates are 0.1, 0.2, 0.35, 0.5, and 0.7.

(a) Implement the Thompson sampling algorithm using Beta posteriors. Initialize with Beta(1, 1) priors for all arms.

(b) Run for 1,000 rounds. Plot:

  • Cumulative regret over time
  • The posterior mean and 95% credible interval for each arm at every 100th step
  • The fraction of times each arm was selected

(c) Compare with epsilon-greedy ($\epsilon = 0.1$) and UCB1 algorithms. Which converges faster to the optimal arm?
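A sketch of the Thompson sampling loop in parts (a) and (b), assuming NumPy and a fixed seed (regret tracking included, plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.1, 0.2, 0.35, 0.5, 0.7])
a = np.ones(5)                     # Beta(1, 1) prior: successes + 1
b = np.ones(5)                     # Beta(1, 1) prior: failures + 1
pulls = np.zeros(5, dtype=int)
regret = 0.0

for _ in range(1000):
    arm = int(np.argmax(rng.beta(a, b)))  # sample each posterior, play the best
    reward = rng.random() < true_rates[arm]
    a[arm] += reward
    b[arm] += 1 - reward
    pulls[arm] += 1
    regret += true_rates.max() - true_rates[arm]
```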


Exercise 20.13 (***)

StreamRec user preference system at scale.

Extend the UserPreferenceModel from Section 20.11:

(a) Add empirical Bayes prior estimation: given a dataset of all users' interaction histories, estimate the category-level prior parameters $(\alpha_c, \beta_c)$ by matching the population-level mean and variance. Use the method of moments: if the population engagement rate for category $c$ has mean $\bar{p}_c$ and variance $s_c^2$, solve for $\alpha_c$ and $\beta_c$.

(b) Add a decay mechanism: recent interactions should weigh more than old ones. Implement this by discounting the posterior parameters toward the prior over time: after $\Delta t$ days with no interaction, set $\alpha \leftarrow \alpha_0 + \gamma^{\Delta t}(\alpha - \alpha_0)$ and similarly for $\beta$, where $\gamma \in (0, 1)$ is a decay rate.

(c) Simulate 1,000 users over 90 days with drifting preferences (each user's true preference for each category follows a slow random walk). Compare three recommendation strategies:

  • Greedy (always recommend the highest posterior mean)
  • Thompson sampling (sample from posteriors)
  • Thompson sampling with decay

Report total engagement and engagement in the final 30 days for each strategy.


Exercise 20.14 (***)

Normal-Normal model with unknown variance.

The Normal-Normal conjugate model in Section 20.4 assumed $\sigma^2$ is known. In practice, it is not.

(a) Derive the conjugate prior for $(\mu, \sigma^2)$ jointly. Show that the Normal-Inverse-Gamma distribution $\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 / \kappa_0)$, $\sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)$ is conjugate to the Normal likelihood.

(b) Derive the posterior parameters after observing $n$ data points.

(c) Implement the posterior in Python. Generate 30 observations from $\mathcal{N}(5, 4)$ (i.e., $\mu = 5, \sigma^2 = 4$). Use a weakly informative prior: $\mu_0 = 0, \kappa_0 = 0.01, \alpha_0 = 0.01, \beta_0 = 0.01$. Compare the posterior for $\mu$ with the known-variance case.

(d) Show that when $\kappa_0 \to 0$ and $\alpha_0 \to 0$, the marginal posterior for $\mu$ follows a Student-$t$ distribution with $n - 1$ degrees of freedom, centered at $\bar{x}$ with scale $s / \sqrt{n}$ — recovering the frequentist $t$-test.
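For checking the algebra in part (b), the standard Normal-Inverse-Gamma update (a known conjugate result, stated here so your derivation can be verified; the derivation itself is still yours to do) is, with sample mean $\bar{x}$:

$$\kappa_n = \kappa_0 + n, \qquad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{x}}{\kappa_0 + n},$$

$$\alpha_n = \alpha_0 + \frac{n}{2}, \qquad \beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0\, n\, (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}.$$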


Exercise 20.15 (***)

Derive the posterior for the Dirichlet-Multinomial model.

(a) Write the Multinomial likelihood for $n$ observations with counts $\mathbf{k} = (k_1, \ldots, k_K)$ across $K$ categories.

(b) Write the Dirichlet prior density $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$.

(c) Multiply and identify the posterior. Verify that the Dirichlet is conjugate to the Multinomial.

(d) Implement in Python. A language model uses a Dirichlet(1, 1, ..., 1) prior over a vocabulary of size $V = 1000$. After observing 500 words from a document, compute the posterior predictive probability of the most frequent word and the least frequent word. How does the prior (Laplace smoothing) affect the estimates compared to the MLE?
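A sketch of part (d), assuming NumPy; the Zipf-style word distribution is our own illustrative choice, not specified by the exercise:

```python
import numpy as np

rng = np.random.default_rng(5)
V, n_words = 1000, 500

# A Zipf-like document so a few words dominate (illustrative assumption)
weights = 1.0 / np.arange(1, V + 1)
words = rng.choice(V, size=n_words, p=weights / weights.sum())
counts = np.bincount(words, minlength=V)

# Posterior predictive under Dirichlet(1, ..., 1): (k_w + 1) / (n + V)
predictive = (counts + 1) / (n_words + V)
mle = counts / n_words
```

Laplace smoothing gives every unseen word positive probability (the MLE gives zero) and shrinks the most frequent word's estimate toward uniform.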


Prior Selection and Sensitivity

Exercise 20.16 (**)

Prior predictive simulation.

Consider a linear regression model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Assume $x$ ranges from 0 to 10 (e.g., years of experience) and $y$ is salary in thousands of dollars.

(a) Under a "diffuse" prior $\beta_0, \beta_1 \overset{\text{iid}}{\sim} \mathcal{N}(0, 100^2)$ and $\sigma \sim \text{Half-Cauchy}(0, 50)$, simulate 100 regression lines from the prior predictive. Plot them. Are they reasonable?

(b) Under a weakly informative prior $\beta_0 \sim \mathcal{N}(50, 20^2)$, $\beta_1 \sim \mathcal{N}(5, 5^2)$, $\sigma \sim \text{Half-Cauchy}(0, 10)$, repeat the simulation. Are these lines more reasonable?

(c) What percentage of prior predictive samples under each prior produce at least one negative salary prediction in the range $x \in [0, 10]$? Which prior better encodes the constraint that salaries are positive?
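A sketch of the sign check in part (c), assuming NumPy; it tests the prior mean lines only (the noise term is omitted) and uses the fact that a line is negative somewhere on $[0, 10]$ iff it is negative at an endpoint:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims = 1000
x_ends = np.array([0.0, 10.0])

def frac_with_negative_salary(b0_mu, b0_sd, b1_mu, b1_sd):
    b0 = rng.normal(b0_mu, b0_sd, n_sims)
    b1 = rng.normal(b1_mu, b1_sd, n_sims)
    lines = b0[:, None] + b1[:, None] * x_ends   # prior predictive mean lines
    return (lines.min(axis=1) < 0).mean()

diffuse = frac_with_negative_salary(0, 100, 0, 100)   # part (a) prior
weak = frac_with_negative_salary(50, 20, 5, 5)        # part (b) prior
```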


Exercise 20.17 (**)

Prior sensitivity analysis for the pharma example.

Using the MediCore example from Section 20.12:

(a) Compute the posterior under five different priors:

  • Flat: $\mathcal{N}(0, 100^2)$
  • Skeptical: $\mathcal{N}(0, 3^2)$ (prior centered on no effect)
  • Weakly informative: $\mathcal{N}(5, 5^2)$
  • Informative (as in the chapter): $\mathcal{N}(10, 2^2)$
  • Very informative: $\mathcal{N}(10, 0.5^2)$

(b) For each prior, report: posterior mean, posterior SD, 95% credible interval, and $P(\text{effect} > 5 \mid D)$.

(c) At what sample size do the skeptical prior and the informative prior produce posteriors with overlapping 95% credible intervals? (Increase the simulated dataset size until they converge.)

(d) Write a one-paragraph recommendation: which prior would you present to a regulatory body, and why?


Exercise 20.18 (***)

Robust priors: Student-$t$ vs. Normal.

(a) Implement a location-estimation problem where the data contains outliers. Generate 50 observations from $\mathcal{N}(5, 1)$ and add 5 outliers at value 50.

(b) Compute the MLE (sample mean) and the Bayesian posterior mean under:

  • Normal prior: $\mu \sim \mathcal{N}(5, 10^2)$
  • Normal likelihood: $x_i \sim \mathcal{N}(\mu, 1)$

(c) Now use a Student-$t$ likelihood with $\nu = 3$ degrees of freedom (heavy tails) instead of Normal. You cannot use conjugacy — implement a grid approximation of the posterior. Compare the posterior mean to the Normal-likelihood result.

(d) Explain why the heavy-tailed likelihood is more robust to outliers. Connect this to the concept of "M-estimation" from robust statistics.
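A sketch of the grid approximation in parts (b) and (c), assuming NumPy and SciPy with a fixed seed (grid range and resolution are our choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(5, 1, 50), np.full(5, 50.0)])  # 5 outliers at 50

mu_grid = np.linspace(-10, 60, 7001)
log_prior = stats.norm.logpdf(mu_grid, 5, 10)      # mu ~ N(5, 10^2)

log_lik_norm = stats.norm.logpdf(x[:, None], mu_grid, 1).sum(axis=0)
log_lik_t = stats.t.logpdf(x[:, None] - mu_grid, df=3).sum(axis=0)

def grid_posterior_mean(log_lik):
    log_post = log_lik + log_prior
    w = np.exp(log_post - log_post.max())          # stable normalization
    return float((mu_grid * w).sum() / w.sum())

mean_normal = grid_posterior_mean(log_lik_norm)    # dragged toward the outliers
mean_t = grid_posterior_mean(log_lik_t)            # stays near 5
```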


Model Comparison

Exercise 20.19 (**)

Bayes factors for coin fairness.

A coin is flipped 50 times, yielding 30 heads.

(a) Compute the Bayes factor comparing $M_1$: "the coin is fair" ($\theta = 0.5$, point hypothesis) versus $M_2$: "the coin has some unknown bias" ($\theta \sim \text{Beta}(1, 1)$).

Hint: For a point null, $p(D \mid M_1) = \binom{50}{30}(0.5)^{50}$ and $p(D \mid M_2)$ is the Beta-Binomial marginal likelihood.

(b) Repeat for 30/50, 35/50, 40/50, 45/50 heads. Plot the Bayes factor as a function of the number of heads.

(c) At what number of heads does the evidence become "strong" ($\text{BF}_{21} > 10$) against fairness?

(d) Compare with the frequentist approach: at what number of heads does a two-sided binomial test reject $H_0: \theta = 0.5$ at $\alpha = 0.05$?
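A sketch of parts (a) and (b), assuming NumPy and SciPy; working in log space avoids overflow in the binomial coefficients:

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_binom(n, k):
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def bf_21(k, n=50):
    log_m1 = log_binom(n, k) + n * np.log(0.5)            # point null theta = 0.5
    log_m2 = log_binom(n, k) + betaln(1 + k, 1 + n - k)   # Beta(1,1); B(1,1) = 1
    return np.exp(log_m2 - log_m1)

bf_30 = bf_21(30)   # below 1: 30/50 heads mildly favors the fair coin
bf_40 = bf_21(40)   # far above 1: strong evidence of bias
```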


Exercise 20.20 (**)

Marginal likelihood and the Occam's razor effect.

(a) Consider two models for binomial data:

  • $M_1$: $\theta \sim \text{Beta}(10, 10)$ (concentrated around 0.5)
  • $M_2$: $\theta \sim \text{Beta}(1, 1)$ (uniform)

Compute the marginal likelihood $p(D \mid M_k) = \frac{B(\alpha_k + k, \beta_k + n - k)}{B(\alpha_k, \beta_k)} \binom{n}{k}$ for each model.

(b) Generate data from $\theta_{\text{true}} = 0.5$ with $n = 20$. Compute $\text{BF}_{12}$. Repeat for $\theta_{\text{true}} = 0.3$ and $\theta_{\text{true}} = 0.8$. Which model wins in each case?

(c) Explain the result in terms of the Bayesian Occam's razor: the simpler model ($M_1$) concentrates its prior probability on a smaller region but assigns higher probability to data consistent with that region.


Advanced and Research Problems

Exercise 20.21 (***)

Bayesian linear regression from scratch.

Implement Bayesian linear regression with a conjugate prior.

Model: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, with known $\sigma^2$.

Prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$.

(a) Derive the posterior $p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$ as a multivariate Normal. Express the posterior mean $\mathbf{m}_n$ and posterior covariance $\mathbf{S}_n$ in terms of $\mathbf{X}, \mathbf{y}, \sigma^2, \mathbf{m}_0, \mathbf{S}_0$.

(b) Show that when $\mathbf{S}_0 = \tau^2 \mathbf{I}$, the posterior mean equals the ridge regression solution with $\lambda = \sigma^2 / \tau^2$.

(c) Implement the posterior computation in numpy. Generate data from a 5-feature linear model. Plot the posterior marginal distributions for each $\beta_j$ alongside the true values.

(d) Compute the posterior predictive distribution for a new observation $\tilde{x}$: derive $p(\tilde{y} \mid \tilde{x}, D)$ and show it is Normal with mean $\tilde{x}^\top \mathbf{m}_n$ and variance $\sigma^2 + \tilde{x}^\top \mathbf{S}_n \tilde{x}$.
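A sketch of parts (a)-(c), assuming NumPy with a fixed seed and a zero-mean isotropic prior (so $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \tau^2 \mathbf{I}$; the data-generating coefficients are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 200, 5, 1.0
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

tau2 = 10.0
S0_inv = np.eye(d) / tau2

# Posterior: S_n = (S0^{-1} + X^T X / sigma^2)^{-1}, m_n = S_n X^T y / sigma^2
Sn = np.linalg.inv(S0_inv + X.T @ X / sigma2)
mn = Sn @ (X.T @ y / sigma2)

# Part (b) check: ridge with lambda = sigma^2 / tau^2 gives the same mean
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Part (d): predictive mean and variance at a new point
x_new = np.ones(d)
pred_mean = x_new @ mn
pred_var = sigma2 + x_new @ Sn @ x_new
```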


Exercise 20.22 (***)

Empirical Bayes for the StreamRec category priors.

Empirical Bayes estimates hyperparameters from data rather than specifying them. For the StreamRec user-category engagement model:

(a) Simulate data for 1,000 users across 6 categories. Each user's true engagement rate for each category is drawn from $\text{Beta}(\alpha_c^{\text{true}}, \beta_c^{\text{true}})$, with different true hyperparameters per category. Each user has 10-100 interactions per category.

(b) Implement method-of-moments estimation: for each category, compute the sample mean $\bar{p}$ and sample variance $s^2$ of per-user engagement rates, then solve:

$$\hat{\alpha} = \bar{p}\left(\frac{\bar{p}(1 - \bar{p})}{s^2} - 1\right), \quad \hat{\beta} = (1 - \bar{p})\left(\frac{\bar{p}(1 - \bar{p})}{s^2} - 1\right)$$

(c) Implement maximum marginal likelihood estimation: maximize $\sum_{u} \log p(k_u, n_u \mid \alpha, \beta)$ where $p(k, n \mid \alpha, \beta) = \binom{n}{k}\frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}$ using scipy.optimize.

(d) Compare the two estimation methods. Which recovers the true hyperparameters more accurately?


Exercise 20.23 (***)

The Savage-Dickey density ratio for nested model comparison.

The Savage-Dickey density ratio provides a simple way to compute Bayes factors for nested models. If $M_1$ is the null hypothesis $\theta = \theta_0$ nested within $M_2$ (which allows $\theta$ to vary), then:

$$\text{BF}_{12} = \frac{p(\theta_0 \mid D, M_2)}{p(\theta_0 \mid M_2)}$$

That is, the Bayes factor equals the posterior density at $\theta_0$ divided by the prior density at $\theta_0$.

(a) Derive this result from the definition of the Bayes factor.

(b) Apply it to test $\theta = 0.5$ (fair coin) against $\theta \sim \text{Beta}(1, 1)$, with data $k = 30, n = 50$. Verify it matches the direct computation from Exercise 20.19.

(c) Implement a general function savage_dickey_bf(theta_0, posterior_density_at_theta_0, prior_density_at_theta_0) and test it on three different datasets.
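A sketch of part (b), assuming NumPy and SciPy; the two routes should agree to numerical precision:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

k, n, theta0 = 30, 50, 0.5

# Savage-Dickey: BF12 = posterior density / prior density, both at theta0
prior_at_0 = stats.beta.pdf(theta0, 1, 1)               # uniform prior: 1
post_at_0 = stats.beta.pdf(theta0, 1 + k, 1 + n - k)
bf_12_sd = post_at_0 / prior_at_0

# Direct route (Exercise 20.19): m1 = C(n,k) 0.5^n, m2 = 1/(n+1) under Beta(1,1)
log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
m1 = np.exp(log_binom + n * np.log(0.5))
m2 = 1 / (n + 1)
bf_12_direct = m1 / m2
```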


Exercise 20.24 (****)

Bayesian inference in high dimensions: when the posterior concentrates off the MAP.

In high-dimensional parameter spaces, the posterior mode (MAP) can be unrepresentative of the posterior distribution, because the high-density region occupies negligible volume.

(a) Consider a $d$-dimensional isotropic Gaussian posterior: $\boldsymbol{\theta} \mid D \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_d)$. The mode is at $\mathbf{0}$. Show that the squared distance $\|\boldsymbol{\theta}\|^2$ follows a $\sigma^2 \chi^2_d$ distribution, with mean $d\sigma^2$ and standard deviation $\sigma^2\sqrt{2d}$.

(b) For $d = 100$ and $\sigma = 1$, compute $P(\|\boldsymbol{\theta}\| \leq 1 \mid D)$ — the probability of being "near" the mode. (Hint: use the $\chi^2$ CDF.)

(c) Plot $P(\|\boldsymbol{\theta}\| \leq 1 \mid D)$ as a function of $d$ from 1 to 1,000. At what dimension does this probability become less than $10^{-10}$?

(d) Explain why this phenomenon means that MAP estimation becomes increasingly misleading as the number of parameters grows, and why sampling-based methods (MCMC) provide a more faithful representation of the posterior.
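A sketch of parts (b) and (c), assuming SciPy:

```python
import numpy as np
from scipy import stats

# With sigma = 1, ||theta||^2 ~ chi^2_d, so P(||theta|| <= 1) = chi2.cdf(1, d)
p_d1 = stats.chi2.cdf(1.0, df=1)        # roughly 0.68 in one dimension
p_d100 = stats.chi2.cdf(1.0, df=100)    # vanishingly small

ds = np.arange(1, 1001)
p_near_mode = stats.chi2.cdf(1.0, df=ds)
first_d = int(ds[np.argmax(p_near_mode < 1e-10)])  # first d below 1e-10
```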


Exercise 20.25 (****)

Bayesian nonparametric taste: the Beta process.

The Beta-Binomial model assumes a fixed number of categories. In reality, StreamRec adds new content categories over time.

(a) Read about the Indian Buffet Process (IBP) as a prior over binary matrices with an unknown number of features. Summarize the generative process in your own words.

(b) Implement a simplified version: a "streaming category" model where:

  • New categories can appear at any time
  • Each user's preference for a new category starts with the population prior
  • The number of categories grows logarithmically with the number of items (a property of the IBP)

(c) Simulate 10,000 items arriving one at a time, where new categories appear with probability $\alpha / (n + \alpha)$ (the CRP-like rule) with $\alpha = 5$. Track the number of categories over time and compare with $\alpha \log(n)$.

(d) Discuss: when would a Bayesian nonparametric model for categories be preferable to simply updating the UserPreferenceModel with a fixed category set? What are the computational costs?
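A sketch of the simulation in part (c), assuming NumPy and a fixed seed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_items = 5.0, 10_000

n_categories = 0
growth = np.empty(n_items)
for n in range(n_items):
    if rng.random() < alpha / (n + alpha):   # CRP-like new-category probability
        n_categories += 1
    growth[n] = n_categories

# Expected count is sum_n alpha/(n + alpha), approximately alpha * log(1 + n/alpha)
expected = alpha * np.log(1 + n_items / alpha)
```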


Application Exercises

Exercise 20.26 (**)

Bayesian reasoning about rare events.

A spam filter flags emails as spam based on certain keywords. The overall spam rate is 1% (prior). A particular keyword appears in 90% of spam emails and 5% of non-spam emails.

(a) Using Bayes' theorem, compute the probability that an email containing this keyword is spam.

(b) Now suppose the spam rate increases to 10%. Recompute.

(c) Plot $P(\text{spam} \mid \text{keyword present})$ as a function of the base rate $P(\text{spam})$ from 0.001 to 0.5. At what base rate does the keyword become "more likely spam than not"?

(d) This is the "base rate fallacy" — explain why ignoring the base rate leads to poor classification decisions. Connect to precision and recall.
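Parts (a)-(c) follow from one function; a sketch (the parameter names are ours):

```python
def p_spam_given_keyword(base_rate, p_kw_spam=0.90, p_kw_ham=0.05):
    # Bayes' theorem with the keyword as evidence
    numerator = p_kw_spam * base_rate
    return numerator / (numerator + p_kw_ham * (1 - base_rate))

p_at_1pct = p_spam_given_keyword(0.01)    # part (a)
p_at_10pct = p_spam_given_keyword(0.10)   # part (b)
break_even = 0.05 / 0.95                  # base rate where the posterior hits 0.5
```

Setting the posterior to 0.5 and solving gives the break-even base rate $0.05/0.95 \approx 5.3\%$: below it, a flagged email is still more likely ham than spam.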


Exercise 20.27 (**)

Bayesian updating in a clinical trial.

A clinical trial enrolls patients sequentially. At each interim analysis (every 20 patients), the trial team updates the posterior on the treatment effect and checks whether to stop early.

(a) Implement a sequential updating simulation. True treatment effect: $\theta_{\text{true}} = 2.0$ mmHg reduction. Each patient's outcome is $x_i \sim \mathcal{N}(\theta_{\text{true}}, 10^2)$. Prior: $\theta \sim \mathcal{N}(0, 5^2)$.

(b) After each interim analysis, compute $P(\theta > 0 \mid D)$. Stop the trial if this exceeds 0.99 (efficacy) or if $P(\theta < 0 \mid D) > 0.95$ (futility).

(c) Simulate 1,000 trials. Report: (i) the average number of patients enrolled before stopping, (ii) the percentage of trials that correctly conclude efficacy, (iii) the percentage that incorrectly conclude futility.

(d) Compare with a fixed-sample design that enrolls 200 patients and tests at $\alpha = 0.05$. Which design is more efficient?
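A sketch of one simulated trial for parts (a) and (b), assuming NumPy and SciPy with a fixed seed; part (c) wraps this loop in an outer loop over 1,000 trials:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
theta_true, sigma = 2.0, 10.0
mu, tau2 = 0.0, 5.0**2            # prior N(0, 5^2)

enrolled, decision = 0, None
while enrolled < 200 and decision is None:
    x = rng.normal(theta_true, sigma, size=20)    # next interim cohort
    post_prec = 1 / tau2 + len(x) / sigma**2      # Normal-Normal update
    mu = (mu / tau2 + x.sum() / sigma**2) / post_prec
    tau2 = 1 / post_prec
    enrolled += 20
    p_effective = 1 - stats.norm.cdf(0, loc=mu, scale=np.sqrt(tau2))
    if p_effective > 0.99:
        decision = "efficacy"
    elif 1 - p_effective > 0.95:
        decision = "futility"
```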


Exercise 20.28 (***)

Bayesian change-point detection.

StreamRec's daily active users (DAU) follows a normal distribution, but the mean may shift at an unknown change point.

(a) Implement a Bayesian change-point model for a single change point. Prior on the change-point location: uniform over $\{1, \ldots, T\}$. Prior on the means before and after the change point: $\mu_1, \mu_2 \overset{\text{iid}}{\sim} \mathcal{N}(0, 100^2)$. Known $\sigma^2 = 25$.

(b) Compute the posterior distribution over the change-point location using a grid computation (enumerate all possible change points and compute the marginal likelihood for each).

(c) Generate synthetic data: $T = 100$ days, $\mu_1 = 50$ for days 1-60, $\mu_2 = 55$ for days 61-100, $\sigma = 5$. Run your detector. Plot the posterior over the change-point location.

(d) How does the posterior change if the shift is smaller ($\mu_2 = 51$)? At what shift size does the posterior become approximately uniform (no detectable change)?
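A sketch of parts (a)-(c), assuming NumPy and a fixed seed; the per-segment marginal likelihood integrates the unknown mean out analytically:

```python
import numpy as np

rng = np.random.default_rng(9)
T, sigma2, s0_2 = 100, 25.0, 100.0**2
x = np.concatenate([rng.normal(50, 5, 60), rng.normal(55, 5, 40)])

def log_marginal(seg):
    # seg | mu ~ N(mu, sigma2), mu ~ N(0, s0_2); mu integrated out analytically
    n, s, ss = len(seg), seg.sum(), (seg ** 2).sum()
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - 0.5 * np.log(1 + n * s0_2 / sigma2)
            - ss / (2 * sigma2)
            + s0_2 * s ** 2 / (2 * sigma2 * (sigma2 + n * s0_2)))

# Uniform prior over split points: change after day tau, tau = 1..T-1
log_post = np.array([log_marginal(x[:tau]) + log_marginal(x[tau:])
                     for tau in range(1, T)])
post = np.exp(log_post - log_post.max())
post /= post.sum()
tau_map = 1 + int(np.argmax(post))
```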


Exercise 20.29 (***)

Bayesian calibration of a machine learning model.

A neural network produces predicted probabilities $\hat{p}_i$ for binary classification. These predictions may not be calibrated (i.e., $P(Y = 1 \mid \hat{p} = 0.7) \neq 0.7$).

(a) Implement Bayesian calibration using a Beta-Binomial model. Bin the predictions into 10 bins by $\hat{p}$. For each bin, use a Beta(1, 1) prior and update with the observed hit rate.

(b) Generate synthetic uncalibrated predictions: $\hat{p}_i \sim \text{Uniform}(0, 1)$, $y_i \sim \text{Bernoulli}(\text{logistic}(3\hat{p}_i - 1.5))$. This creates a model that is overconfident for low predictions and underconfident for high predictions.

(c) Plot the reliability diagram using both the raw frequencies and the Bayesian posterior means per bin. How do the Bayesian estimates differ from the raw frequencies, especially for bins with few samples?

(d) Extend to continuous calibration using Platt scaling as a Bayesian logistic regression: $P(Y = 1 \mid \hat{p}) = \sigma(a \hat{p} + b)$ with priors on $a$ and $b$. Implement using grid approximation.
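A sketch of parts (a) and (b), assuming NumPy and a fixed seed (the sample size is our choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
p_hat = rng.random(n)                               # raw model scores
true_prob = 1 / (1 + np.exp(-(3 * p_hat - 1.5)))    # logistic(3 p_hat - 1.5)
y = rng.random(n) < true_prob

bins = np.minimum((p_hat * 10).astype(int), 9)      # 10 equal-width bins
raw_rate = np.empty(10)
bayes_rate = np.empty(10)
for b in range(10):
    mask = bins == b
    k, m = y[mask].sum(), mask.sum()
    raw_rate[b] = k / m                 # observed hit rate
    bayes_rate[b] = (1 + k) / (2 + m)   # Beta(1, 1) posterior mean
```

The Beta posterior means shrink each bin toward 0.5; the effect is largest in sparsely populated bins.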


Exercise 20.30 (****)

The complete Bayesian workflow for StreamRec.

This capstone exercise combines all elements from the chapter.

(a) Design the complete generative model for StreamRec user engagement:

  • Hierarchical: population → category → user → interaction
  • Each user has a latent preference vector $\boldsymbol{\theta}_u = (\theta_{u,1}, \ldots, \theta_{u,C})$
  • Each $\theta_{u,c}$ has a category-level Beta prior with hyperparameters $(\alpha_c, \beta_c)$
  • The hyperparameters themselves have hyperpriors: $\alpha_c, \beta_c \sim \text{Gamma}(2, 0.5)$

(b) Implement prior predictive checks: simulate 1,000 users from the full generative model (sampling hyperparameters, then user-level parameters, then data). Do the simulated engagement rates look plausible?

(c) Generate a synthetic dataset: 500 users, 6 categories, 10-200 interactions per user per category. Fit the model in two ways:

  • Empirical Bayes (point-estimate hyperparameters, then conjugate update)
  • Grid approximation of the full hierarchical posterior (discretize the hyperparameter space)

(d) Run posterior predictive checks: for each approach, simulate new data from the fitted model and compare with the observed data distribution.

(e) Implement Thompson sampling using both the empirical Bayes and hierarchical posteriors. Simulate 10,000 recommendations for a new user. Which approach explores more? Which achieves higher cumulative engagement?

(f) Write a one-page design document for deploying this system at StreamRec, covering: storage requirements, update latency, exploration strategy, prior estimation cadence, and monitoring.