Chapter 21: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


PyMC Fundamentals

Exercise 21.1 (*)

Build a PyMC model for Bayesian linear regression on the following simulated dataset:

import numpy as np
rng = np.random.default_rng(123)
n = 150
x = rng.normal(0, 1, size=n)
y = 3.0 + 2.5 * x + rng.normal(0, 1.2, size=n)

(a) Specify weakly informative priors: $\alpha \sim \mathcal{N}(0, 10)$, $\beta \sim \mathcal{N}(0, 10)$, $\sigma \sim \text{HalfNormal}(5)$.

(b) Sample 2,000 draws with 1,000 tuning steps and 4 chains.

(c) Report the posterior mean and 94% HDI for all three parameters. How close are they to the true values?

(d) Plot the trace and verify convergence visually.


Exercise 21.2 (*)

Extend Exercise 21.1 to a multiple regression with three predictors:

rng = np.random.default_rng(456)
n = 200
X = rng.normal(0, 1, size=(n, 3))
true_betas = np.array([1.5, -0.8, 0.3])
y = 2.0 + X @ true_betas + rng.normal(0, 0.9, size=n)

(a) Build a PyMC model with a shared prior $\beta_k \sim \mathcal{N}(0, 5)$ for each coefficient.

(b) Use az.summary() to report all posterior summaries.

(c) Which coefficient has the widest 94% HDI? Why might that be?


Exercise 21.3 (*)

Build a PyMC model for Bayesian Poisson regression. A hospital records daily emergency room visits for 90 days:

rng = np.random.default_rng(789)
n = 90
temperature = rng.normal(75, 10, size=n)  # Fahrenheit
is_weekend = rng.binomial(1, 2/7, size=n)
true_lambda = np.exp(2.5 + 0.01 * temperature + 0.3 * is_weekend)
visits = rng.poisson(true_lambda)

(a) Specify the model with weakly informative priors on all coefficients.

(b) Sample and report posterior summaries.

(c) Compute the posterior predictive distribution for a weekday at 85$^\circ$F and a weekend at 85$^\circ$F. What is the posterior mean difference?


MCMC Diagnostics

Exercise 21.4 (*)

For the linear regression model from Exercise 21.1:

(a) Extract $\hat{R}$ for all parameters. Are any above 1.01?

(b) Extract ESS bulk and ESS tail. Compute the sampling efficiency (ESS / total draws).

(c) Check for divergences. Report the count.

(d) Write a single function run_full_diagnostics(trace) that prints a pass/fail summary for all three checks.
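One possible shape for the function in part (d), demonstrated on a synthetic well-behaved `InferenceData` built with `az.from_dict` so the sketch runs without any sampling; the thresholds (1.01, 400) are common rules of thumb, not mandates:

```python
import numpy as np
import arviz as az

def run_full_diagnostics(trace, rhat_max=1.01, ess_min=400):
    """Print a pass/fail summary for R-hat, ESS (bulk), and divergences."""
    rhat = max(float(da.max()) for da in az.rhat(trace).data_vars.values())
    ess = min(float(da.min()) for da in az.ess(trace).data_vars.values())
    ndiv = int(trace.sample_stats["diverging"].values.sum())
    results = {
        "rhat_ok": rhat <= rhat_max,
        "ess_ok": ess >= ess_min,
        "divergences_ok": ndiv == 0,
    }
    print(f"max R-hat   = {rhat:.4f}  [{'PASS' if results['rhat_ok'] else 'FAIL'}]")
    print(f"min ESS     = {ess:.0f}  [{'PASS' if results['ess_ok'] else 'FAIL'}]")
    print(f"divergences = {ndiv}  [{'PASS' if results['divergences_ok'] else 'FAIL'}]")
    return results

# Synthetic "trace": 4 chains x 1000 draws of white noise, no divergences
rng = np.random.default_rng(0)
idata = az.from_dict(
    posterior={"alpha": rng.normal(size=(4, 1000))},
    sample_stats={"diverging": np.zeros((4, 1000), dtype=bool)},
)
checks = run_full_diagnostics(idata)
```

For part (b), `az.ess(trace, method="tail")` gives the tail ESS alongside the bulk value used here.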


Exercise 21.5 (**)

Deliberately create a poorly fitting model to observe diagnostic failures:

(a) Fit a centered-parameterization hierarchical model to the eight schools data:

import numpy as np
import pymc as pm

# Eight schools data (Rubin, 1981)
y_obs = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 10.0])

with pm.Model() as centered_eight_schools:
    mu = pm.Normal("mu", mu=0, sigma=20)
    tau = pm.HalfCauchy("tau", beta=10)
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)
    y = pm.Normal("y", mu=theta, sigma=sigma, observed=y_obs)
    trace_centered = pm.sample(draws=2000, tune=1000, chains=4, random_seed=42,
                                return_inferencedata=True)

(b) Count the number of divergences. Where do they concentrate in parameter space? (Hint: use az.plot_pair(trace_centered, var_names=["tau", "theta"], divergences=True).)

(c) Refit with the non-centered parameterization. Verify that divergences disappear.

(d) Compare the posterior summaries from both parameterizations. Do the divergences bias the centered model's estimates?


Exercise 21.6 (**)

Investigate the effect of target_accept on MCMC performance:

(a) Fit the eight schools model (non-centered) with target_accept values of 0.65, 0.80, 0.90, 0.95, and 0.99.

(b) For each setting, record: number of divergences, minimum ESS bulk, total sampling time.

(c) Plot ESS vs. sampling time for each setting. Is there a sweet spot that balances efficiency and reliability?

(d) What happens at target_accept=0.99? Why does extreme caution carry a cost?


Bayesian Workflow

Exercise 21.7 (*)

Perform a prior predictive check for the Poisson regression model from Exercise 21.3.

(a) Sample 1,000 datasets from the prior predictive distribution.

(b) Plot the distribution of predicted daily visit counts. What is the range? Is it plausible?

(c) If the prior predictive produces implausible values (e.g., > 10,000 visits/day), tighten the priors and repeat. Document your reasoning.


Exercise 21.8 (**)

Perform the complete Bayesian workflow for a Beta-Binomial model applied to website conversion rates.

A website has 5 landing pages. Over one week, the data are:

Page   Visitors   Conversions
A         2,340           187
B         1,120           112
C           450            63
D           180            32
E            55            14

(a) Prior predictive check: Use $\theta_j \sim \text{Beta}(2, 20)$ as the prior (mean 9.1%). Generate prior predictive samples and verify they are plausible for website conversion.

(b) Fit: Implement a hierarchical model where the five pages share a population-level Beta distribution.

(c) Diagnostics: Check $\hat{R}$, ESS, and divergences.

(d) Posterior predictive check: Simulate replicated data from the posterior predictive. Compare the mean and variance of conversion counts to the observed data.

(e) Model comparison: Compare the hierarchical model with a no-pooling model using LOO. Which is preferred?


Exercise 21.9 (**)

Investigate prior sensitivity for the MediCore pharma hierarchical model:

(a) Fit the model three times with different priors on $\tau$ (between-hospital SD):

- $\tau \sim \text{HalfCauchy}(1)$ (tight)
- $\tau \sim \text{HalfCauchy}(10)$ (moderate)
- $\tau \sim \text{HalfCauchy}(50)$ (diffuse)

(b) Compare the posterior distribution of $\mu_{\text{pop}}$ (population mean treatment effect) under each prior. Plot all three posteriors on the same axis.

(c) Does the choice of prior on $\tau$ meaningfully affect the population mean estimate? What about individual hospital estimates?

(d) Write a one-paragraph summary suitable for a regulatory submission explaining the sensitivity analysis results.


Hierarchical Models

Exercise 21.10 (*)

A school district is evaluating reading scores across 15 elementary schools. Each school tested between 20 and 300 students:

rng = np.random.default_rng(101)
n_schools = 15
true_pop_mean = 72.0
true_pop_sd = 8.0
true_means = rng.normal(true_pop_mean, true_pop_sd, size=n_schools)
sample_sizes = rng.choice([20, 35, 50, 80, 120, 200, 300], size=n_schools)
school_means = np.array([rng.normal(true_means[j], 15.0 / np.sqrt(sample_sizes[j]))
                         for j in range(n_schools)])
school_ses = 15.0 / np.sqrt(sample_sizes)

(a) Fit a hierarchical Normal model with non-centered parameterization.

(b) Identify the school with the most shrinkage and the school with the least. Explain why.

(c) A school board member asks: "Should we reward the school with the highest observed average?" Using the hierarchical posterior, compute the probability that each school truly has the highest mean. How does this ranking differ from the raw-score ranking?
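Once you have posterior draws of the school means, the probability in part (c) is just the fraction of draws in which each school is the argmax. A sketch on stand-in draws (in the real exercise, `theta_draws` would come from flattening `trace.posterior["theta"]` over chains and draws):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_schools = 4000, 15
# Stand-in for posterior draws of the 15 school means, shape (draws, schools)
theta_draws = rng.normal(72, 8, size=(n_draws, n_schools))

best = theta_draws.argmax(axis=1)                      # winning school per draw
p_highest = np.bincount(best, minlength=n_schools) / n_draws

ranking = np.argsort(p_highest)[::-1]
for j in ranking[:3]:
    print(f"school {j}: P(truly highest) = {p_highest[j]:.3f}")
```

Comparing this posterior ranking with the raw-score ranking makes the board member's question concrete: a small school with a lucky sample can top the raw list while having a modest probability of truly being best.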


Exercise 21.11 (**)

Extend the school reading score model to include a school-level predictor: the percentage of students receiving free/reduced lunch (a proxy for socioeconomic status).

(a) Add a regression component: $\theta_j = \mu + \beta \cdot \text{lunch\_pct}_j + \tau \cdot \eta_j$.

(b) Fit the model and interpret $\beta$. Is there a credible association between lunch percentage and reading scores?

(c) Compare this model with the intercept-only hierarchical model using LOO. Does the predictor improve out-of-sample prediction?


Exercise 21.12 (**)

Implement a hierarchical model with varying slopes (not just varying intercepts).

A retail chain has 20 stores. For each store, you have monthly sales data over 24 months. You want to estimate the effect of advertising spend on sales, allowing both the intercept (baseline sales) and the slope (advertising effectiveness) to vary by store.

(a) Specify the model with group-level priors on both the intercept and the slope. Use a shared multivariate normal prior: $(\alpha_j, \beta_j) \sim \text{MvNormal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with an LKJ prior on the correlation matrix.

(b) Fit the model and check diagnostics.

(c) Extract and visualize the store-specific slopes. Which stores show the strongest advertising response?

(d) Compare this model with a varying-intercept-only model using LOO. Does allowing varying slopes improve prediction?


Exercise 21.13 (***)

The StreamRec hierarchical Beta-Binomial model from Section 21.9 uses the $(\mu, \kappa)$ parameterization. Implement two alternative parameterizations and compare:

(a) Direct $(\alpha, \beta)$ parameterization: Place priors directly on $\alpha$ and $\beta$ of the population Beta distribution. Use $\alpha \sim \text{Exponential}(1)$ and $\beta \sim \text{Exponential}(1)$.

(b) Logit-normal parameterization: Instead of $\theta_j \sim \text{Beta}(\alpha, \beta)$, use $\text{logit}(\theta_j) \sim \mathcal{N}(\mu, \sigma^2)$ with appropriate priors on $\mu$ and $\sigma$.

(c) Compare all three parameterizations in terms of: number of divergences, ESS, posterior summaries for $\theta_j$, and LOO scores.

(d) Which parameterization would you recommend for production use? Justify your answer.


Model Comparison

Exercise 21.14 (*)

Using the eight schools data, compare the following three models using both WAIC and LOO:

  1. Complete pooling: $y_j \sim \mathcal{N}(\mu, \sigma_j)$
  2. No pooling: $y_j \sim \mathcal{N}(\theta_j, \sigma_j)$ with independent priors on each $\theta_j$
  3. Hierarchical: the non-centered model from Section 21.6

(a) Report WAIC and LOO for each model.

(b) Do WAIC and LOO agree on the ranking?

(c) Check the Pareto $\hat{k}$ values for the LOO estimates. Are any observations problematic ($\hat{k} > 0.7$)?


Exercise 21.15 (**)

Implement model comparison for nested regression models:

rng = np.random.default_rng(202)
n = 300
x1 = rng.normal(0, 1, size=n)
x2 = rng.normal(0, 1, size=n)
x3 = rng.normal(0, 1, size=n)  # Irrelevant predictor
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1.5, size=n)

(a) Fit four models: intercept-only, $x_1$ only, $x_1 + x_2$, $x_1 + x_2 + x_3$.

(b) Compare using LOO. Does the model comparison correctly identify the true data-generating model (2 predictors)?

(c) What happens if you add 10 irrelevant predictors? Does LOO still select the correct model?


Exercise 21.16 (**)

Compare a Bayesian hierarchical model with a frequentist mixed-effects model (using statsmodels or lme4 via rpy2) on the MediCore hospital data:

(a) Fit a frequentist random-effects meta-analysis using the DerSimonian-Laird estimator.

(b) Fit the Bayesian hierarchical model from Section 21.7.

(c) Compare: population mean estimate, between-hospital variance, individual hospital estimates. How close are the two approaches?

(d) Where do the approaches diverge? (Hint: look at the small hospitals and the uncertainty on $\tau$.)


Applied Problems

Exercise 21.17 (*)

A mobile app runs 8 pricing experiments simultaneously across 8 markets. Each market has a different sample size and a different observed conversion rate under the new price. Build a hierarchical model to estimate the true conversion rate in each market, sharing information across markets.

Market      Users   Conversions
US         12,000           840
UK          8,500           510
Germany     3,200           224
France      2,800           196
Japan       1,500           135
Brazil        800            72
India         400            40
Australia     150            18

(a) Fit a hierarchical Beta-Binomial model.

(b) Report the posterior mean and 94% HDI for each market.

(c) Which market's estimate changes the most relative to its raw rate? Which changes the least? Explain.


Exercise 21.18 (**)

A pharmaceutical company is running a multi-site clinical trial for a new diabetes drug. There are 6 sites, each with a different number of patients. The outcome is HbA1c reduction after 12 weeks.

Build a hierarchical Bayesian model that:

(a) Estimates the population-level treatment effect and its 95% credible interval.

(b) Estimates each site's treatment effect.

(c) Computes the posterior probability that the treatment effect exceeds the clinically meaningful threshold of 0.5% HbA1c reduction.

(d) Performs a prior sensitivity analysis: compare results under an informative prior (based on prior studies of similar drugs) and a weakly informative prior. Summarize the findings.


Exercise 21.19 (**)

Climate application. The climate team has temperature anomaly data for 10 ocean basins. Each basin has a different number of measurement stations and different temporal coverage.

(a) Build a hierarchical model for ocean warming trends ($^\circ$C per decade) across basins.

(b) Perform the full Bayesian workflow: prior predictive check, fit, diagnostics, posterior predictive check.

(c) Identify the basin with the highest posterior probability of a warming trend exceeding 0.25$^\circ$C per decade.

(d) Compare the hierarchical estimates with OLS estimates per basin. For which basins does the hierarchical model change the conclusion about whether warming is statistically significant?


Exercise 21.20 (***)

StreamRec extension: time-varying hierarchical model.

Category engagement rates on StreamRec are not static — they vary over time due to content releases, seasons, and trends. Extend the hierarchical Beta-Binomial model from Section 21.9 to allow time-varying rates.

(a) Simulate data with 25 categories over 12 monthly periods, where each category's true rate follows a random walk: $\text{logit}(\theta_{j,t}) = \text{logit}(\theta_{j,t-1}) + \epsilon_{j,t}$ with $\epsilon_{j,t} \sim \mathcal{N}(0, \sigma_{\text{walk}}^2)$.

(b) Build a PyMC model with a hierarchical structure across categories and a Gaussian random walk over time.

(c) Fit the model and check diagnostics. (Hint: this model is large. You may need target_accept=0.99 and extended tuning.)

(d) Visualize the posterior trajectories for three categories: one large, one medium, one small. How does the uncertainty change over time and across sample sizes?


Exercise 21.21 (***)

Bayesian A/B testing. Implement a complete Bayesian A/B testing framework.

(a) Define a PyMC model for comparing two conversion rates ($\theta_A$ and $\theta_B$) using Beta priors.

(b) Compute the following from the posterior:

- $P(\theta_B > \theta_A \mid D)$ (probability that B is better)
- Expected loss of choosing B if A is actually better
- Expected lift: $E[(\theta_B - \theta_A) / \theta_A \mid D]$

(c) Implement a decision rule: "Ship B if $P(\theta_B > \theta_A \mid D) > 0.95$ AND expected loss < 0.1%."

(d) Simulate 1,000 experiments with known true effects ranging from $-5\%$ to $+5\%$. Plot the empirical "decision accuracy" (fraction of correct ship/no-ship decisions) as a function of the true effect size. Compare with a frequentist z-test at $\alpha = 0.05$.


Exercise 21.22 (***)

Implement variational inference as an alternative to MCMC for the eight schools model.

(a) Use pm.fit(method="advi") to fit the non-centered eight schools model with Automatic Differentiation Variational Inference.

(b) Compare the ADVI posterior approximation with the NUTS posterior: plot the marginal posteriors for $\mu$ and $\tau$ from both methods.

(c) Compute the KL divergence between the ADVI approximate posterior and the NUTS posterior (use kernel density estimates on the NUTS samples).

(d) Where does ADVI fail to capture the true posterior shape? (Hint: look at $\tau$ — it has a skewed posterior that the mean-field Gaussian approximation cannot represent.)

(e) Time both methods. How much faster is ADVI? Is the speed-accuracy tradeoff acceptable for this problem?


Exercise 21.23 (***)

Hierarchical logistic regression. MediCore wants to model the probability of treatment response (binary: responded or not) across hospitals, controlling for patient-level covariates (age, BMI, baseline blood pressure).

(a) Simulate data: 12 hospitals, 50-500 patients each, with hospital-varying intercepts and a shared slope for age.

(b) Build a hierarchical logistic regression model in PyMC:

$$\text{logit}(p_{ij}) = \alpha_j + \beta_1 \cdot \text{age}_{ij} + \beta_2 \cdot \text{BMI}_{ij}$$

where $\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha)$ (non-centered).

(c) Fit and run full diagnostics.

(d) Compute the predicted probability of response for a 60-year-old patient with BMI 28 at each hospital. Report the posterior mean and 94% HDI for each.

(e) Compare with a frequentist mixed-effects logistic regression (e.g., glmer from lme4 via rpy2; note that statsmodels' MixedLM fits linear, not logistic, mixed models). Do the hospital rankings agree?


Research-Level Problems

Exercise 21.24 (****)

Posterior calibration. A Bayesian model's credible intervals should be calibrated: a 95% credible interval should contain the true parameter 95% of the time across repeated experiments (under the model's own data-generating process). This is the "frequentist coverage of Bayesian intervals."

(a) For the hierarchical eight schools model, run a simulation-based calibration (SBC) study:

  1. Draw true parameters from the prior.
  2. Simulate data from the likelihood.
  3. Fit the model and compute the rank of the true parameter within the posterior samples.
  4. Repeat 500 times. The ranks should be uniformly distributed if the model is calibrated.

(b) Plot the rank histogram for $\mu$, $\tau$, and one $\theta_j$. Are the distributions uniform?

(c) If the rank histogram is non-uniform, what does that indicate? (Hint: consider computational issues like insufficient warm-up or ADVI approximation errors.)


Exercise 21.25 (****)

Scalability. MCMC does not scale trivially to large datasets. Investigate two approaches:

(a) Minibatch MCMC: Use pm.Minibatch to fit the StreamRec model with subsampled data. Compare posterior estimates with the full-data MCMC run for varying minibatch sizes (100, 500, 1000, 5000). At what point do the estimates diverge significantly?

(b) Variational inference: Fit the StreamRec hierarchical model using ADVI. Compare speed, posterior means, and posterior standard deviations with the NUTS results.

(c) Amortized inference: Describe (conceptually, no implementation required) how you would use a neural network to approximate the posterior mapping $D \to q(\theta)$ for the StreamRec model, enabling instant posterior estimation for new category data without re-running MCMC. What are the key design choices?


Exercise 21.26 (****)

Multi-level hierarchical model. Extend the MediCore model to three levels: treatment effects vary by patient within hospital within country.

(a) Design the model: write the full mathematical specification with all three levels.

(b) Implement in PyMC with non-centered parameterizations at every level.

(c) Fit with simulated data (5 countries, 3-8 hospitals per country, 30-200 patients per hospital). Track wall-clock time and convergence.

(d) Compare the three-level model with a two-level (hospital-only) model using LOO. Does the country level add predictive value?


Exercise 21.27 (****)

Prior elicitation. A MediCore clinical expert says: "I believe the treatment effect is most likely between 6 and 12 mmHg, and I would be very surprised if it were negative or above 20." Translate this into a formal prior.

(a) Fit a Normal distribution to the expert's beliefs using the MATCH method (Morris et al., 2014): set the median to 9, the 5th percentile to ~2, and the 95th percentile to ~16. Derive the corresponding $\mu$ and $\sigma$.

(b) Fit a t-distribution (heavier tails) to the same quantiles.

(c) Fit the hierarchical model under both priors and compare posterior estimates. Does the choice of prior distribution family (Normal vs. t) matter?

(d) Present the results in a format suitable for a regulatory submission: report the prior, data, posterior, and sensitivity analysis in a single table.
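Part (a) is a small computation: for a symmetric Normal the elicited median gives $\mu$ directly, and the distance from the median to the 95th percentile, divided by $z_{0.95} \approx 1.645$, gives $\sigma$. A SciPy sketch:

```python
import numpy as np
from scipy import stats

median, q05, q95 = 9.0, 2.0, 16.0          # elicited quantiles (mmHg)

# Normal: mu is the median; sigma from the 95th percentile
z95 = stats.norm.ppf(0.95)                 # ~1.645
mu = median
sigma = (q95 - mu) / z95                   # ~4.26 mmHg

# Sanity checks against the expert's statements
print(f"mu = {mu:.2f}, sigma = {sigma:.3f}")
print(f"implied 5th pct = {stats.norm.ppf(0.05, mu, sigma):.2f}")   # should be ~2
print(f"P(effect < 0)   = {stats.norm.cdf(0, mu, sigma):.3f}")
print(f"P(effect > 20)  = {1 - stats.norm.cdf(20, mu, sigma):.3f}")
```

The last two lines check the "very surprised" statements: both tail probabilities come out around 1-2%, which is arguably consistent with the elicitation. For part (b), the same quantile matching with `stats.t.ppf` (for a chosen degrees of freedom) yields the heavier-tailed alternative.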