Chapter 21: Quiz

Test your understanding of practical Bayesian modeling. Answers follow each question.


Question 1

What problem does MCMC solve that conjugate models (Chapter 20) do not?

Answer MCMC solves the problem of computing posteriors when the integral $p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta$ is intractable — that is, when no closed-form posterior exists. Conjugate models have closed-form posteriors by construction, but most real-world models (hierarchical models, logistic regression, models with non-conjugate priors) do not. MCMC generates samples from the posterior without computing the normalizing constant, because the Metropolis-Hastings acceptance ratio cancels $p(D)$.

Question 2

Explain in one sentence how the Metropolis-Hastings algorithm avoids computing the intractable normalizing constant $p(D)$.

Answer The acceptance ratio is $\alpha = p(\theta^* \mid D) / p(\theta_t \mid D)$, which equals $p(D \mid \theta^*) p(\theta^*) / [p(D \mid \theta_t) p(\theta_t)]$ — the $p(D)$ terms in the numerator and denominator cancel, so only the unnormalized posterior (likelihood times prior) is needed.
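This cancellation can be demonstrated in a few lines of pure Python. Below is a minimal random-walk Metropolis sampler for a coin's bias under a flat prior; the data (6 heads in 9 tosses) and tuning values are invented for illustration, and the sampler only ever evaluates the unnormalized log-posterior:

```python
import math
import random

random.seed(0)

def log_unnorm_posterior(theta, heads=6, tosses=9):
    """Unnormalized log-posterior: Binomial likelihood times a flat
    Beta(1, 1) prior. The normalizing constant p(D) never appears."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (tosses - heads) * math.log(1.0 - theta)

def metropolis(n_steps=20000, step=0.2):
    theta = 0.5
    samples = []
    for _ in range(n_steps):
        proposal = theta + random.gauss(0.0, step)  # symmetric random-walk proposal
        # Acceptance ratio uses only likelihood * prior; p(D) cancels.
        log_alpha = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
        if math.log(random.random()) < log_alpha:
            theta = proposal
        samples.append(theta)
    return samples

draws = metropolis()
post_mean = sum(draws) / len(draws)  # true posterior mean is 7/11 ≈ 0.636
```

For this conjugate toy problem the exact posterior is Beta(7, 4), so the sampled mean can be checked against $7/11$; the point of MCMC is that the same loop works unchanged when no such closed form exists.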

Question 3

What is the key advantage of HMC/NUTS over basic Metropolis-Hastings?

Answer HMC/NUTS uses the gradient of the log-posterior to guide proposals along the posterior surface (via Hamiltonian dynamics), producing distant proposals that still have high acceptance probability. This dramatically reduces autocorrelation and improves scaling with dimensionality: HMC scales as $O(d^{5/4})$ versus $O(d^2)$ for random-walk Metropolis-Hastings. The result is far higher effective sample sizes per unit of computation time.

Question 4

You fit a model and observe $\hat{R} = 1.08$ for one parameter. Should you trust the posterior estimates? What should you do?

Answer No. $\hat{R} = 1.08$ exceeds the threshold of 1.01 (and even the looser threshold of 1.05), indicating that the chains have not converged to the same distribution. The between-chain variance is substantially larger than the within-chain variance, meaning different chains are exploring different regions. You should: (1) increase the number of warm-up iterations, (2) check for multimodality, (3) consider reparameterizing the model, and (4) rerun before interpreting any results.
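The between/within comparison behind $\hat{R}$ is easy to sketch. This is the basic (non-split) Gelman-Rubin statistic on synthetic chains; modern tools such as ArviZ use a more robust rank-normalized split version, so treat this as an illustration of the idea only:

```python
import random
import statistics

def r_hat(chains):
    """Basic (non-split) Gelman-Rubin R-hat: compares between-chain and
    within-chain variance. Values near 1.00 suggest the chains agree."""
    m = len(chains)       # number of chains
    n = len(chains[0])    # draws per chain
    means = [statistics.fmean(c) for c in chains]
    variances = [statistics.variance(c) for c in chains]
    w = statistics.fmean(variances)      # within-chain variance
    b = n * statistics.variance(means)   # between-chain variance
    var_plus = (n - 1) / n * w + b / n   # pooled posterior-variance estimate
    return (var_plus / w) ** 0.5

random.seed(1)
# Four chains sampling the same distribution -> R-hat near 1.00
good = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# One chain stuck in a different region -> R-hat well above 1
bad = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
bad.append([random.gauss(3, 1) for _ in range(1000)])
```

Running `r_hat(good)` lands near 1.00, while `r_hat(bad)` is far above the 1.01 threshold, exactly the signature of chains exploring different regions.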

Question 5

A model has ESS bulk of 150 across 4 chains with 2,000 draws each (8,000 total). Is this adequate? What does it imply about the MCMC efficiency?

Answer No. ESS bulk of 150 is below the recommended minimum of 400 for reliable posterior mean estimates. The sampling efficiency is $150 / 8000 \approx 0.019$ (1.9%): the 8,000 correlated draws carry roughly as much information about the posterior mean as 150 independent draws would. This suggests the sampler is struggling — likely random-walk behavior caused by strong parameter correlations or poor posterior geometry. Remedies: reparameterize (e.g., non-centered parameterization), increase `target_accept`, or run more draws.
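How autocorrelation eats into the effective sample size can be shown with a crude ESS estimator, $N / (1 + 2\sum_t \rho_t)$, truncated at the first non-positive autocorrelation. This is a simplification of what packages like ArviZ actually compute, and the AR(1) chain below is a stand-in for a poorly mixing sampler:

```python
import random
import statistics

def ess(x, max_lag=200):
    """Crude effective sample size: N / (1 + 2 * sum of autocorrelations),
    truncated at the first non-positive autocorrelation estimate."""
    n = len(x)
    mean = statistics.fmean(x)
    var = sum((v - mean) ** 2 for v in x) / n
    rho_sum = 0.0
    for lag in range(1, max_lag):
        cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

random.seed(2)
# Independent draws: ESS close to the nominal sample size
iid = [random.gauss(0, 1) for _ in range(4000)]
# Sticky AR(1) chain (mimicking random-walk MCMC): ESS collapses
ar = [0.0]
for _ in range(3999):
    ar.append(0.95 * ar[-1] + random.gauss(0, 1))
```

For the AR(1) chain the theoretical factor is $1 + 2 \cdot 0.95/(1-0.95) = 39$, so 4,000 draws are worth only about a hundred independent ones — the same order of inefficiency as the 150/8,000 scenario in the question.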

Question 6

What is a divergence in HMC, and why is it the most important diagnostic?

Answer A divergence occurs when the leapfrog integrator in HMC fails to accurately simulate the Hamiltonian trajectory, typically in regions of high posterior curvature. Divergences are the most important diagnostic because they indicate that **the sampler may be systematically missing regions of the posterior**, leading to biased estimates — not just noisy estimates, but systematically wrong ones. A model with divergences cannot be trusted. The most common fix is the non-centered parameterization for hierarchical models.

Question 7

What are the five stages of the Bayesian workflow?

Answer The five stages are:

1. **Prior predictive check:** Simulate data from the prior predictive distribution to verify that the model-plus-prior combination generates plausible data.
2. **Fit the model:** Run MCMC (typically NUTS) to sample from the posterior.
3. **Validate computation (MCMC diagnostics):** Check $\hat{R}$, ESS, and divergences to confirm the sampler converged correctly.
4. **Posterior predictive check:** Generate replicated data from the posterior predictive distribution and compare it to the observed data to assess model fit.
5. **Model comparison:** Compare alternative models using WAIC or LOO-CV to select the model with the best predictive performance.

The workflow is iterative: failures at any stage send you back to revise the model.

Question 8

Explain the difference between complete pooling, no pooling, and partial pooling with a concrete example.

Answer Suppose you are estimating click-through rates for 50 content categories.

- **Complete pooling:** Estimate a single rate for all categories: $\hat{\theta} = \text{total clicks} / \text{total impressions}$. This ignores real differences between categories.
- **No pooling:** Estimate each category independently: $\hat{\theta}_j = \text{clicks}_j / \text{impressions}_j$. This respects differences but produces noisy estimates for categories with few impressions.
- **Partial pooling (hierarchical):** Estimate each category's rate as a compromise between its own data and the population distribution. Categories with few impressions are pulled toward the population mean; categories with abundant data are dominated by their own evidence.

Partial pooling typically has the lowest mean squared error because it reduces variance for small groups (via shrinkage) without introducing excessive bias.
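The three estimators can be contrasted numerically. The categories, counts, and the pseudo-count `k` below are invented for this sketch; a real hierarchical model estimates the amount of pooling from the data itself rather than fixing it by hand:

```python
# Invented click data; "niche" has far too few impressions to trust its raw rate.
categories = {                 # name: (clicks, impressions)
    "news":   (480, 10_000),
    "sports": (600, 10_000),
    "niche":  (3, 20),         # raw rate of 15% from only 20 impressions
}

total_clicks = sum(c for c, _ in categories.values())
total_impr = sum(n for _, n in categories.values())
pooled_rate = total_clicks / total_impr          # complete pooling: one shared rate

k = 100  # prior pseudo-impressions: strength of the pull toward the pooled rate
estimates = {}
for name, (clicks, impr) in categories.items():
    raw = clicks / impr                                  # no pooling
    partial = (clicks + k * pooled_rate) / (impr + k)    # partial pooling
    estimates[name] = (raw, partial)
# "niche" moves most of the way toward pooled_rate; "news" barely moves.
```

The pseudo-count formula is the Beta-Binomial posterior mean with the prior centered on the pooled rate, which reproduces the qualitative behavior of partial pooling: shrinkage is strong exactly where the data are weak.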

Question 9

In the eight schools example, School A reported a treatment effect of 28.39 points. The hierarchical model estimates 11.28 points. Why is the hierarchical estimate so much lower?

Answer School A's standard error is 14.9 — very large relative to the estimate of 28.39. This means the raw estimate is noisy and unreliable. The hierarchical model recognizes that the other seven schools show effects ranging from $-2.75$ to $18.01$, with a population mean around 8. Given the large standard error, the model concludes that School A's extreme value is more likely due to noise than a genuine 28-point effect, and shrinks the estimate toward the population mean. The amount of shrinkage (83%) equals the ratio of the school's sampling variance ($\sigma_j^2$) to the total variance ($\sigma_j^2 + \tau^2$).
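The arithmetic can be checked directly, conditional on fixed population values. The population mean $\mu \approx 8$ comes from the chapter's numbers, but the between-school scale $\tau = 6.6$ used here is back-solved for illustration, not taken from the fitted model:

```python
# Shrinkage of School A's estimate, conditional on assumed population values.
y_a, sigma_a = 28.39, 14.9      # School A: raw effect and standard error
mu, tau = 8.0, 6.6              # assumed population mean and scale (illustrative)

# Precision-weighted compromise between the school's data and the population
w = tau**2 / (tau**2 + sigma_a**2)    # weight on School A's own estimate
theta_a = w * y_a + (1 - w) * mu      # close to the model's 11.28
shrinkage = 1 - w                     # fraction of the distance moved toward mu
```

With these values the weight on School A's own data is only about 16%, so the posterior mean lands near 11.3 — the noisy raw estimate contributes little against the population's pull.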

Question 10

What is the non-centered parameterization, and why does it resolve divergence problems in hierarchical models?

Answer The **centered parameterization** specifies $\theta_j \sim \mathcal{N}(\mu, \tau)$ directly. The **non-centered parameterization** introduces an auxiliary variable: $\eta_j \sim \mathcal{N}(0, 1)$ and sets $\theta_j = \mu + \tau \cdot \eta_j$. The centered parameterization creates **Neal's funnel**: when $\tau$ is small, the $\theta_j$ values are tightly constrained near $\mu$, creating a narrow funnel in parameter space that the sampler cannot navigate with a fixed step size. The non-centered parameterization breaks this coupling — $\eta_j$ is always standard normal regardless of $\tau$, so the geometry is well-behaved even when $\tau$ is near zero. Both parameterizations define the same model; only the computational geometry differs.

Question 11

What is WAIC, and how does it relate to AIC?

Answer WAIC (Widely Applicable Information Criterion) is the Bayesian generalization of AIC. It estimates out-of-sample predictive accuracy as:

$$\text{WAIC} = -2(\text{lppd} - p_{\text{WAIC}})$$

where lppd is the log pointwise predictive density (average log-likelihood across posterior samples) and $p_{\text{WAIC}}$ is the effective number of parameters (a complexity penalty computed as the sum of per-observation variances of the log-likelihood across posterior samples). Like AIC, WAIC balances fit (lppd) against complexity ($p_{\text{WAIC}}$). Unlike AIC, WAIC uses the full posterior distribution rather than a point estimate, making it valid for hierarchical and complex models where AIC's assumptions break down. Lower WAIC is better.
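The definition translates almost line-for-line into code, given a matrix of pointwise log-likelihoods. This sketch assumes `log_lik[s][i]` holds the log-likelihood of observation `i` under posterior draw `s` (the layout ArviZ and similar tools also use):

```python
import math
import statistics

def waic(log_lik):
    """WAIC from log_lik[s][i] = log-likelihood of observation i under
    posterior draw s. Returns (waic, lppd, p_waic); lower WAIC is better."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        # lppd term: log of the draw-averaged likelihood (log-sum-exp for stability)
        m = max(col)
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / n_draws)
        # penalty term: variance of the log-likelihood across draws
        p_waic += statistics.variance(col)
    return -2.0 * (lppd - p_waic), lppd, p_waic

# Toy check: identical draws give zero penalty, so WAIC = -2 * lppd
w, lppd, p = waic([[-1.0, -2.0] for _ in range(4)])   # w = 6.0, p = 0.0
```

In the degenerate toy case every draw assigns the same likelihood, so the variance penalty vanishes and WAIC reduces to $-2 \cdot \text{lppd}$; with a real posterior, draws disagree and $p_{\text{WAIC}} > 0$.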

Question 12

What does a Pareto $\hat{k}$ value of 0.85 for an observation indicate in PSIS-LOO?

Answer A Pareto $\hat{k}$ value of 0.85 (above the 0.7 threshold) indicates that the importance sampling approximation for that observation's leave-one-out predictive density is unreliable. This means that observation has a disproportionate influence on the posterior — it is so unusual that removing it substantially changes the model fit. The LOO estimate for that observation cannot be trusted. Solutions: (1) refit the model without that observation to get the exact LOO value, (2) investigate whether the observation is an outlier or data error, or (3) consider a model with heavier tails (e.g., Student-t likelihood) that can accommodate influential observations.

Question 13

True or False: The non-centered and centered parameterizations of a hierarchical model define different probability models.

Answer **False.** Both parameterizations define the same joint probability distribution over observed and latent variables. The marginal and conditional distributions are identical. The only difference is the computational geometry of the posterior — how the parameter space is traversed by the MCMC sampler. The non-centered parameterization eliminates the funnel geometry that causes divergences, but the statistical model and its predictions are unchanged.

Question 14

A colleague says: "I used a Bayesian model with a flat (uniform) prior, so my analysis is completely objective." Evaluate this claim.

Answer This claim is incorrect on multiple levels. First, a flat prior is not "objective" — it implicitly states that a parameter value of $10^6$ is as plausible as a value of 1, which encodes a strong (and usually unreasonable) prior belief. Second, flat priors are not invariant under reparameterization: a uniform prior on $\theta$ implies a non-uniform prior on $\log(\theta)$ or $\theta^2$. Third, flat priors often cause computational problems in MCMC (poor exploration, slow convergence) and can produce improper posteriors in some models. A weakly informative prior that encodes the plausible range of the parameter is both more honest and more computationally practical than a flat prior.
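The non-invariance point can be made concrete with a simulation. The range $(0.01, 100)$ is an arbitrary illustration:

```python
import random

random.seed(4)
# A "flat" prior on theta over (0.01, 100) ...
theta = [random.uniform(0.01, 100.0) for _ in range(100_000)]

# ... is anything but flat on the log scale: the top decade (10, 100)
# covers ~90% of the interval's length, so it soaks up ~90% of the mass.
frac_top_decade = sum(1 for t in theta if t > 10.0) / len(theta)
# A prior flat in log10(theta) over the same range would put only 1/4 there.
```

A prior that is uniform on $\theta$ concentrates nearly all its mass in the largest decade, while a prior uniform on $\log_{10}\theta$ would spread mass equally across the four decades — the same "flat" label, two very different prior beliefs.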

Question 15

In the StreamRec hierarchical model, the Experimental category has 290 impressions and a raw engagement rate of 9.66%. The hierarchical estimate is closer to 15%. Explain why.

Answer The Experimental category has very few impressions (290), making its raw rate unreliable. The hierarchical model estimates a population-level mean engagement rate (around 25-30% across all categories). Because the Experimental category has sparse data, the hierarchical model heavily shrinks its estimate toward the population mean. The raw rate of 9.66% is pulled upward toward the population center because the model "trusts" the population distribution more than the noisy estimate from 290 observations. If the category had 50,000 impressions, the hierarchical estimate would be very close to the raw rate.

Question 16

When should you use Bayesian methods versus frequentist mixed-effects models? Name two conditions that favor Bayesian and two that favor frequentist.

Answer **Favor Bayesian:**

1. When you need full posterior distributions (e.g., $P(\text{effect} > 0 \mid D)$, expected loss calculations) rather than just point estimates and p-values.
2. When you have genuine informative prior knowledge (e.g., meta-analytic priors from previous studies) that should formally influence the analysis.

**Favor frequentist:**

1. When computational cost matters and the model is large — frequentist mixed-effects models (REML) fit in seconds while MCMC may take hours for the same model.
2. When regulatory or organizational requirements mandate frequentist reporting (e.g., FDA submissions require p-values and confidence intervals as the primary analysis).

Question 17

You fit a Bayesian model and the posterior predictive check shows that the replicated data has much less variance than the observed data. What does this suggest?

Answer This suggests the model is **underdispersed** — it does not capture the full variability in the data. Common causes: (1) the model assumes a distribution with too little variance (e.g., Poisson when the data are overdispersed — a negative binomial would be more appropriate), (2) important sources of variation are missing from the model (e.g., a hierarchical structure that is not modeled, or unaccounted-for covariates), or (3) the data has outliers that the model's thin-tailed distributions cannot accommodate (consider a Student-t likelihood). The posterior predictive check has revealed a specific way the model is wrong, and the remedy is to revise the model to capture the missing variance.
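A miniature version of this check, using only the standard library: the "observed" data are an overdispersed gamma-Poisson (negative binomial) mixture with invented parameters, and the misspecified model is a plain Poisson fit by matching the mean:

```python
import math
import random
import statistics

random.seed(5)

def poisson_draw(lam):
    """Knuth's Poisson sampler; fine for small rates."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

# "Observed" data: gamma-Poisson mixture with mean ~10 but variance ~60
# (shape 2.0 and scale 5.0 are invented for the sketch)
data = [poisson_draw(random.gammavariate(2.0, 5.0)) for _ in range(2000)]

# Misspecified model: plain Poisson fit by matching the mean
lam_hat = statistics.fmean(data)
replicated = [poisson_draw(lam_hat) for _ in range(2000)]

# The check: replicated variance (~10) falls far short of the observed (~60)
var_obs = statistics.variance(data)
var_rep = statistics.variance(replicated)
```

The replicated variance clusters near the Poisson's mean-equals-variance value, far below the observed variance — exactly the discrepancy a posterior predictive check would flag, pointing toward a negative binomial likelihood.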

Question 18

What is the "Bayesian Occam's razor" in the context of model comparison?

Answer The Bayesian Occam's razor is the built-in complexity penalty in marginal likelihood-based model comparison. A complex model spreads its prior predictive probability over a wide range of possible datasets. A simple model concentrates its predictions on a narrower range. If the observed data falls within the simple model's predicted range, the simple model achieves a higher marginal likelihood — not because it fits the data better in a maximum-likelihood sense, but because it made more precise predictions. This automatically penalizes unnecessary complexity without requiring an explicit penalty term (unlike AIC or BIC). WAIC and LOO-CV inherit a version of this property.
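A discrete toy calculation shows the razor at work. The data (5 heads in 10 flips) and the nine-point grid prior are invented for illustration:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

k, n = 5, 10   # illustrative data: 5 heads in 10 flips

# Simple model: all prior mass on a fair coin
evidence_simple = binom_pmf(k, n, 0.5)

# Complex model: prior spread uniformly over nine candidate biases
grid = [i / 10 for i in range(1, 10)]
evidence_complex = sum(binom_pmf(k, n, p) for p in grid) / len(grid)

# The complex model can explain any outcome, but spreads its prior
# predictions thin; for data the simple model anticipated, it loses.
```

Here the simple model's marginal likelihood (about 0.246) beats the complex model's (about 0.101), even though the complex model contains $p = 0.5$ as one of its options — the razor penalizes it for hedging across biases the data do not support.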

Question 19

Explain why standardizing predictors improves MCMC performance.

Answer Standardizing predictors (subtracting the mean and dividing by the standard deviation) improves MCMC performance because it reduces the posterior correlation between the intercept and slope parameters, and it puts all parameters on comparable scales. Without standardization, the posterior geometry can have strong correlations and vastly different scales across dimensions, forcing NUTS to use a very small step size (to avoid divergences in the tightest dimension) that makes exploration slow in other dimensions. With standardized predictors, the posterior is typically more spherical, allowing larger step sizes, faster exploration, lower autocorrelation, and higher effective sample sizes.
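The intercept-slope correlation can be computed directly from the design. This sketch uses the classical formula for simple linear regression, $-\bar{x}/\sqrt{\overline{x^2}}$, which (with flat priors) also describes the posterior correlation the sampler must navigate; the calendar-year predictor is an invented example:

```python
import statistics

def intercept_slope_corr(x):
    """Correlation between intercept and slope estimates in simple linear
    regression: -mean(x) / sqrt(mean(x**2)). With flat priors the posterior
    correlation has the same form, so it describes the sampler's geometry."""
    mx = statistics.fmean(x)
    mx2 = sum(v * v for v in x) / len(x)
    return -mx / mx2 ** 0.5

raw = [float(year) for year in range(2000, 2021)]   # e.g. calendar years
mean, sd = statistics.fmean(raw), statistics.pstdev(raw)
std = [(v - mean) / sd for v in raw]

corr_raw = intercept_slope_corr(raw)   # near -1: a long, narrow diagonal ridge
corr_std = intercept_slope_corr(std)   # near 0: a round, easy-to-sample posterior
```

With the raw years, the intercept and slope are almost perfectly anticorrelated (a tilted, needle-thin posterior ridge); after standardizing, the correlation vanishes and the sampler can take large steps in every direction.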

Question 20

A model has zero divergences, $\hat{R} = 1.00$ for all parameters, and ESS > 2,000 for all parameters. Does this guarantee that the posterior is correct?

Answer **No.** Clean diagnostics indicate that the MCMC sampler converged and explored the posterior efficiently — but they say nothing about whether the model is correct. The posterior is the correct posterior *for this model*, but the model itself may be misspecified (wrong likelihood, wrong functional form, missing covariates, missing hierarchical structure). That is why posterior predictive checks and model comparison are essential stages of the Bayesian workflow — they assess whether the model, not just the computation, is adequate. Clean MCMC diagnostics are necessary but not sufficient for trustworthy inference.