Chapter 21: Key Takeaways
- MCMC makes Bayesian inference practical for arbitrary models. When conjugate posteriors do not exist — which is most real-world models — Markov chain Monte Carlo (MCMC) generates samples from the posterior without computing the intractable normalizing constant $p(D)$. The NUTS sampler (the default in PyMC) uses gradient information to explore the posterior efficiently, achieving effective sample sizes of 30-80% of total draws in well-specified models. You specify the model; the sampler does the rest.
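The key idea — that sampling needs only the *unnormalized* posterior — can be seen in a minimal random-walk Metropolis sampler (the simplest MCMC algorithm, not the gradient-based NUTS that PyMC uses). The model and prior values here are invented for illustration; note that $p(D)$ cancels in the acceptance ratio and never has to be computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N(0, 1) prior on theta, Gaussian likelihood for observed data.
data = rng.normal(1.0, 1.0, size=50)

def log_post(theta):
    """Unnormalized log-posterior: log prior + log likelihood.
    The normalizing constant p(D) is deliberately omitted."""
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

# Random-walk Metropolis: accept a proposal with probability
# min(1, post(prop) / post(current)) -- the ratio cancels p(D).
theta, draws = 0.0, []
for _ in range(5000):
    prop = theta + rng.normal(0.0, 0.3)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

posterior = np.array(draws[1000:])  # discard warm-up draws
print(round(posterior.mean(), 2))   # close to the analytic posterior mean
```

For this conjugate toy problem the exact posterior mean is `data.sum() / (len(data) + 1)`, so the sampler's answer can be checked directly — the point of MCMC is that the same recipe works when no such closed form exists.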
- MCMC diagnostics are not optional — they are part of the inference. A Bayesian analysis is not complete until you have checked three things: (1) no divergences (the sampler explored the full posterior), (2) $\hat{R} < 1.01$ for all parameters (all chains converged to the same distribution), and (3) ESS > 400 for quantities of interest (enough independent samples for reliable estimates). A model that passes all three checks can be trusted. A model that fails any of them cannot, regardless of how appealing the posterior summaries look.
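What $\hat{R}$ measures can be made concrete with a few lines of NumPy. This is the classic split-$\hat{R}$ (between-chain vs. within-chain variance), a simplified stand-in for the rank-normalized statistic that ArviZ actually reports; the simulated chains are synthetic:

```python
import numpy as np

def split_rhat(chains):
    """Classic split-R-hat: split each chain in half, then compare
    between-chain variance B to within-chain variance W. Values near
    1.0 indicate the chains agree; > 1.01 signals non-convergence."""
    chains = np.asarray(chains)
    half = chains.shape[1] // 2
    halves = np.concatenate([chains[:, :half], chains[:, half:2*half]], axis=0)
    m, n = halves.shape
    chain_means = halves.mean(axis=1)
    B = n * chain_means.var(ddof=1)         # between-chain variance
    W = halves.var(axis=1, ddof=1).mean()   # within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))  # four well-mixed chains
stuck = mixed + np.arange(4)[:, None]         # chains stuck at different levels
print(split_rhat(mixed))  # ~1.00: converged
print(split_rhat(stuck))  # well above 1.01: chains disagree
```

The second case shows why all chains matter: each stuck chain looks perfectly healthy on its own, and only the between-chain comparison exposes the failure.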
- The Bayesian workflow is a discipline of model criticism. Prior predictive check (does the model generate plausible data?), fit (sample the posterior), validate computation (MCMC diagnostics), posterior predictive check (does the model reproduce the observed data?), model comparison (which model predicts best?). The workflow is iterative: failures at any stage send you back to revise the model. The goal is not to find the "true" model — it is to systematically discover and document how your model is wrong.
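The first workflow step — the prior predictive check — is just simulation: draw parameters from the priors, generate fake data, and ask whether it is plausible before looking at real data. A NumPy sketch (a stand-in for PyMC's `sample_prior_predictive`; the height model and prior values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model of adult heights in cm:
#   mu ~ Normal(170, 20), sigma ~ HalfNormal(10), y ~ Normal(mu, sigma)
n_sims, n_obs = 1000, 100
mu = rng.normal(170.0, 20.0, size=n_sims)
sigma = np.abs(rng.normal(0.0, 10.0, size=n_sims))  # half-normal draw
y_sim = rng.normal(mu[:, None], sigma[:, None], size=(n_sims, n_obs))

# Before seeing any data: does the model generate plausible heights?
frac_absurd = np.mean((y_sim < 0) | (y_sim > 300))
print(f"fraction of simulated heights outside (0, 300) cm: {frac_absurd:.3f}")
```

If a large fraction of simulated data were physically impossible (negative heights, 5-meter people), that would send you back to tighten the priors — a failure caught before fitting anything.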
- Hierarchical models with partial pooling are the strongest practical argument for Bayesian methods. When you estimate parameters for multiple groups with unequal sample sizes, partial pooling produces lower mean squared error than either complete pooling (ignoring group differences) or no pooling (estimating each group independently). Small groups borrow strength from the population distribution; large groups are dominated by their own data. The degree of shrinkage is automatically calibrated by the model, not set by an analyst's judgment.
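The MSE claim is easy to verify by simulation. The sketch below treats the population spread $\tau$ and noise $\sigma$ as known to keep the shrinkage formula explicit (a full hierarchical model estimates them too); all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# J groups with unequal sizes; true effects drawn from a population.
J, tau, sigma = 50, 1.0, 2.0
sizes = rng.integers(2, 50, size=J)
theta_true = rng.normal(0.0, tau, size=J)
group_means = np.array([
    rng.normal(theta_true[j], sigma, size=sizes[j]).mean() for j in range(J)
])

# Partial pooling: shrink each group mean toward the grand mean by a
# factor driven by its sampling variance sigma^2 / n_j. Small groups
# (large sampling variance) are shrunk hard; large groups barely move.
grand_mean = np.average(group_means, weights=sizes)
shrink = (sigma**2 / sizes) / (sigma**2 / sizes + tau**2)
partial = shrink * grand_mean + (1 - shrink) * group_means

mse = lambda est: np.mean((est - theta_true) ** 2)
print(mse(group_means))             # no pooling
print(mse(np.full(J, grand_mean)))  # complete pooling
print(mse(partial))                 # partial pooling: lowest of the three
```

The shrinkage weights come straight from the data structure — no analyst-chosen tuning knob — which is the "automatically calibrated" point in the takeaway.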
- The non-centered parameterization is the single most important trick for hierarchical Bayesian models. Writing $\theta_j = \mu + \tau \cdot \eta_j$ with $\eta_j \sim \mathcal{N}(0, 1)$ instead of $\theta_j \sim \mathcal{N}(\mu, \tau)$ eliminates the "funnel" geometry that causes divergences when $\tau$ is small. Both parameterizations define the same model; the non-centered version simply navigates the posterior more efficiently. If your hierarchical model has divergences, try the non-centered parameterization before anything else.
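That the two parameterizations define the same distribution can be checked directly — the difference is purely in what the sampler sees, not in the model. A quick NumPy check with arbitrary values of $\mu$ and $\tau$:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, tau, n = 1.5, 0.3, 200_000

# Centered: draw theta_j directly from N(mu, tau).
centered = rng.normal(mu, tau, size=n)

# Non-centered: draw standard-normal eta_j, then scale and shift.
# Identical distribution, but the sampler works on eta, which is
# decoupled from tau -- removing the funnel when tau is itself
# a sampled parameter that approaches zero.
eta = rng.normal(0.0, 1.0, size=n)
non_centered = mu + tau * eta

print(centered.mean(), non_centered.mean())  # both ~ mu
print(centered.std(), non_centered.std())    # both ~ tau
```

In PyMC this corresponds to declaring the group effects via a standard-normal latent variable and a deterministic transform, rather than a `Normal(mu, tau)` prior directly.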
- Model comparison via LOO-CV (PSIS-LOO) balances fit and complexity. WAIC and PSIS-LOO estimate out-of-sample predictive accuracy from the fitted posterior, automatically penalizing models that are more complex than the data can support. The Pareto $\hat{k}$ diagnostic identifies influential observations where the LOO approximation may be unreliable. For hierarchical models, LOO almost always favors the hierarchical model over both complete pooling and no pooling — confirming that partial pooling provides the best bias-variance tradeoff.
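The "fit minus complexity" structure is most visible in WAIC, the simpler of the two criteria (in practice you would call ArviZ's `az.loo` for PSIS-LOO rather than compute this by hand). A sketch on a toy pointwise log-likelihood matrix, with all data invented:

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S draws x N observations) pointwise log-likelihood
    matrix: lppd (fit) minus the effective number of parameters
    (complexity). Returns an elpd estimate -- higher is better."""
    # log pointwise predictive density, computed stably via log-sum-exp
    m = log_lik.max(axis=0)
    lppd = np.sum(m + np.log(np.mean(np.exp(log_lik - m), axis=0)))
    # complexity penalty: posterior variance of the pointwise log-lik
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return lppd - p_waic

# Toy check: posterior draws for a Gaussian mean, evaluated pointwise.
rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=100)
draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=400)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - draws[:, None]) ** 2
print(waic(log_lik))
```

A model with more flexibility raises `lppd` but also raises `p_waic`; the criterion rewards the extra flexibility only when the improved fit outweighs the penalty.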
- Bayesian methods earn their complexity when the problem has hierarchical structure, genuine prior knowledge, unequal group sizes, and a need for calibrated uncertainty. These conditions describe pharmaceutical multi-site trials, recommendation system category estimation, education research across schools, and small-area estimation — among many others. When these conditions do not hold (abundant data, no group structure, point predictions sufficient), frequentist or standard ML methods are faster and equally accurate. The decision to go Bayesian should be driven by problem characteristics, not by methodology preference.