Chapter 20: Key Takeaways

  1. Bayes' theorem is optimal belief updating, not a philosophy. The posterior is proportional to the likelihood times the prior: $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$. This is not a matter of opinion — it is a mathematical consequence of the axioms of probability. The "frequentist vs. Bayesian" framing is the wrong debate; both are tools, and the question is which tool fits the problem.
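The proportionality can be made concrete by evaluating the unnormalized posterior on a discrete grid and normalizing; a minimal stdlib sketch for a Bernoulli likelihood under a uniform prior (the grid size and data counts are illustrative assumptions):

```python
# Grid approximation of p(theta | D) ∝ p(D | theta) p(theta)
# for a Bernoulli likelihood with k successes in n trials.

def grid_posterior(k, n, grid_size=101):
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size                        # uniform prior p(theta)
    lik = [t**k * (1 - t)**(n - k) for t in grid]    # likelihood p(D | theta)
    unnorm = [l * p for l, p in zip(lik, prior)]
    z = sum(unnorm)                                  # normalizing constant p(D)
    return grid, [u / z for u in unnorm]

grid, post = grid_posterior(k=7, n=10)
mode = grid[post.index(max(post))]  # under a flat prior, posterior mode = MLE k/n
```

With a flat prior the posterior mode lands on the MLE (0.7 here), which previews the MAP-MLE connection in takeaway 3.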

  2. Conjugate priors give closed-form posteriors that update by adding counts. The Beta-Binomial conjugate pair updates as Beta$(\alpha + k, \beta + n - k)$: add observed successes to $\alpha$ and failures to $\beta$. The Normal-Normal pair updates by precision-weighted averaging of the prior mean and the sample mean. Conjugacy enables real-time Bayesian systems (like StreamRec's per-user preference model) that update in constant time and memory per observation.
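The add-counts update is literally two additions, which is what makes the constant-time, constant-memory claim true; a minimal sketch (the function name is illustrative, not from the text):

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate Beta-Binomial update: Beta(alpha + k, beta + n - k)."""
    return alpha + successes, beta + failures

# Start from a weak Beta(1, 1) prior, observe 3 successes and 1 failure.
a, b = beta_update(1, 1, successes=3, failures=1)
mean = a / (a + b)  # posterior mean = 4/6
```

No raw data needs to be stored: the pair $(\alpha, \beta)$ is a sufficient summary of everything seen so far.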

  3. MAP with a Gaussian prior IS L2 regularization; MAP with a Laplace prior IS L1 regularization. The MAP-MLE-regularization triangle reveals that every regularized model is implicitly Bayesian: it assumes a prior on the parameters. A tighter prior (smaller variance) corresponds to stronger regularization (larger $\lambda$). Understanding this connection means you can reason about regularization choices in terms of prior beliefs about parameter magnitudes.
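The correspondence follows directly from taking the negative log of the posterior. With a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$:

$$
\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(D \mid \theta)\, p(\theta) = \arg\min_\theta \left[ -\log p(D \mid \theta) + \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 \right],
$$

so $\lambda = 1/(2\sigma^2)$, and a tighter prior (smaller $\sigma^2$) gives stronger regularization (larger $\lambda$). A Laplace prior with scale $b$ contributes $\frac{1}{b}\lVert \theta \rVert_1$ instead, recovering L1.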

  4. Prior selection is an engineering decision, not a philosophical commitment. The practical framework is: (1) determine the parameter's plausible range and magnitude, (2) choose uninformative, weakly informative, or informative priors accordingly, (3) validate with prior predictive checks (does the prior generate plausible data?), (4) assess sensitivity (does the posterior change meaningfully under alternative reasonable priors?). Weakly informative priors are the default for most applied work.
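Step (3), the prior predictive check, can be sketched with stdlib sampling: draw a parameter from the prior, simulate a dataset, repeat, and eyeball whether the simulated data is plausible. The Beta prior on a success rate and the trial count below are illustrative assumptions:

```python
import random

random.seed(0)

def prior_predictive(alpha, beta, n_trials, n_sims=1000):
    """Simulate success counts implied by a Beta(alpha, beta) prior on a rate."""
    sims = []
    for _ in range(n_sims):
        theta = random.betavariate(alpha, beta)  # draw a rate from the prior
        k = sum(random.random() < theta for _ in range(n_trials))  # simulate data
        sims.append(k)
    return sims

# Does a weakly informative Beta(2, 8) prior generate plausible counts out of 100?
sims = prior_predictive(alpha=2, beta=8, n_trials=100)
```

If the simulated counts concentrate in a range domain experts would call absurd, the prior should be revised before seeing any data.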

  5. The posterior provides what practitioners actually want: the probability that the parameter is in a given range. A 95% Bayesian credible interval says "there is a 95% probability the parameter lies here, given the data." A frequentist confidence interval says "if I repeated this experiment many times, 95% of intervals would contain the truth." The Bayesian statement is more directly useful for decision-making, especially when communicating to non-statisticians.
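A central 95% credible interval can be read off posterior samples directly; a stdlib-only Monte Carlo sketch for a Beta posterior (the counts and sample size are illustrative):

```python
import random

random.seed(0)

def credible_interval(alpha, beta, level=0.95, n_samples=100_000):
    """Monte Carlo central credible interval for a Beta(alpha, beta) posterior."""
    samples = sorted(random.betavariate(alpha, beta) for _ in range(n_samples))
    tail = (1 - level) / 2
    lo = samples[int(tail * n_samples)]
    hi = samples[int((1 - tail) * n_samples) - 1]
    return lo, hi

# Posterior after 70 successes in 100 trials with a Beta(1, 1) prior.
lo, hi = credible_interval(alpha=71, beta=31)
# Interpretation: P(lo <= theta <= hi | data) ≈ 0.95 — a direct probability
# statement about the parameter, which a confidence interval does not license.
```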

  6. Bayesian methods add the most value when data is scarce, prior knowledge is genuine, and uncertainty matters. Cold-start personalization, small-sample clinical inference, hierarchical estimation across groups with unequal sample sizes, and sequential decision-making under uncertainty are the canonical use cases. When data is abundant and only predictions are needed, the Bayesian posterior converges to the MLE, and the prior adds computational cost without changing the answer.
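The convergence claim can be checked directly for the Beta-Binomial model; the prior and counts below are illustrative:

```python
def posterior_mean(alpha, beta, k, n):
    """Posterior mean of Beta(alpha + k, beta + n - k)."""
    return (alpha + k) / (alpha + beta + n)

# Small sample: a Beta(2, 2) prior pulls the estimate noticeably toward 0.5.
small = posterior_mean(2, 2, k=7, n=10)        # 9/14 ≈ 0.643 vs MLE 0.7
# Large sample: the same prior barely matters.
large = posterior_mean(2, 2, k=7000, n=10000)  # ≈ 0.6999 vs MLE 0.7
```

At n = 10 the prior shifts the estimate by more than five points; at n = 10,000 the gap to the MLE is under a thousandth.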

  7. Sequential updating is a natural consequence of Bayes' theorem, not an approximation. Today's posterior becomes tomorrow's prior. For conjugate models, sequential updating (one observation at a time) and batch updating (all observations at once) produce identical posteriors. This makes Bayesian methods ideal for online systems, streaming data, and adaptive experimentation — any setting where beliefs must be updated continuously as new evidence arrives.
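For the Beta-Binomial pair, the sequential-equals-batch property is easy to verify; the observation encoding (1 = success, 0 = failure) is an illustrative assumption:

```python
def update(alpha, beta, obs):
    """One-observation conjugate update: obs is 1 (success) or 0 (failure)."""
    return (alpha + 1, beta) if obs else (alpha, beta + 1)

data = [1, 0, 1, 1, 0, 1]

# Sequential: today's posterior becomes tomorrow's prior.
a, b = 1, 1
for obs in data:
    a, b = update(a, b, obs)

# Batch: add all counts at once.
k = sum(data)
batch = (1 + k, 1 + len(data) - k)

# Both routes yield the same posterior, Beta(5, 3).
```

The order of the observations is irrelevant too, which is exactly what makes conjugate models safe for streaming pipelines that may process events out of order.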