Probabilistic and Bayesian methods provide a principled framework for reasoning under uncertainty. Instead of producing single-point predictions, Bayesian models yield full distributions over parameters and predictions, answering not just "What is the best guess?" but also "How confident should we be?" This makes them essential for safety-critical applications, small-data regimes, and decision-making under ambiguity.
The prior $p(\theta)$ encodes what we believe before seeing data. The likelihood $p(\mathcal{D} \mid \theta)$ quantifies how well a given parameter value explains the observed data. The posterior $p(\theta \mid \mathcal{D})$ is the updated belief after observing data, obtained via Bayes' theorem: $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta)$.
The posterior predictive distribution integrates over parameter uncertainty to produce predictions: $p(\tilde{y} \mid \mathcal{D}) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid \mathcal{D}) \, d\theta$.
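As a concrete example, the Beta-Binomial model makes this integral tractable in closed form. A minimal sketch, assuming a made-up coin-flip dataset and a Beta(2, 2) prior:

```python
import numpy as np
from scipy import stats

# Hypothetical coin-flip data: 7 heads out of 10 tosses.
heads, tosses = 7, 10

# Beta(2, 2) prior on the head probability theta (a weakly informative choice).
a0, b0 = 2.0, 2.0

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
a_post, b_post = a0 + heads, b0 + (tosses - heads)

# Posterior predictive probability that the next toss is heads:
# p(y=1 | D) = E[theta | D] = a_post / (a_post + b_post)  (exact, no sampling needed).
p_next_heads = a_post / (a_post + b_post)

# The same quantity via Monte Carlo integration over the posterior.
theta_samples = stats.beta(a_post, b_post).rvs(size=100_000, random_state=0)
p_next_heads_mc = theta_samples.mean()

print(p_next_heads, p_next_heads_mc)
```

The exact answer and the Monte Carlo estimate agree, illustrating that the predictive probability averages over the whole posterior rather than plugging in a single estimate of $\theta$.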
Prior Selection
Conjugate priors make the posterior analytically tractable (Beta-Binomial, Normal-Normal, Gamma-Poisson).
Weakly informative priors regularize without dominating the data (e.g., broad Gaussians).
Informative priors encode domain knowledge (e.g., historical data from previous experiments).
Always perform prior predictive checks: sample from the prior and verify that the implied predictions are plausible (a sketch follows this list).
Sensitivity analysis verifies that conclusions are robust across reasonable prior choices.
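A minimal sketch of a prior predictive check for a simple Gaussian model; the prior scales and sample sizes below are illustrative assumptions, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model: y_i ~ Normal(mu, sigma), with priors mu ~ Normal(0, 10) and
# sigma ~ HalfNormal(5). These prior scales are illustrative assumptions.
n_sims, n_obs = 1000, 50

mu_prior = rng.normal(0.0, 10.0, size=n_sims)
sigma_prior = np.abs(rng.normal(0.0, 5.0, size=n_sims))

# Simulate datasets implied by the prior alone (no real data involved).
y_sim = rng.normal(mu_prior[:, None], sigma_prior[:, None], size=(n_sims, n_obs))

# Inspect summaries of the simulated data: if these are wildly implausible
# for the quantity being modelled, the prior needs rethinking.
print("simulated data range:", y_sim.min(), y_sim.max())
print("typical dataset means:", np.percentile(y_sim.mean(axis=1), [5, 50, 95]))
```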
MAP vs. Full Bayesian
The Maximum A Posteriori (MAP) estimate finds the single most probable parameter value. It is equivalent to regularized maximum likelihood (L2 regularization corresponds to a Gaussian prior).
Full Bayesian inference integrates over the entire posterior, capturing uncertainty. This is more expensive but more informative than MAP.
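To see the MAP/regularization connection concretely for linear regression with Gaussian noise variance $\sigma^2$ and prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$, the negative log posterior is, up to additive constants,

$$-\log p(\mathbf{w} \mid \mathcal{D}) = \frac{1}{2\sigma^2} \sum_i \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \frac{1}{2\sigma_w^2} \|\mathbf{w}\|^2 + \text{const},$$

so the MAP estimate is exactly ridge regression with penalty $\lambda = \sigma^2 / \sigma_w^2$.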
Models and Algorithms
Naive Bayes Classifiers
Assume features are conditionally independent given the class: $p(\mathbf{x} \mid y) = \prod_j p(x_j \mid y)$.
Gaussian Naive Bayes models each feature as a Gaussian per class.
Multinomial Naive Bayes is the workhorse for text classification (bag-of-words features).
Despite the strong independence assumption, Naive Bayes is often surprisingly competitive, especially with limited data.
Training is extremely fast ($\mathcal{O}(n \cdot d)$ for $n$ samples and $d$ features) and requires no gradient optimization.
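A minimal sketch with scikit-learn's MultinomialNB on bag-of-words features; the toy corpus below is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: spam vs. ham.
texts = [
    "win a free prize now", "limited offer click here",
    "meeting at noon tomorrow", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed directly into the multinomial likelihood;
# alpha is the Laplace/Lidstone smoothing pseudo-count.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize offer"]))            # likely "spam"
print(model.predict_proba(["report for the meeting"]))
```

The `alpha` smoothing parameter is itself Bayesian in spirit: it acts as a symmetric Dirichlet prior over the per-class word distributions, preventing zero probabilities for unseen words.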
Bayesian Linear Regression
Places a prior on the weight vector: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$.
With a Gaussian likelihood (known noise variance), the posterior over $\mathbf{w}$ is Gaussian with closed-form mean and covariance.
Predictions include uncertainty bands that widen in regions with fewer observations.
L2 regularization (Ridge regression) is the MAP special case of Bayesian linear regression with a Gaussian prior.
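A minimal numpy sketch of the closed-form posterior and the widening predictive bands, assuming a known noise variance; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D data with an intercept feature.
n = 30
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with bias column
y = 0.5 + 2.0 * x + rng.normal(0, 0.5, size=n)

sigma2 = 0.5 ** 2      # assumed known noise variance
sigma_w2 = 10.0        # prior variance on the weights: w ~ N(0, sigma_w2 * I)

# Posterior: N(mean, cov) with
#   cov  = (X^T X / sigma2 + I / sigma_w2)^{-1}
#   mean = cov @ X^T y / sigma2
cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / sigma_w2)
mean = cov @ X.T @ y / sigma2

# Predictive distribution at new inputs: variance grows away from the data.
x_new = np.array([0.0, 10.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
pred_mean = X_new @ mean
pred_var = sigma2 + np.einsum("ij,jk,ik->i", X_new, cov, X_new)

print(pred_mean, np.sqrt(pred_var))
```

The predictive standard deviation at $x = 10$ (far outside the training range) is noticeably larger than at $x = 0$, which is the "widening uncertainty bands" behaviour described above.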
Markov Chain Monte Carlo (MCMC)
When the posterior is intractable, MCMC generates samples from it by constructing a Markov chain whose stationary distribution is the target posterior.
Metropolis-Hastings proposes new states and accepts/rejects based on the posterior ratio. Simple to implement (see the sketch after this list) but can be slow to converge.
Gibbs sampling samples each parameter from its full conditional distribution. Efficient when the conditionals are available in closed form, but it mixes poorly when parameters are strongly correlated.
Hamiltonian Monte Carlo (HMC) uses gradient information to propose distant, high-probability states. More efficient in high dimensions.
MCMC diagnostics include trace plots, autocorrelation, effective sample size, and the Gelman-Rubin $\hat{R}$ statistic.
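A minimal random-walk Metropolis sketch; the target log density below (a standard Normal) is a stand-in for any unnormalized log posterior:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(theta):
    """Unnormalized log posterior; a standard Normal stands in here."""
    return -0.5 * theta ** 2

def metropolis(log_post, n_samples=10_000, step=1.0, init=0.0):
    samples = np.empty(n_samples)
    theta = init
    lp = log_post(theta)
    accepted = 0
    for i in range(n_samples):
        proposal = theta + rng.normal(0.0, step)   # symmetric random-walk proposal
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio); work in log space.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
            accepted += 1
        samples[i] = theta
    return samples, accepted / n_samples

samples, accept_rate = metropolis(log_post)
burn = samples[2000:]                      # discard warm-up draws
print(accept_rate, burn.mean(), burn.std())
```

In practice you would tune the step size via the acceptance rate, discard warm-up, and run the diagnostics listed above (trace plots, $\hat{R}$, effective sample size) before trusting the draws.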
Variational Inference
Approximates the posterior with a simpler distribution $q(\theta)$ by maximizing the Evidence Lower Bound (ELBO).
Mean-field approximation assumes all parameters are independent under $q$: fast, but it may miss posterior correlations.
Trades accuracy for speed: variational inference scales better than MCMC to large datasets and high-dimensional models.
The quality of the approximation depends on the expressiveness of the variational family.
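As an illustration, consider coordinate-ascent mean-field VI for a Normal model with unknown mean $\mu$ and precision $\tau$ (Normal prior on $\mu$ given $\tau$, Gamma prior on $\tau$): the factorization $q(\mu)\,q(\tau)$ yields Normal and Gamma factors with closed-form updates. A minimal sketch, with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100)           # toy data
n, xbar = len(x), x.mean()

# Priors: mu | tau ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0)  (illustrative values).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
a_n = a0 + (n + 1) / 2                       # fixed by the model structure
b_n = b0                                     # initial guess, refined below
for _ in range(50):
    e_tau = a_n / b_n                        # E_q[tau]
    # Update q(mu): precision-weighted combination of prior mean and sample mean.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # Update q(tau): expected sums of squares under q(mu).
    e_sq = np.sum((x - mu_n) ** 2) + n / lam_n
    e_prior = lam0 * ((mu_n - mu0) ** 2 + 1 / lam_n)
    b_n = b0 + 0.5 * (e_sq + e_prior)

print("q(mu) mean:", mu_n, "q(tau) mean:", a_n / b_n)  # compare to 2.0 and 1/1.5**2
```

Each update maximizes the ELBO with respect to one factor while holding the other fixed; the loop converges in a handful of iterations for this conjugate example.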
Gaussian Processes
A Gaussian process (GP) defines a distribution over functions: any finite collection of function values is jointly Gaussian.
Fully specified by a mean function (usually zero) and a kernel (covariance) function.
RBF kernel: smooth functions with a characteristic length scale.
Periodic kernel: captures periodic patterns.
Composite kernels: sums and products of kernels model complex structure (trend + seasonality + noise).
GP regression provides exact posterior predictive distributions with calibrated uncertainty.
Marginal likelihood provides a principled way to optimize kernel hyperparameters and compare kernel structures.
Computational cost is $\mathcal{O}(n^3)$ due to matrix inversion, limiting GPs to moderate datasets. Sparse approximations (inducing points) reduce this to $\mathcal{O}(nm^2)$ where $m \ll n$.
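A minimal numpy sketch of exact GP regression with an RBF kernel; the length scale, noise level, and toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-|a-b|^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

# Toy 1D training data.
x_train = rng.uniform(-4, 4, size=20)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=20)
x_test = np.linspace(-6, 6, 200)

noise = 0.1 ** 2
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Posterior predictive: mean = K_s^T K^-1 y, cov = K_ss - K_s^T K^-1 K_s.
# The Cholesky factorization is the numerically stable route and is where
# the O(n^3) cost lives.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
pred_mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
pred_var = np.diag(K_ss) - np.sum(v ** 2, axis=0)

print(pred_mean[:3], np.sqrt(np.maximum(pred_var[:3], 0)))
```

The predictive standard deviation is small near the training inputs and reverts toward the prior far from them, which is the calibrated-uncertainty behaviour described above.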
Common Pitfalls
Ignoring prior sensitivity: Always check that your conclusions hold under different reasonable priors.
Trusting unchecked MCMC: Always examine trace plots, $\hat{R}$, and effective sample size before interpreting results.
Overly vague priors: "Non-informative" priors can be improper or lead to poor performance. Weakly informative priors are generally preferred.
Forgetting computational cost: Full Bayesian inference with MCMC can be orders of magnitude slower than point estimation. Budget your compute accordingly.
Confusing credible intervals with confidence intervals: A 95% credible interval means there is a 95% probability the parameter lies within the interval (given the model and data). This is a direct probability statement, unlike a frequentist confidence interval.
Bayesian vs. Frequentist: A Pragmatic View
Bayesian methods treat parameters as random variables with distributions; frequentist methods treat them as fixed but unknown.
In practice, the choice often matters less than correct implementation: with enough data, the likelihood dominates the prior and the two approaches give similar answers.
Bayesian methods shine with small data, sequential updating, and when uncertainty quantification is needed.
Frequentist methods are often simpler and faster for large-scale problems.
Many modern techniques (regularization, dropout, ensemble methods) have both Bayesian and frequentist interpretations.