Chapter 3: Key Takeaways
Probability Theory and Statistical Inference
- Every loss function is a negative log-likelihood. Binary cross-entropy comes from the Bernoulli distribution, categorical cross-entropy from the categorical distribution, and MSE from the Gaussian distribution. Choosing a loss function is choosing a probabilistic model of your data — make that choice consciously, not by convention.
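A minimal sketch of this correspondence in plain Python (function names are illustrative): writing out the Bernoulli and Gaussian negative log-likelihoods shows that one is exactly binary cross-entropy and the other is squared error plus a constant.

```python
import math

def bernoulli_nll(y, p):
    """Negative log-likelihood of label y in {0, 1} under Bernoulli(p).
    Term for term, this is the binary cross-entropy formula."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# The Gaussian NLL is (y - mu)^2 / (2 sigma^2) plus a constant that does
# not depend on mu, so its argmin over mu coincides with the MSE argmin.
y = 1.2
candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
nll_argmin = min(candidates, key=lambda m: gaussian_nll(y, m))
mse_argmin = min(candidates, key=lambda m: (y - m) ** 2)
assert nll_argmin == mse_argmin
```

Minimizing either objective therefore selects the same parameters; the probabilistic view only makes the modeling assumption (Bernoulli vs. Gaussian noise) explicit.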
- The exponential family unifies distributions and explains algorithm structure. Bernoulli, Gaussian, Poisson, Categorical, and Exponential distributions all belong to the exponential family. This shared structure guarantees convex log-likelihoods, enables conjugate Bayesian inference, and explains why softmax and logistic functions appear as natural output activations.
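To make the "natural activation" claim concrete, here is the Bernoulli distribution written in exponential-family form, $p(x \mid \theta) = \exp(x\theta - A(\theta))$ with natural parameter $\theta = \log\frac{p}{1-p}$ and log-partition $A(\theta) = \log(1 + e^\theta)$. Inverting the canonical link to recover the mean gives exactly the sigmoid — a small check, with illustrative function names:

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_pmf_expfam(x, theta):
    """Bernoulli pmf in exponential-family form:
    p(x | theta) = exp(x * theta - A(theta)),
    with log-partition A(theta) = log(1 + e^theta)."""
    A = math.log(1.0 + math.exp(theta))
    return math.exp(x * theta - A)

p = 0.7
theta = math.log(p / (1 - p))  # natural parameter = the logit of p

# Inverting the canonical link recovers the mean -- and it is the sigmoid:
assert abs(sigmoid(theta) - p) < 1e-12
# The exponential-family form reproduces the ordinary pmf:
assert abs(bernoulli_pmf_expfam(1, theta) - p) < 1e-12
assert abs(bernoulli_pmf_expfam(0, theta) - (1 - p)) < 1e-12
```

The same construction for the categorical distribution yields softmax, which is why these two activations keep reappearing.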
- MLE is asymptotically optimal, but finite samples require care. The MLE is consistent (converges to the truth), asymptotically normal, and efficient (achieves the Cramér-Rao bound). But with small datasets — common in clinical trials, A/B tests with small effects, and rare-event modeling — these asymptotic guarantees offer little protection. Bayesian methods and bootstrap confidence intervals provide more reliable uncertainty estimates in the finite-sample regime.
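A percentile bootstrap is one of the finite-sample tools mentioned above; a minimal sketch in plain Python (the sample values are made up for illustration):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.
    Resample the data with replacement, recompute the statistic each
    time, and read off the empirical (alpha/2, 1 - alpha/2) quantiles."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# A small, clinical-trial-sized sample (hypothetical numbers):
sample = [2.1, 3.4, 1.8, 2.9, 3.1, 2.2, 2.7, 3.0]
lo, hi = bootstrap_ci(sample)
assert lo <= statistics.mean(sample) <= hi
```

Unlike a Wald interval, this requires no normality assumption — only that resampling the observed data mimics sampling from the population.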
- Frequentist and Bayesian are tools, not tribes. A frequentist confidence interval tells you about repeated sampling behavior; a Bayesian credible interval tells you about posterior belief given data and prior. Use frequentist methods when regulatory frameworks require them or when data is abundant. Use Bayesian methods when prior knowledge is strong, data is scarce, or you need sequential updating. The right question is "which is more useful here?" not "which is correct?"
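The sequential-updating advantage can be sketched with the Beta-Bernoulli conjugate pair (batch counts here are illustrative): each batch of data updates the posterior in closed form, and the result depends only on the running totals, not on the order of arrival.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate Bayesian update for a Bernoulli success rate:
    Beta(alpha, beta) prior + observed counts -> Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Posterior mean of the success rate under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Start from a uniform Beta(1, 1) prior, then update as batches arrive --
# no refitting from scratch is ever needed.
a, b = 1.0, 1.0
for successes, failures in [(3, 1), (2, 2), (5, 3)]:
    a, b = beta_update(a, b, successes, failures)

# Totals: 10 successes, 6 failures -> posterior Beta(11, 7).
assert (a, b) == (11.0, 7.0)
assert abs(beta_mean(a, b) - 11 / 18) < 1e-12
```

A credible interval read off this posterior answers "what do I believe the rate is, given data and prior" — a different question from the confidence interval's repeated-sampling guarantee.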
- Concentration inequalities give finite-sample guarantees. The CLT tells you what happens as $n \to \infty$; Hoeffding's inequality tells you what you can guarantee for a specific $n$. Use concentration inequalities to determine sample sizes for A/B tests, bound generalization error, and reason about how quickly your estimators converge.
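For the A/B-test use case, Hoeffding's bound for $[0,1]$-valued observations, $P(|\bar{X}_n - \mu| \ge \epsilon) \le 2e^{-2n\epsilon^2}$, can be inverted to get a required sample size — a short sketch (function name is illustrative):

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n such that, for i.i.d. observations bounded in [0, 1],
    Hoeffding's inequality guarantees
        P(|sample mean - true mean| >= epsilon) <= delta.
    Rearranging 2 * exp(-2 * n * epsilon^2) <= delta gives
        n >= log(2 / delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# Guarantee a conversion-rate estimate within +/- 1 point at 95% confidence:
n = hoeffding_sample_size(epsilon=0.01, delta=0.05)
# Hoeffding is distribution-free, so this n is conservative relative to a
# CLT-based power calculation -- but it holds at every finite sample size.
```

The price of a guarantee that needs no distributional assumptions is conservatism: the required $n$ is larger than a normal-approximation calculation would suggest.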
- Monte Carlo methods convert intractable integrals into computable averages. When closed-form solutions are unavailable — posterior expectations, marginal likelihoods, expectations over complex distributions — draw samples and average. The standard error decreases as $1/\sqrt{N}$, and importance sampling can accelerate convergence when the proposal distribution is well-chosen.
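A plain Monte Carlo sketch of the draw-and-average recipe, including the $1/\sqrt{N}$ standard error, on an integral whose answer is known (the helper names are illustrative):

```python
import math
import random

def mc_estimate(f, sampler, n, seed=0):
    """Monte Carlo estimate of E[f(X)] with its standard error.
    Draw n samples, average f over them; the standard error shrinks
    like 1 / sqrt(n)."""
    rng = random.Random(seed)
    vals = [f(sampler(rng)) for _ in range(n)]
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, math.sqrt(var / n)

# Estimate E[X^2] for X ~ Uniform(0, 1); the exact value is 1/3.
est, se = mc_estimate(lambda x: x * x, lambda rng: rng.random(), n=100_000)
assert abs(est - 1 / 3) < 5 * se  # within a few standard errors of truth
```

Quadrupling the sample size only halves the standard error, which is why variance-reduction tricks such as importance sampling matter in practice.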
- Regularization is a prior in disguise. $L_2$ regularization is MAP estimation with a Gaussian prior; $L_1$ regularization is MAP estimation with a Laplace prior. The regularization coefficient $\lambda$ controls the prior precision. This connection means that every regularized model is implicitly a Bayesian model with a specific prior belief about parameter magnitudes.
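The correspondence can be verified in one dimension: for a Gaussian likelihood $N(\mu, \sigma^2)$ with prior $\mu \sim N(0, \tau^2)$, the MAP estimate equals the ridge ($L_2$-penalized) minimizer with $\lambda = \sigma^2 / \tau^2$. A sketch under those assumptions, with both closed forms derived by setting derivatives to zero:

```python
def map_estimate(data, sigma=1.0, tau=2.0):
    """Posterior mode of mu with likelihood N(mu, sigma^2) and prior
    mu ~ N(0, tau^2). Setting the log-posterior derivative to zero
    gives mu = sum(x) / (n + sigma^2 / tau^2)."""
    lam = sigma ** 2 / tau ** 2
    return sum(data) / (len(data) + lam)

def ridge_estimate(data, lam):
    """Minimizer of sum((x - mu)^2) + lam * mu^2, again via the
    first-order condition: mu = sum(x) / (n + lam)."""
    return sum(data) / (len(data) + lam)

data = [1.0, 2.0, 3.0, 2.5]
lam = 1.0 ** 2 / 2.0 ** 2  # lambda = sigma^2 / tau^2

# The two estimates coincide exactly when lambda matches the prior precision:
assert abs(map_estimate(data) - ridge_estimate(data, lam)) < 1e-12
```

Shrinking $\tau$ (a tighter prior around zero) raises $\lambda$ and pulls the estimate toward zero — regularization strength and prior confidence are the same dial.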