Chapter 3: Probability Theory and Statistical Inference

The Mathematical Language of Uncertainty

"Probability theory is nothing but common sense reduced to calculation." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply probability axioms, conditional probability, and Bayes' theorem to derive posterior distributions
  2. Characterize common probability distributions and explain their roles in ML (Gaussian, Bernoulli, Categorical, Poisson, Exponential family)
  3. Derive maximum likelihood estimation and connect it to common loss functions
  4. Distinguish frequentist and Bayesian approaches to inference and know when each is appropriate
  5. Apply the central limit theorem, law of large numbers, and concentration inequalities to reason about estimator behavior
  6. Implement Monte Carlo methods for estimation and integration

3.1 Why Probability Is the Language of Machine Learning

Every machine learning model makes predictions under uncertainty. A click-prediction model at StreamRec assigns $P(\text{click} \mid \text{user}, \text{item}) = 0.12$. A clinical model at MediCore Pharmaceuticals estimates $P(\text{adverse event} \mid \text{drug}, \text{patient history})$. A climate model from the Pacific Climate Research Consortium outputs a distribution over 2050 global temperatures rather than a point estimate. In every case, the model speaks the language of probability.

But probability in ML is not just notation. The choice of probability distribution determines your loss function. The distinction between frequentist and Bayesian inference determines how you quantify uncertainty. The exponential family structure determines which models admit closed-form solutions. Understanding these connections — understanding why — is what separates the practitioner who uses cross-entropy loss because "it works" from the practitioner who derives cross-entropy loss from first principles and recognizes when a different loss function would be more appropriate.

In Chapter 2, we derived the optimization machinery that finds model parameters. This chapter answers the prior question: what objective should we be optimizing, and why?

Common Misconception: "Probability theory is just a review topic — I learned it in my first statistics course." The probability covered in introductory courses (computing $P(A \cup B)$, working with the normal distribution) is necessary but insufficient. This chapter develops the machinery of exponential families, maximum likelihood estimation, Fisher information, and Monte Carlo methods — the mathematical infrastructure that every ML algorithm is built on.


3.2 Probability Spaces: The Formal Foundation

A probability space is a triple $(\Omega, \mathcal{F}, P)$:

  • $\Omega$ is the sample space — the set of all possible outcomes
  • $\mathcal{F}$ is a $\sigma$-algebra — a collection of subsets of $\Omega$ (the "events" we can assign probabilities to)
  • $P: \mathcal{F} \to [0, 1]$ is a probability measure satisfying the Kolmogorov axioms

The three Kolmogorov axioms are:

  1. Non-negativity: $P(A) \geq 0$ for all $A \in \mathcal{F}$
  2. Normalization: $P(\Omega) = 1$
  3. Countable additivity: For mutually exclusive events $A_1, A_2, \ldots$, $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$

From these three axioms, everything else follows: the addition rule, the complement rule, the inclusion-exclusion principle, and ultimately the entire theory of probability distributions, estimation, and inference.
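These consequences can be checked concretely on a finite sample space. A minimal sketch, using a fair six-sided die as an illustrative probability space:

```python
# Sample space for a fair die; the uniform measure assigns |A| / |Omega| to each event.
omega = set(range(1, 7))
P = lambda event: len(event) / len(omega)

A = {2, 4, 6}   # "roll is even"
B = {4, 5, 6}   # "roll is at least 4"

# Complement rule: P(A^c) = 1 - P(A)
print(f"P(A^c) = {P(omega - A):.4f} = 1 - P(A) = {1 - P(A):.4f}")

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
print(f"P(A ∪ B) = {lhs:.4f}, P(A) + P(B) - P(A ∩ B) = {rhs:.4f}")
```

Neither rule is an axiom; both fall out of countable additivity applied to disjoint decompositions of the events involved.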

Mathematical Foundation: The $\sigma$-algebra $\mathcal{F}$ may seem like a technicality, but it matters in practice. When we work with continuous random variables (prediction scores, model parameters, temperature projections), not every subset of $\mathbb{R}$ can be assigned a probability. The $\sigma$-algebra restricts us to "measurable" sets. In ML, we almost always work with the Borel $\sigma$-algebra on $\mathbb{R}^n$, which includes every set you would construct in practice. But this subtlety explains why measure theory exists and why some probability results require careful statement.

Random Variables

A random variable $X: \Omega \to \mathbb{R}$ maps outcomes to real numbers. It is the bridge between the abstract probability space and the concrete quantities we compute with. In ML, random variables represent everything: feature values, model predictions, labels, noise, and parameters (in the Bayesian framework).

For a discrete random variable, the probability mass function (PMF) is:

$$p_X(x) = P(X = x)$$

For a continuous random variable, the probability density function (PDF) $f_X(x)$ satisfies:

$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$

The cumulative distribution function (CDF) unifies both cases:

$$F_X(x) = P(X \leq x)$$

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Discrete random variable: Bernoulli (click/no-click)
p_click = 0.12
bernoulli_rv = stats.bernoulli(p=p_click)
print(f"P(click=1) = {bernoulli_rv.pmf(1):.4f}")
print(f"P(click=0) = {bernoulli_rv.pmf(0):.4f}")

# Continuous random variable: Gaussian (prediction scores)
mu, sigma = 0.5, 0.15
gaussian_rv = stats.norm(loc=mu, scale=sigma)
x = np.linspace(0, 1, 1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(x, gaussian_rv.pdf(x), 'b-', linewidth=2)
axes[0].set_title("PDF: Model Score Distribution")
axes[0].set_xlabel("Score")

axes[1].plot(x, gaussian_rv.cdf(x), 'r-', linewidth=2)
axes[1].set_title("CDF: P(Score ≤ x)")
axes[1].set_xlabel("Score")
plt.tight_layout()

3.3 Joint, Marginal, and Conditional Distributions

In ML, we rarely work with a single variable. A recommendation system models the joint behavior of users and items. A causal inference study models the relationship between treatment and outcome conditional on confounders.

Joint Distributions

The joint distribution of two random variables $X$ and $Y$ captures their simultaneous behavior:

  • Discrete: $p_{X,Y}(x, y) = P(X = x, Y = y)$
  • Continuous: $f_{X,Y}(x, y)$ such that $P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy$

Marginal Distributions

The marginal distribution recovers a single variable's distribution by summing (or integrating) out the other:

$$p_X(x) = \sum_y p_{X,Y}(x, y) \qquad \text{(discrete)}$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \qquad \text{(continuous)}$$

This operation — marginalizing — is fundamental. When a climate model outputs a joint distribution over temperature and precipitation, and a policymaker asks only about temperature, we marginalize out precipitation.

Conditional Distributions

The conditional distribution of $Y$ given $X = x$ is:

$$p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}, \quad \text{provided } p_X(x) > 0$$

This is the workhorse of supervised learning. A classifier models $P(Y \mid X)$: the distribution of labels given features. A generative model decomposes the joint as $p(x, y) = p(y \mid x) \cdot p(x)$.

Independence

$X$ and $Y$ are independent if and only if $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ for all $x, y$. Conditional independence — $X \perp\!\!\!\perp Y \mid Z$ — means $p(x, y \mid z) = p(x \mid z) \cdot p(y \mid z)$. Conditional independence is the structural assumption behind graphical models, naive Bayes classifiers, and the causal DAGs we will study in Part III.

import numpy as np

# Simulating joint, marginal, and conditional distributions
# Example: User engagement (X = session length) and content type (Y = category)
np.random.seed(42)

# Joint distribution as a 2D histogram
n_users = 50_000
# Category: 0=articles, 1=videos, 2=podcasts
category = np.random.choice([0, 1, 2], size=n_users, p=[0.4, 0.35, 0.25])

# Session length depends on category (conditional distribution)
session_length = np.where(
    category == 0,
    np.random.exponential(scale=5.0, size=n_users),   # articles: shorter
    np.where(
        category == 1,
        np.random.exponential(scale=12.0, size=n_users),  # videos: longer
        np.random.exponential(scale=8.0, size=n_users)    # podcasts: medium
    )
)

# Marginal distribution of session length
print(f"Marginal E[session_length] = {session_length.mean():.2f} min")

# Conditional distributions
for cat, name in enumerate(["articles", "videos", "podcasts"]):
    mask = category == cat
    print(f"E[session_length | category={name}] = {session_length[mask].mean():.2f} min")

# Test independence: if X ⊥ Y, then P(X|Y=y) should not depend on y
# Here they are NOT independent — session length depends on category
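Conditional independence can be made just as concrete. A sketch with invented variables, in which $X$ and $Y$ are correlated only through a shared latent $Z$ — marginally dependent, conditionally independent:

```python
import numpy as np

# X ⊥ Y | Z by construction: given Z, X and Y are sampled independently,
# yet marginally they are correlated because both depend on Z.
rng = np.random.default_rng(0)
n = 200_000
z = rng.choice([0, 1], size=n)                   # latent "engagement level"
x = rng.normal(loc=2.0 * z, scale=1.0, size=n)   # depends only on z
y = rng.normal(loc=3.0 * z, scale=1.0, size=n)   # depends only on z

# Marginally: strong correlation, driven entirely by Z
print(f"corr(X, Y)        = {np.corrcoef(x, y)[0, 1]:.3f}")

# Conditionally on Z: the correlation vanishes
for level in [0, 1]:
    m = z == level
    print(f"corr(X, Y | Z={level}) = {np.corrcoef(x[m], y[m])[0, 1]:.3f}")
```

This is exactly the structure naive Bayes exploits: features that look dependent become independent once the class variable is conditioned on.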

3.4 Bayes' Theorem: The Engine of Learning

Bayes' theorem follows directly from the definition of conditional probability:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

In the context of statistical inference, where $\theta$ represents parameters and $D$ represents observed data:

$$\underbrace{p(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}}$$

This equation is arguably the single most important formula in data science. Every term has a clear interpretation:

  • Prior $p(\theta)$: What we believe about $\theta$ before seeing data. In the MediCore clinical trial, this encodes existing medical knowledge about plausible treatment effect sizes.
  • Likelihood $p(D \mid \theta)$: How probable the observed data is under each parameter value. This is the function we optimize in maximum likelihood estimation.
  • Posterior $p(\theta \mid D)$: Our updated belief about $\theta$ after observing data. This is what we ultimately want.
  • Evidence $p(D) = \int p(D \mid \theta) p(\theta) \, d\theta$: A normalizing constant ensuring the posterior integrates to 1. This integral is often intractable, which motivates the Monte Carlo methods in Section 3.10.

Research Insight: The Bayesian interpretation of learning — start with a prior, observe data, update to a posterior — is not just a philosophical position. It is the mathematical framework underlying regularization ($L_2$ regularization = Gaussian prior on weights), dropout (approximate Bayesian inference, Gal and Ghahramani, 2016), and the entire field of Bayesian deep learning. Understanding Bayes' theorem at this level is prerequisite to Part IV of this book.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def bayesian_update_beta_binomial(
    prior_alpha: float,
    prior_beta: float,
    n_successes: int,
    n_trials: int,
) -> tuple[float, float]:
    """Update Beta prior with Binomial likelihood (conjugate update).

    Args:
        prior_alpha: Alpha parameter of Beta prior.
        prior_beta: Beta parameter of Beta prior.
        n_successes: Number of observed successes.
        n_trials: Total number of trials.

    Returns:
        Tuple of (posterior_alpha, posterior_beta).
    """
    posterior_alpha = prior_alpha + n_successes
    posterior_beta = prior_beta + (n_trials - n_successes)
    return posterior_alpha, posterior_beta

# MediCore Pharmaceuticals example: estimating drug response rate
# Prior: clinical experts believe response rate is around 0.3 (±0.1)
# We encode this as Beta(6, 14) — E[p] = 6/20 = 0.3, moderate confidence
prior_a, prior_b = 6.0, 14.0

# Observational data: 47 responders out of 120 patients
n_successes, n_trials = 47, 120

post_a, post_b = bayesian_update_beta_binomial(prior_a, prior_b, n_successes, n_trials)

theta = np.linspace(0, 1, 1000)
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(theta, stats.beta.pdf(theta, prior_a, prior_b),
        'b--', linewidth=2, label=f"Prior: Beta({prior_a}, {prior_b})")
ax.plot(theta, stats.beta.pdf(theta, post_a, post_b),
        'r-', linewidth=2, label=f"Posterior: Beta({post_a}, {post_b})")
ax.axvline(n_successes / n_trials, color='green', linestyle=':',
           label=f"MLE = {n_successes / n_trials:.3f}")

ax.set_xlabel(r"Response rate $\theta$")
ax.set_ylabel("Density")
ax.set_title("Bayesian Update: MediCore Drug Response Rate")
ax.legend()

# 95% credible interval
ci_low, ci_high = stats.beta.ppf([0.025, 0.975], post_a, post_b)
print(f"Prior mean: {prior_a / (prior_a + prior_b):.3f}")
print(f"MLE: {n_successes / n_trials:.3f}")
print(f"Posterior mean: {post_a / (post_a + post_b):.3f}")
print(f"95% credible interval: [{ci_low:.3f}, {ci_high:.3f}]")

3.5 Probability Distributions and Their ML Roles

Every machine learning model implicitly or explicitly assumes a probability distribution over its inputs, outputs, or parameters. This section catalogs the distributions that appear most frequently, connecting each one to specific ML applications.

The Bernoulli Distribution

The simplest non-trivial distribution: $X \in \{0, 1\}$ with $P(X = 1) = p$.

$$p(x \mid p) = p^x (1-p)^{1-x}$$

ML connection: Binary classification. When your click-prediction model at StreamRec outputs $P(\text{click}) = 0.12$, it is modeling the label as a Bernoulli random variable. The loss function for training this model — binary cross-entropy — is the negative log-likelihood of the Bernoulli distribution. We derive this precisely in Section 3.7.

The Categorical Distribution

The Bernoulli generalized to $K$ categories: $X \in \{1, 2, \ldots, K\}$ with $P(X = k) = p_k$, where $\sum_k p_k = 1$.

$$p(x \mid \mathbf{p}) = \prod_{k=1}^{K} p_k^{\mathbb{1}[x=k]}$$

ML connection: Multi-class classification and language modeling. When a language model predicts the next token from a vocabulary of 50,000 words, it outputs a categorical distribution. The softmax function converts a vector of logits to categorical probabilities: $p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$. The categorical cross-entropy loss — the standard loss for classification — is the negative log-likelihood of this distribution.
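As a quick check of this correspondence, a minimal sketch (with made-up logits) converting logits to categorical probabilities via softmax and computing the per-sample cross-entropy as a negative log-probability:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                 # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative logits for K = 3 classes
probs = softmax(logits)
print("probs:", np.round(probs, 4), "sum:", probs.sum())

# For a sample with true class k, the per-sample loss is -log p_k —
# the negative log-likelihood of the categorical distribution.
true_class = 0
cce = -np.log(probs[true_class])
print(f"cross-entropy for true class {true_class}: {cce:.4f}")
```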

The Gaussian (Normal) Distribution

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

ML connections:

  • Regression: Assuming Gaussian noise on the target variable leads to the mean squared error loss. Specifically, if $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, then the MLE for $f$ minimizes $\sum_i (y_i - f(x_i))^2$.
  • Latent spaces: Variational autoencoders place a $\mathcal{N}(0, I)$ prior on the latent space.
  • Bayesian priors: A Gaussian prior on neural network weights is equivalent to $L_2$ regularization.

The multivariate Gaussian over $\mathbf{x} \in \mathbb{R}^d$:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

The eigenvectors of $\boldsymbol{\Sigma}$ are the principal components (Chapter 1). The squared Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is the natural distance metric for Gaussian data.
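A brief sketch comparing the quadratic form above to SciPy's `mahalanobis` helper (which returns its square root), with an illustrative covariance matrix:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# The quadratic form in the text is the *squared* Mahalanobis distance;
# scipy.spatial.distance.mahalanobis returns its square root.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([1.0, 1.0])
d_sq = (x - mu) @ Sigma_inv @ (x - mu)    # quadratic form
d_scipy = mahalanobis(x, mu, Sigma_inv)   # sqrt of the quadratic form

print(f"squared Mahalanobis distance: {d_sq:.4f}")
print(f"scipy mahalanobis (sqrt):     {d_scipy:.4f}")
```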

The Poisson Distribution

$$p(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

ML connection: Count data — the number of clicks per session, the number of API errors per hour, the number of adverse events per patient-year. When StreamRec models "how many items will this user view in a session," a Poisson distribution is the natural starting point. Poisson regression (log-linear model) is the standard approach for count outcomes.

The Exponential Distribution

$$f(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0$$

ML connection: Modeling inter-event times — the time between clicks, the time between server failures, the time to churn. It is the continuous counterpart of the geometric distribution and has the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$.
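The memoryless property is easy to verify, both analytically via the survival function and by simulation (the values of $\lambda$, $s$, $t$ below are illustrative):

```python
import numpy as np
from scipy import stats

# Memorylessness: P(X > s + t | X > s) = P(X > t) for the exponential.
lam = 0.5
rv = stats.expon(scale=1.0 / lam)   # scipy parameterizes by scale = 1/lambda
s, t = 2.0, 3.0

lhs = rv.sf(s + t) / rv.sf(s)       # conditional survival probability
rhs = rv.sf(t)
print(f"P(X > s+t | X > s) = {lhs:.6f}")
print(f"P(X > t)           = {rhs:.6f}")

# Empirical check via simulation
samples = np.random.default_rng(1).exponential(scale=1.0 / lam, size=1_000_000)
cond = (samples > s + t).mean() / (samples > s).mean()
print(f"simulated conditional: {cond:.4f}")
```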

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# A visual catalog of distributions in ML
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Bernoulli
p = 0.3
axes[0, 0].bar([0, 1], [1-p, p], color=['steelblue', 'coral'], width=0.4)
axes[0, 0].set_title("Bernoulli(p=0.3)\nBinary Classification")
axes[0, 0].set_xticks([0, 1])
axes[0, 0].set_xticklabels(["No Click", "Click"])

# Categorical
probs = [0.05, 0.15, 0.3, 0.25, 0.15, 0.10]
axes[0, 1].bar(range(6), probs, color='steelblue')
axes[0, 1].set_title("Categorical(K=6)\nMulti-class / Next Token")
axes[0, 1].set_xlabel("Category")

# Gaussian
x = np.linspace(-4, 4, 300)
axes[0, 2].plot(x, stats.norm.pdf(x), 'b-', linewidth=2)
axes[0, 2].fill_between(x, stats.norm.pdf(x), alpha=0.2)
axes[0, 2].set_title("Gaussian(μ=0, σ=1)\nRegression / Latent Spaces")

# Poisson
k = np.arange(0, 20)
axes[1, 0].bar(k, stats.poisson.pmf(k, mu=5), color='steelblue')
axes[1, 0].set_title("Poisson(λ=5)\nCount Data")
axes[1, 0].set_xlabel("Count")

# Exponential
x = np.linspace(0, 5, 300)
axes[1, 1].plot(x, stats.expon.pdf(x, scale=1.0), 'b-', linewidth=2)
axes[1, 1].fill_between(x, stats.expon.pdf(x, scale=1.0), alpha=0.2)
axes[1, 1].set_title("Exponential(λ=1)\nInter-event Times")

# Beta (for Bayesian inference)
theta = np.linspace(0, 1, 300)
for a, b, label in [(2, 5, "α=2,β=5"), (5, 5, "α=5,β=5"), (8, 2, "α=8,β=2")]:
    axes[1, 2].plot(theta, stats.beta.pdf(theta, a, b), linewidth=2, label=label)
axes[1, 2].set_title("Beta(α, β)\nBayesian Priors / Proportions")
axes[1, 2].legend(fontsize=8)

plt.tight_layout()

3.6 The Exponential Family: A Unifying Framework

The distributions above are not isolated. They are all members of the exponential family, a unifying framework that explains why so many ML algorithms have the same structure.

A distribution belongs to the exponential family if it can be written as:

$$p(x \mid \boldsymbol{\eta}) = h(x) \exp\left(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

where:

  • $\boldsymbol{\eta}$ are the natural parameters
  • $\mathbf{T}(x)$ is the sufficient statistic — the function of the data that contains all information about $\boldsymbol{\eta}$
  • $A(\boldsymbol{\eta})$ is the log-partition function (ensures normalization)
  • $h(x)$ is the base measure

Why the Exponential Family Matters for ML

  1. Sufficient statistics: $\mathbf{T}(x)$ captures everything the data tells us about the parameters. For a Gaussian, $T(x) = (x, x^2)$ — the sample mean and sample variance are sufficient. This means we can compress $n$ data points into a fixed-size summary without losing any information about the parameters.

  2. Conjugate priors: For every exponential family likelihood, there exists a prior distribution such that the posterior has the same functional form as the prior. This enables closed-form Bayesian inference. The Beta-Binomial conjugacy we used in Section 3.4 is an instance of this general pattern.

  3. Maximum likelihood has a nice form: The MLE for an exponential family sets the expected sufficient statistic equal to the observed sufficient statistic: $\mathbb{E}_{\hat{\boldsymbol{\eta}}}[\mathbf{T}(X)] = \frac{1}{n}\sum_{i=1}^n \mathbf{T}(x_i)$. This unifies the sample mean (Gaussian), sample proportion (Bernoulli), and sample rate (Poisson) as instances of a single principle.

  4. Generalized linear models (GLMs): The exponential family is the theoretical foundation of GLMs (logistic regression, Poisson regression, etc.), which link a linear predictor to the natural parameter.

Mathematical Foundation: The log-partition function $A(\boldsymbol{\eta})$ is remarkably powerful. Its derivatives generate the moments of the distribution: $$\frac{\partial A}{\partial \eta_i} = \mathbb{E}[T_i(X)], \qquad \frac{\partial^2 A}{\partial \eta_i \partial \eta_j} = \text{Cov}(T_i(X), T_j(X))$$ The second derivative is always positive semidefinite, which means $A(\boldsymbol{\eta})$ is convex. This convexity guarantees that the negative log-likelihood is also convex — the MLE is unique and gradient descent will find it.
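This moment-generating property can be confirmed numerically. A sketch for the Bernoulli case, differentiating $A(\eta) = \log(1 + e^\eta)$ by finite differences:

```python
import numpy as np

# For Bernoulli: A'(eta) should equal E[X] = p, and A''(eta) should equal
# Var(X) = p(1 - p), when eta is the logit of p.
A = lambda eta: np.log1p(np.exp(eta))

p = 0.3
eta = np.log(p / (1 - p))   # natural parameter (logit)
h = 1e-5

A_prime = (A(eta + h) - A(eta - h)) / (2 * h)            # central difference
A_double = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # second difference

print(f"A'(eta)  = {A_prime:.6f}   vs  E[X] = p        = {p:.6f}")
print(f"A''(eta) = {A_double:.6f}   vs  Var(X) = p(1-p) = {p * (1 - p):.6f}")
```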

Mapping Distributions to the Exponential Family

| Distribution | Natural parameter $\eta$ | Sufficient statistic $T(x)$ | Log-partition $A(\eta)$ |
|---|---|---|---|
| Bernoulli($p$) | $\log\frac{p}{1-p}$ (logit) | $x$ | $\log(1 + e^\eta)$ |
| Gaussian($\mu$, known $\sigma^2$) | $\mu / \sigma^2$ | $x$ | $\eta^2 \sigma^2 / 2$ |
| Poisson($\lambda$) | $\log \lambda$ | $x$ | $e^\eta$ |
| Exponential($\lambda$) | $-\lambda$ | $x$ | $-\log(-\eta)$ |
| Categorical($\mathbf{p}$) | $\eta_k = \log(p_k / p_K)$ | one-hot vector | $\log\sum_k e^{\eta_k}$ (logsumexp) |

Notice the last row: the log-partition function of the categorical distribution is the logsumexp function, and the mapping from natural parameters to probabilities is the softmax function. This is not a coincidence — it is why softmax is the "natural" output activation for multi-class classification.
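A quick numerical check of this claim: the gradient of logsumexp is softmax, exactly as the moment-generating property of $A(\boldsymbol{\eta})$ predicts ($\partial A / \partial \eta_k = \mathbb{E}[T_k(X)] = p_k$):

```python
import numpy as np
from scipy.special import logsumexp, softmax

# Finite-difference gradient of the categorical log-partition (logsumexp)
# compared against softmax, for illustrative natural parameters.
eta = np.array([1.5, -0.3, 0.8, 2.1])
h = 1e-6

grad = np.zeros_like(eta)
for k in range(len(eta)):
    e_plus, e_minus = eta.copy(), eta.copy()
    e_plus[k] += h
    e_minus[k] -= h
    grad[k] = (logsumexp(e_plus) - logsumexp(e_minus)) / (2 * h)

print("numerical gradient:", np.round(grad, 6))
print("softmax(eta):      ", np.round(softmax(eta), 6))
```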

import numpy as np

def exponential_family_bernoulli(x: np.ndarray, eta: float) -> np.ndarray:
    """Compute Bernoulli log-probability in exponential family form.

    Args:
        x: Array of observations (0 or 1).
        eta: Natural parameter (log-odds).

    Returns:
        Array of log-probabilities.
    """
    T_x = x                         # sufficient statistic
    A_eta = np.log(1 + np.exp(eta)) # log-partition function
    # h(x) = 1 for Bernoulli, so log h(x) = 0
    return eta * T_x - A_eta

def exponential_family_gaussian(
    x: np.ndarray, eta: float, sigma_sq: float = 1.0
) -> np.ndarray:
    """Compute Gaussian log-probability in exponential family form (known variance).

    Args:
        x: Array of observations.
        eta: Natural parameter (mu / sigma^2).
        sigma_sq: Known variance.

    Returns:
        Array of log-probabilities.
    """
    T_x = x
    A_eta = 0.5 * eta**2 * sigma_sq
    log_h_x = -0.5 * (np.log(2 * np.pi * sigma_sq) + x**2 / sigma_sq)
    return eta * T_x - A_eta + log_h_x

# Verify: Bernoulli with p=0.3
p = 0.3
eta = np.log(p / (1 - p))  # logit transform
x = np.array([0, 1])
log_probs_ef = exponential_family_bernoulli(x, eta)
log_probs_direct = np.log(np.array([1 - p, p]))
print("Exponential family:", np.exp(log_probs_ef))
print("Direct computation:", np.exp(log_probs_direct))
assert np.allclose(log_probs_ef, log_probs_direct), "Mismatch!"
print("Verified: exponential family form matches direct computation")

3.7 Maximum Likelihood Estimation: The Centerpiece

Maximum likelihood estimation (MLE) is the most important estimation principle in machine learning. Nearly every loss function you have used — mean squared error, binary cross-entropy, categorical cross-entropy — is the negative log-likelihood of a probability distribution. Understanding this connection transforms loss functions from arbitrary choices into principled consequences of modeling assumptions.

The MLE Principle

Given $n$ independent observations $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ drawn from a distribution $p(x \mid \theta)$, the likelihood function is:

$$\mathcal{L}(\theta) = \prod_{i=1}^n p(x_i \mid \theta)$$

The log-likelihood (more numerically stable and mathematically convenient) is:

$$\ell(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta)$$

The maximum likelihood estimate is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \ell(\theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

Equivalently, minimizing the negative log-likelihood:

$$\hat{\theta}_{\text{MLE}} = \arg\min_\theta \left[ -\sum_{i=1}^n \log p(x_i \mid \theta) \right]$$

Derivation: MLE for the Bernoulli Model

This derivation connects directly to the StreamRec progressive project. Consider $n$ user interactions, each with a binary outcome: click ($x_i = 1$) or no click ($x_i = 0$). Under a Bernoulli model, $P(X = 1) = p$.

Step 1: Write the likelihood.

$$\mathcal{L}(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}$$

Step 2: Take the log.

$$\ell(p) = \sum_{i=1}^n \left[ x_i \log p + (1 - x_i) \log(1-p) \right]$$

Step 3: Take the derivative and set to zero.

$$\frac{d\ell}{dp} = \sum_{i=1}^n \left[ \frac{x_i}{p} - \frac{1-x_i}{1-p} \right] = 0$$

$$\frac{\sum_i x_i}{p} = \frac{n - \sum_i x_i}{1-p}$$

$$(1-p) \sum_i x_i = p(n - \sum_i x_i)$$

$$\sum_i x_i - p \sum_i x_i = pn - p \sum_i x_i$$

$$\hat{p}_{\text{MLE}} = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}$$

The MLE for the Bernoulli parameter is the sample proportion. This is both intuitive and rigorous.

Step 4: Verify it is a maximum (second derivative test).

$$\frac{d^2\ell}{dp^2} = -\sum_{i=1}^n \left[ \frac{x_i}{p^2} + \frac{1-x_i}{(1-p)^2} \right] < 0$$

The second derivative is strictly negative for $p \in (0, 1)$, confirming a maximum.
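As a sanity check on the derivation, a grid search over $p$ (with simulated click data) peaks at the sample proportion:

```python
import numpy as np

# Simulated Bernoulli click data; the log-likelihood over a grid of p values
# should be maximized at the sample proportion, matching the derivation.
rng = np.random.default_rng(7)
x = rng.binomial(1, 0.12, size=5_000)

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = x.sum() * np.log(p_grid) + (len(x) - x.sum()) * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(log_lik)]
print(f"sample proportion: {x.mean():.4f}")
print(f"grid-search MLE:   {p_hat_grid:.4f}")
```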

The Critical Connection: Cross-Entropy IS Negative Log-Likelihood

Now consider a classifier that outputs $\hat{p}_i = f_\theta(x_i)$ — a predicted probability of class 1 for each sample. The binary cross-entropy loss is:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

Compare this to the Bernoulli negative log-likelihood (NLL) where we allow $p$ to vary per sample:

$$-\frac{1}{n} \ell(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log f_\theta(x_i) + (1 - y_i) \log(1 - f_\theta(x_i)) \right]$$

They are identical. Binary cross-entropy loss is the negative log-likelihood of a Bernoulli model. Minimizing cross-entropy is maximum likelihood estimation.

This extends to the multi-class case. Categorical cross-entropy:

$$\mathcal{L}_{\text{CCE}} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{p}_{ik}$$

is the negative log-likelihood of a categorical model, where $\hat{p}_{ik} = \text{softmax}(z_i)_k$.

And mean squared error:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

is the negative log-likelihood (up to a constant) of a Gaussian model $y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$.

Common Misconception: "I chose MSE for regression and cross-entropy for classification because that is what everyone uses." This is correct practice but wrong reasoning. The right reasoning: MSE assumes Gaussian noise on the target; cross-entropy assumes a Bernoulli (or categorical) target. If your regression target has heavy-tailed noise, MSE (Gaussian assumption) is the wrong loss — you should use a loss derived from a more appropriate distribution (e.g., Laplace distribution yields $L_1$ loss). The choice of loss function is a modeling assumption about the data-generating process.
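The misconception box's point can be demonstrated directly. A sketch under Laplace noise, in which the $L_1$ minimizer (the sample median, which is the MLE of a Laplace location parameter) estimates the location with lower variance than the $L_2$ minimizer (the sample mean); all numbers are illustrative:

```python
import numpy as np

# Under heavy-tailed (Laplace) noise, the median — the minimizer of
# sum |y_i - c| — outperforms the mean, the minimizer of sum (y_i - c)^2.
rng = np.random.default_rng(3)
true_loc = 2.0
n_trials, n = 2_000, 200

means, medians = [], []
for _ in range(n_trials):
    y = true_loc + rng.laplace(loc=0.0, scale=1.0, size=n)
    means.append(y.mean())        # L2 / Gaussian-MLE estimate
    medians.append(np.median(y))  # L1 / Laplace-MLE estimate

print(f"std of mean estimator:   {np.std(means):.4f}")
print(f"std of median estimator: {np.std(medians):.4f}")
```

The loss that matches the noise distribution yields the tighter estimator — the practical payoff of treating loss choice as a modeling assumption.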

import numpy as np
import torch
import torch.nn.functional as F

def bernoulli_nll_manual(
    y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12
) -> float:
    """Compute Bernoulli negative log-likelihood manually.

    Args:
        y_true: Binary labels (0 or 1), shape (n,).
        y_pred: Predicted probabilities, shape (n,).
        eps: Small constant for numerical stability.

    Returns:
        Mean negative log-likelihood.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    nll = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return nll.mean()

# Generate synthetic data
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.3, size=n)
y_pred = np.random.beta(2, 5, size=n)  # simulated model outputs

# Our manual NLL
nll_manual = bernoulli_nll_manual(y_true, y_pred)

# PyTorch's BCE loss
y_true_t = torch.tensor(y_true, dtype=torch.float32)
y_pred_t = torch.tensor(y_pred, dtype=torch.float32)
bce_pytorch = F.binary_cross_entropy(y_pred_t, y_true_t).item()

print(f"Manual Bernoulli NLL:  {nll_manual:.6f}")
print(f"PyTorch BCE loss:      {bce_pytorch:.6f}")
print(f"Difference:            {abs(nll_manual - bce_pytorch):.2e}")
# They match — because BCE IS the Bernoulli NLL

MLE for the Gaussian: Deriving MSE

For completeness, we derive the MLE for the Gaussian to show that MSE falls out naturally.

Given $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$:

$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Taking $\frac{\partial \ell}{\partial \mu} = 0$:

$$\frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu}_{\text{MLE}} = \bar{x}$$

Taking $\frac{\partial \ell}{\partial \sigma^2} = 0$:

$$-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$

When we use a neural network $f_\theta(x_i)$ as the mean parameter, the MLE reduces to minimizing $\sum_i (y_i - f_\theta(x_i))^2$ — which is MSE.
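A quick verification of the closed forms against SciPy's numerical MLE (`stats.norm.fit` returns the maximum likelihood `loc` and `scale`):

```python
import numpy as np
from scipy import stats

# The closed-form Gaussian MLEs (sample mean; biased variance, dividing by n)
# should match scipy's fit. Data parameters here are illustrative.
rng = np.random.default_rng(11)
x = rng.normal(loc=3.0, scale=1.5, size=10_000)

mu_hat = x.mean()                                 # MLE for mu
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())   # MLE for sigma (biased)

loc_fit, scale_fit = stats.norm.fit(x)
print(f"closed form: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"scipy fit:   mu = {loc_fit:.4f}, sigma = {scale_fit:.4f}")
```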


3.8 Maximum A Posteriori (MAP) Estimation and Regularization

The maximum a posteriori estimate incorporates a prior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log p(D \mid \theta) + \log p(\theta) \right]$$

MAP is a bridge between MLE and full Bayesian inference. It uses the prior but returns a point estimate rather than a full posterior distribution.

Regularization as a Prior

The connection between MAP estimation and regularization is one of the most illuminating results in ML theory:

$L_2$ regularization = Gaussian prior. If $p(\theta) = \mathcal{N}(0, \sigma_0^2 I)$, then:

$$\log p(\theta) = -\frac{\|\theta\|_2^2}{2\sigma_0^2} + \text{const}$$

The MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\ell(\theta) + \frac{\lambda}{2}\|\theta\|_2^2 \right]$$

where $\lambda = 1/\sigma_0^2$. This is exactly the $L_2$-regularized loss that every ML practitioner uses. The regularization strength $\lambda$ controls the precision (inverse variance) of the prior — a large $\lambda$ means a tight prior concentrated at zero, expressing a strong belief that parameters should be small.
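To see the prior acting as shrinkage, consider the simplest case: estimating a Gaussian mean under a zero-mean Gaussian prior. A sketch with illustrative values, where minimizing $\sum_i (x_i - \theta)^2 / (2\sigma^2) + \theta^2 / (2\sigma_0^2)$ gives a MAP estimate equal to the sample mean shrunk toward zero by a factor $n / (n + \lambda)$, $\lambda = \sigma^2 / \sigma_0^2$:

```python
import numpy as np

# MAP for a Gaussian mean with a N(0, sigma0^2) prior: shrinkage toward zero.
rng = np.random.default_rng(5)
sigma, sigma0 = 1.0, 0.5
n = 20
x = rng.normal(loc=1.0, scale=sigma, size=n)

lam = sigma**2 / sigma0**2          # regularization strength from the prior
theta_mle = x.mean()                # no prior: sample mean
theta_map = n * x.mean() / (n + lam)  # closed-form MAP estimate

print(f"MLE (sample mean): {theta_mle:.4f}")
print(f"MAP (shrunk):      {theta_map:.4f}")
# As sigma0 grows (flat prior), lam -> 0 and MAP -> MLE;
# as sigma0 shrinks (tight prior), MAP -> 0.
```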

$L_1$ regularization = Laplace prior. If $p(\theta_j) \propto \exp(-|\theta_j| / b)$, then:

$$\log p(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const}$$

The Laplace prior's sharp peak at zero explains why $L_1$ regularization promotes sparsity — the prior actively pushes parameters toward exactly zero, and many remain there.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

theta = np.linspace(-4, 4, 1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# L2 regularization = Gaussian prior
for sigma in [0.5, 1.0, 2.0]:
    lam = 1.0 / sigma**2
    axes[0].plot(theta, stats.norm.pdf(theta, 0, sigma),
                 label=f"σ={sigma} (λ={lam:.1f})")
axes[0].set_title("Gaussian Prior → L₂ Regularization")
axes[0].set_xlabel(r"$\theta$")
axes[0].legend()

# L1 regularization = Laplace prior
for b in [0.5, 1.0, 2.0]:
    axes[1].plot(theta, stats.laplace.pdf(theta, 0, b),
                 label=f"b={b}")
axes[1].set_title("Laplace Prior → L₁ Regularization (Sparsity)")
axes[1].set_xlabel(r"$\theta$")
axes[1].legend()

plt.tight_layout()

Production Reality: In practice, the MAP interpretation of regularization is more than a theoretical curiosity. When you choose $\lambda$ via cross-validation, you are implicitly selecting the prior variance that best explains the validation data. When you use weight decay in Adam (AdamW), you are applying a Gaussian prior that is decoupled from the adaptive learning rate — and this decoupling matters because standard Adam applies the prior incorrectly (Loshchilov and Hutter, 2019).


3.9 Fisher Information and the Cramér-Rao Bound

Fisher Information

The Fisher information measures how much information the data carries about a parameter $\theta$. Formally:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log p(X \mid \theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right]$$

The two expressions are equivalent under regularity conditions. Intuitively, Fisher information is high when the log-likelihood is sharply curved around the true parameter — meaning the data is highly informative about $\theta$.

For a vector parameter $\boldsymbol{\theta} \in \mathbb{R}^d$, the Fisher information matrix is:

$$[\mathbf{I}(\boldsymbol{\theta})]_{ij} = -\mathbb{E}\left[\frac{\partial^2 \log p(X \mid \boldsymbol{\theta})}{\partial \theta_i \partial \theta_j}\right]$$

This is the expected negative Hessian of the log-likelihood — connecting directly to the Hessian analysis from Chapter 2.
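The equivalence of the two forms can be verified numerically. A sketch for the Bernoulli model, where both expectations reduce to the analytical value $1/(p(1-p))$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=1_000_000).astype(float)

# Form 1: expected squared score (derivative of the log-likelihood)
score = x / p - (1 - x) / (1 - p)
fisher_score = np.mean(score**2)

# Form 2: expected negative second derivative of the log-likelihood
second_deriv = -x / p**2 - (1 - x) / (1 - p)**2
fisher_hessian = -np.mean(second_deriv)

exact = 1.0 / (p * (1 - p))  # analytical Fisher information
print(fisher_score, fisher_hessian, exact)  # all three ≈ 4.76
```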

The Cramér-Rao Lower Bound

For any unbiased estimator $\hat{\theta}$ of $\theta$:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n \cdot I(\theta)}$$

This is a fundamental limit: no unbiased estimator can have lower variance than $1 / (n \cdot I(\theta))$. An estimator that achieves this bound is called efficient.

The MLE is asymptotically efficient: as $n \to \infty$, $\hat{\theta}_{\text{MLE}} \sim \mathcal{N}\left(\theta, \frac{1}{n \cdot I(\theta)}\right)$. This means the MLE extracts the maximum possible information from the data — another reason why MLE (and its equivalent loss functions) are the default in ML.

import numpy as np

def fisher_information_bernoulli(p: float) -> float:
    """Compute Fisher information for a Bernoulli(p) model.

    I(p) = 1 / (p * (1-p))

    Args:
        p: Bernoulli parameter.

    Returns:
        Fisher information.
    """
    return 1.0 / (p * (1 - p))

def cramer_rao_bound(n: int, p: float) -> float:
    """Compute Cramér-Rao lower bound for Bernoulli MLE variance.

    Args:
        n: Sample size.
        p: True parameter.

    Returns:
        Lower bound on variance of any unbiased estimator.
    """
    return 1.0 / (n * fisher_information_bernoulli(p))

# StreamRec: How many observations do we need for precise CTR estimates?
true_ctr = 0.05  # 5% click-through rate
for n in [100, 1_000, 10_000, 100_000]:
    crb = cramer_rao_bound(n, true_ctr)
    std = np.sqrt(crb)
    # 95% CI width ≈ 2 * 1.96 * std
    ci_width = 2 * 1.96 * std
    print(f"n={n:>7,d}: CR bound std = {std:.6f}, "
          f"95% CI width = {ci_width:.6f} "
          f"({ci_width/true_ctr*100:.1f}% of CTR)")
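
A complementary simulation (a sketch using the same Bernoulli setup) confirms that the Bernoulli MLE $\hat{p} = \bar{X}$ attains the bound exactly at every $n$, since $\text{Var}(\hat{p}) = p(1-p)/n = 1/(n \cdot I(p))$:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.05
n = 1_000
n_trials = 20_000

# Empirical variance of the MLE over many repeated samples of size n
p_hats = rng.binomial(n, true_p, size=n_trials) / n
empirical_var = p_hats.var()

# Cramér-Rao bound: p(1-p)/n, attained exactly by the Bernoulli MLE
cr_bound = true_p * (1 - true_p) / n

print(f"Empirical Var(p_hat): {empirical_var:.3e}")
print(f"Cramér-Rao bound:     {cr_bound:.3e}")
```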

Research Insight: Fisher information appears throughout modern ML. Natural gradient descent (Amari, 1998) preconditions the gradient by the inverse Fisher information matrix, making optimization invariant to the parameterization of the model. The Fisher information matrix of a neural network is intractable to compute exactly, but approximations (K-FAC, empirical Fisher) are used in second-order optimizers and in continual learning (Elastic Weight Consolidation uses the diagonal Fisher to identify "important" parameters).


3.10 Frequentist vs. Bayesian Inference: A Practical Guide

The distinction between frequentist and Bayesian inference is not merely philosophical — it has practical consequences for how you build systems, quantify uncertainty, and communicate results.

The Frequentist Framework

In frequentist inference, parameters $\theta$ are fixed (but unknown) constants. Probability statements are about the data — specifically, about hypothetical repeated sampling.

Confidence intervals have a specific, often misunderstood interpretation. A 95% confidence interval means: if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter. It does not mean "there is a 95% probability that $\theta$ is in this interval." The true $\theta$ is either in the interval or it is not — there is no probability about it in the frequentist framework.

Key tools: Hypothesis testing ($p$-values, test statistics), confidence intervals, maximum likelihood estimation, the bootstrap.
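The repeated-sampling interpretation can be made concrete with a simulation (a sketch using a Gaussian population with illustrative parameters): construct many 95% Wald intervals and count how often they cover the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mu, sigma, n = 5.0, 2.0, 100
n_experiments = 10_000
z = stats.norm.ppf(0.975)  # 1.96 for 95% confidence

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - z * se, sample.mean() + z * se
    covered += (lo <= true_mu <= hi)  # did this interval capture mu?

print(f"Coverage: {covered / n_experiments:.3f}")  # ≈ 0.95
```

Roughly 95% of the 10,000 intervals cover $\mu$, but for any single constructed interval, the frequentist framework assigns no probability to the event "$\mu$ is inside".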

The Bayesian Framework

In Bayesian inference, parameters $\theta$ are random variables with distributions. We begin with a prior $p(\theta)$, observe data $D$, and compute the posterior $p(\theta \mid D)$ via Bayes' theorem.

Credible intervals have a more intuitive interpretation: a 95% credible interval $[a, b]$ means $P(a \leq \theta \leq b \mid D) = 0.95$. Given the observed data and the prior, there is a 95% probability that $\theta$ falls in this interval.

Key tools: Bayes' theorem, conjugate priors, MCMC, variational inference, posterior predictive checks.

When to Use Which

The right question is not "which philosophy is correct?" but "which gives a more useful answer for this specific problem?"

| Criterion | Frequentist | Bayesian |
|---|---|---|
| Prior knowledge available | Less natural to incorporate | Natural: encode as prior |
| Small sample size | Unreliable (CLT-based methods break down) | Prior regularizes estimation |
| Computational cost | Usually cheap (closed-form or fast algorithms) | Can be expensive (MCMC, VI) |
| Interpretability of uncertainty | Confidence intervals: awkward | Credible intervals: intuitive |
| Regulatory setting | Often required (FDA, clinical trials) | Growing acceptance |
| Sequential updating | Requires correction (multiple testing) | Natural: today's posterior is tomorrow's prior |
| High-dimensional parameters | MLE works well | Prior essential for regularization |

Common Misconception: "Bayesian methods are always better because they give you a full posterior." Not always. When you have millions of data points (as in most deep learning applications), the prior is overwhelmed by the likelihood, and the posterior concentrates around the MLE. In this regime, the computational cost of Bayesian inference is not justified by meaningfully different conclusions. Bayesian methods shine when data is scarce, prior knowledge is strong, or uncertainty quantification is critical — exactly the settings in clinical trials (MediCore) and climate modeling (Pacific Climate Research Consortium).

import numpy as np
from scipy import stats

def frequentist_ci(
    successes: int, n: int, confidence: float = 0.95
) -> tuple[float, float]:
    """Compute Wald confidence interval for a proportion.

    Args:
        successes: Number of successes.
        n: Number of trials.
        confidence: Confidence level.

    Returns:
        Tuple of (lower, upper) confidence bounds.
    """
    p_hat = successes / n
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

def bayesian_credible_interval(
    successes: int,
    n: int,
    prior_alpha: float = 1.0,
    prior_beta: float = 1.0,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """Compute Bayesian credible interval using Beta-Binomial conjugacy.

    Args:
        successes: Number of successes.
        n: Number of trials.
        prior_alpha: Alpha parameter of Beta prior.
        prior_beta: Beta parameter of Beta prior.
        confidence: Credible level.

    Returns:
        Tuple of (lower, upper) credible bounds.
    """
    post_alpha = prior_alpha + successes
    post_beta = prior_beta + (n - successes)
    lower = stats.beta.ppf((1 - confidence) / 2, post_alpha, post_beta)
    upper = stats.beta.ppf(1 - (1 - confidence) / 2, post_alpha, post_beta)
    return (lower, upper)

# Compare: StreamRec CTR estimation with small and large samples
for n, k in [(20, 3), (200, 30), (2000, 300), (20000, 3000)]:
    freq_ci = frequentist_ci(k, n)
    # Informative prior: we believe CTR ≈ 0.12 based on historical data
    bayes_ci = bayesian_credible_interval(k, n, prior_alpha=12, prior_beta=88)
    # Uninformative prior
    bayes_ci_uninf = bayesian_credible_interval(k, n, prior_alpha=1, prior_beta=1)

    print(f"n={n:>5,d}, k={k:>4,d} (p̂={k/n:.3f})")
    print(f"  Frequentist 95% CI:                [{freq_ci[0]:.4f}, {freq_ci[1]:.4f}]")
    print(f"  Bayesian (informative prior) 95% CR: [{bayes_ci[0]:.4f}, {bayes_ci[1]:.4f}]")
    print(f"  Bayesian (uniform prior) 95% CR:     [{bayes_ci_uninf[0]:.4f}, {bayes_ci_uninf[1]:.4f}]")
    print()

3.11 Asymptotic Theory: CLT, LLN, and Concentration Inequalities

Understanding how estimators behave as sample size grows is essential for ML system design. How many samples do you need for a reliable A/B test? How quickly does SGD converge? When can you trust the normal approximation?

The Law of Large Numbers (LLN)

Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[X_i] = \mu$. The strong law of large numbers states:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{\text{a.s.}} \mu$$

In words: the sample mean converges to the population mean almost surely. This guarantees that Monte Carlo estimates, sample proportions, and empirical loss values converge to their true values with enough data.
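A short simulation (illustrative parameters, echoing the StreamRec session-length example) shows the running sample mean settling onto $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 8.0  # e.g. mean session length in minutes
x = rng.exponential(scale=mu, size=100_000)

# Running mean X_bar_n for every prefix of the data
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n={n:>7,d}: mean = {running_mean[n - 1]:.4f} "
          f"(|error| = {abs(running_mean[n - 1] - mu):.4f})")
```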

The Central Limit Theorem (CLT)

Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Then:

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

Equivalently:

$$\bar{X}_n \stackrel{\text{approx.}}{\sim} \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$

The CLT is the justification for Gaussian-based confidence intervals, $z$-tests, and the asymptotic normality of the MLE. It is also the reason that averaging over mini-batches in SGD produces approximately Gaussian gradient estimates.

import numpy as np
import matplotlib.pyplot as plt

def demonstrate_clt(
    distribution: str = "exponential",
    true_mean: float = 2.0,
    sample_sizes: list[int] | None = None,
    n_simulations: int = 10_000,
) -> None:
    """Demonstrate the Central Limit Theorem visually.

    Args:
        distribution: Base distribution ("exponential", "bernoulli", "poisson").
        true_mean: Mean of the distribution.
        sample_sizes: List of sample sizes to demonstrate.
        n_simulations: Number of simulated sample means.
    """
    if sample_sizes is None:
        sample_sizes = [1, 5, 30, 100]

    fig, axes = plt.subplots(1, len(sample_sizes), figsize=(16, 3.5))

    for ax, n in zip(axes, sample_sizes):
        means = []
        for _ in range(n_simulations):
            if distribution == "exponential":
                sample = np.random.exponential(scale=true_mean, size=n)
            elif distribution == "bernoulli":
                sample = np.random.binomial(1, true_mean, size=n)
            elif distribution == "poisson":
                sample = np.random.poisson(lam=true_mean, size=n)
            else:
                raise ValueError(f"Unknown distribution: {distribution}")
            means.append(sample.mean())

        means = np.array(means)
        ax.hist(means, bins=50, density=True, alpha=0.7, color='steelblue')

        # Overlay the CLT Gaussian approximation
        if distribution == "exponential":
            sigma_sq = true_mean**2  # Var of Exp(lambda) = 1/lambda^2
        elif distribution == "bernoulli":
            sigma_sq = true_mean * (1 - true_mean)
        elif distribution == "poisson":
            sigma_sq = true_mean

        x = np.linspace(means.min(), means.max(), 200)
        clt_pdf = (1 / np.sqrt(2 * np.pi * sigma_sq / n)) * np.exp(
            -0.5 * (x - true_mean)**2 / (sigma_sq / n)
        )
        ax.plot(x, clt_pdf, 'r-', linewidth=2, label='CLT approx.')
        ax.set_title(f"n = {n}")
        ax.legend(fontsize=8)

    fig.suptitle(
        f"CLT Demonstration: Sample Mean of {distribution.title()} "
        f"(μ={true_mean})",
        fontsize=12
    )
    plt.tight_layout()

# StreamRec: distribution of average session lengths (exponential base)
demonstrate_clt(distribution="exponential", true_mean=8.0)

Concentration Inequalities: Finite-Sample Guarantees

The CLT is an asymptotic result — it tells us what happens as $n \to \infty$. But in ML, we need finite-sample guarantees. How many users must be in an A/B test before we can detect a 2% improvement? Concentration inequalities provide non-asymptotic answers.

Hoeffding's inequality: For bounded i.i.d. random variables $X_i \in [a, b]$ with $\mathbb{E}[X_i] = \mu$:

$$P\left(|\bar{X}_n - \mu| \geq t\right) \leq 2 \exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$

This is a finite-sample bound, valid for any $n$ and $t$. It does not require the CLT's Gaussian approximation.
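A quick Monte Carlo check (a sketch with Bernoulli(0.5) variables, so $[a, b] = [0, 1]$) shows the bound holding, though often loosely:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, p = 500, 0.05, 0.5
n_trials = 100_000

# Empirical probability that the sample mean deviates from p by >= t
means = rng.binomial(n, p, size=n_trials) / n
empirical = np.mean(np.abs(means - p) >= t)

# Hoeffding upper bound: 2 exp(-2 n t^2) with (b - a) = 1
hoeffding = 2 * np.exp(-2 * n * t**2)

print(f"Empirical P(|mean - p| >= {t}): {empirical:.4f}")
print(f"Hoeffding bound:               {hoeffding:.4f}")
```

The looseness is the price of generality: Hoeffding holds for every bounded distribution, not just this one.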

Application to A/B testing at StreamRec: Suppose engagement rates are bounded in $[0, 1]$ and we want to detect a difference of $\delta = 0.02$ with probability at least $1 - \alpha = 0.95$. Hoeffding's inequality (applied to both groups) gives:

$$n \geq \frac{(b-a)^2 \log(2/\alpha)}{2\delta^2} = \frac{1 \cdot \log(40)}{2 \cdot 0.0004} \approx 4{,}612$$

per group. In practice, CLT-based power calculations give much tighter requirements (roughly 900 per group for this scenario when σ = 0.15, as the code below computes), but Hoeffding's inequality provides a distribution-free guarantee.

import numpy as np

def hoeffding_sample_size(
    delta: float,
    alpha: float = 0.05,
    a: float = 0.0,
    b: float = 1.0,
) -> int:
    """Compute required sample size using Hoeffding's inequality.

    For detecting a difference of delta from the mean with
    probability at least 1 - alpha.

    Args:
        delta: Minimum detectable difference.
        alpha: Failure probability.
        a: Lower bound of random variable.
        b: Upper bound of random variable.

    Returns:
        Required sample size (per group).
    """
    return int(np.ceil(
        (b - a)**2 * np.log(2 / alpha) / (2 * delta**2)
    ))

def clt_sample_size(
    delta: float,
    sigma: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Compute required sample size using CLT-based power analysis.

    Two-sample z-test for difference in means.

    Args:
        delta: Minimum detectable difference.
        sigma: Standard deviation (assumed equal in both groups).
        alpha: Significance level.
        power: Desired statistical power.

    Returns:
        Required sample size (per group).
    """
    from scipy import stats
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2))

# StreamRec A/B test: detecting a 2% engagement lift
delta = 0.02
sigma = 0.15  # estimated std of engagement metric

n_hoeffding = hoeffding_sample_size(delta, alpha=0.05)
n_clt = clt_sample_size(delta, sigma, alpha=0.05, power=0.80)

print(f"Hoeffding bound:  n ≥ {n_hoeffding:,} per group (distribution-free)")
print(f"CLT power calc:   n ≥ {n_clt:,} per group (assumes σ={sigma})")
print(f"Hoeffding is {n_hoeffding / n_clt:.1f}x more conservative")

Advanced Sidebar: Beyond Hoeffding, sharper inequalities exist for specific settings. Bernstein's inequality improves on Hoeffding when the variance is small relative to the range. McDiarmid's inequality extends to functions of independent random variables (not just sums). Matrix concentration inequalities (Tropp, 2012) handle random matrices — critical for analyzing random projections, compressed sensing, and random feature maps. These tools are the mathematical backbone of PAC learning theory and high-dimensional statistics.


3.12 Monte Carlo Methods: Computing the Intractable

Many quantities in ML and Bayesian inference involve integrals that have no closed-form solution:

  • The evidence $p(D) = \int p(D \mid \theta) p(\theta) \, d\theta$ in Bayesian model comparison
  • The posterior mean $\mathbb{E}[\theta \mid D] = \int \theta \, p(\theta \mid D) \, d\theta$
  • The expected loss under model uncertainty

Monte Carlo methods convert these integrals into averages over random samples.

Basic Monte Carlo Estimation

To estimate $I = \mathbb{E}_{p}[f(X)] = \int f(x) p(x) \, dx$:

  1. Draw $X_1, X_2, \ldots, X_N \sim p(x)$
  2. Compute $\hat{I}_N = \frac{1}{N}\sum_{i=1}^N f(X_i)$

By the LLN, $\hat{I}_N \to I$ as $N \to \infty$. By the CLT, the error is approximately $\mathcal{N}(0, \text{Var}(f(X)) / N)$ — the standard error decreases as $1/\sqrt{N}$.

import numpy as np

def monte_carlo_estimate(
    f: callable,
    sampler: callable,
    n_samples: int = 100_000,
) -> tuple[float, float]:
    """Estimate E[f(X)] using Monte Carlo sampling.

    Args:
        f: Function to compute expectation of.
        sampler: Function that returns n_samples from p(x).
        n_samples: Number of Monte Carlo samples.

    Returns:
        Tuple of (estimate, standard_error).
    """
    samples = sampler(n_samples)
    f_values = f(samples)
    estimate = np.mean(f_values)
    std_error = np.std(f_values, ddof=1) / np.sqrt(n_samples)
    return estimate, std_error

# Example: Estimate E[X^2] where X ~ Beta(2, 5)
# Analytical: E[X^2] = Var(X) + E[X]^2 = (a*b)/((a+b)^2*(a+b+1)) + (a/(a+b))^2
a, b = 2, 5
analytical = (a * b) / ((a + b)**2 * (a + b + 1)) + (a / (a + b))**2

estimate, se = monte_carlo_estimate(
    f=lambda x: x**2,
    sampler=lambda n: np.random.beta(a, b, size=n),
    n_samples=100_000,
)

print(f"Analytical E[X²]:   {analytical:.6f}")
print(f"Monte Carlo:        {estimate:.6f} ± {se:.6f}")
print(f"Error:              {abs(estimate - analytical):.6f}")
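
The $1/\sqrt{N}$ rate means quadrupling the sample count roughly halves the standard error. A sketch verifying this scaling for the same Beta(2, 5) target:

```python
import numpy as np

rng = np.random.default_rng(0)

ses = []
for n in [1_000, 4_000, 16_000, 64_000]:
    f_vals = rng.beta(2, 5, size=n) ** 2  # f(X) = X^2
    se = f_vals.std(ddof=1) / np.sqrt(n)  # estimated standard error
    ses.append(se)
    print(f"N={n:>6,d}: estimate = {f_vals.mean():.5f}, SE = {se:.6f}")
```

Each quadrupling of $N$ cuts the standard error roughly in half, which is why Monte Carlo is reliable but slow to refine: one extra decimal digit of accuracy costs 100x the samples.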

Importance Sampling

When we cannot sample from $p(x)$ directly, or when $p(x)$ concentrates in a region where $f(x)$ is small (leading to high variance), we can sample from a proposal distribution $q(x)$ instead:

$$\mathbb{E}_p[f(X)] = \int f(x) p(x) \, dx = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx = \mathbb{E}_q\left[f(X) \frac{p(X)}{q(X)}\right]$$

The ratio $w(x) = p(x) / q(x)$ is called the importance weight. The estimator becomes:

$$\hat{I}_N = \frac{1}{N}\sum_{i=1}^N f(X_i) w(X_i), \quad X_i \sim q(x)$$

Importance sampling is powerful but fragile: if $q(x)$ assigns very low probability to regions where $p(x) f(x)$ is large, the importance weights become extreme and the variance explodes. A good proposal $q$ should have heavier tails than $p$ in regions where $f$ is large.

import numpy as np
from scipy import stats

def importance_sampling_estimate(
    f: callable,
    log_p: callable,
    log_q: callable,
    q_sampler: callable,
    n_samples: int = 100_000,
) -> tuple[float, float]:
    """Estimate E_p[f(X)] using importance sampling with proposal q.

    Args:
        f: Function to compute expectation of under p.
        log_p: Log-density of target distribution (up to constant).
        log_q: Log-density of proposal distribution.
        q_sampler: Function that draws n samples from q.
        n_samples: Number of importance samples.

    Returns:
        Tuple of (estimate, standard_error).
    """
    samples = q_sampler(n_samples)
    log_weights = log_p(samples) - log_q(samples)
    # Normalize weights for self-normalized importance sampling
    log_weights -= np.max(log_weights)  # numerical stability
    weights = np.exp(log_weights)
    weights /= weights.sum()

    f_values = f(samples)
    estimate = np.sum(weights * f_values)
    # Effective sample size (diagnostic for weight degeneracy)
    ess = 1.0 / np.sum(weights**2)

    return estimate, ess

# Example: Estimate E[X] where X ~ truncated normal on [2, inf)
# using a normal proposal
mu_target, sigma_target = 3.0, 1.0
# Analytical: E[X] for truncated normal
from scipy.stats import truncnorm
a_trunc = (2.0 - mu_target) / sigma_target  # lower bound in std units
rv_trunc = truncnorm(a=a_trunc, b=np.inf, loc=mu_target, scale=sigma_target)
analytical_mean = rv_trunc.mean()

# Importance sampling with N(3, 2) proposal
mu_q, sigma_q = 3.0, 2.0

estimate, ess = importance_sampling_estimate(
    f=lambda x: x,
    log_p=lambda x: np.where(
        x >= 2,
        stats.norm.logpdf(x, mu_target, sigma_target),
        -np.inf
    ),
    log_q=lambda x: stats.norm.logpdf(x, mu_q, sigma_q),
    q_sampler=lambda n: np.random.normal(mu_q, sigma_q, n),
    n_samples=100_000,
)

print(f"Analytical E[X]:           {analytical_mean:.6f}")
print(f"Importance sampling:       {estimate:.6f}")
print(f"Effective sample size:     {ess:.0f} / 100,000")

The Bootstrap: Resampling for Uncertainty

The bootstrap is a computational method for estimating the sampling distribution of a statistic without making distributional assumptions.

Algorithm:

  1. Given data $\mathbf{x} = (x_1, \ldots, x_n)$, draw $B$ bootstrap samples, each of size $n$, sampled with replacement.
  2. Compute the statistic of interest $\hat{\theta}_b^*$ on each bootstrap sample.
  3. The distribution of $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ approximates the sampling distribution of $\hat{\theta}$.

Bootstrap confidence intervals: the simplest approach (the "percentile method") uses the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap distribution. More sophisticated methods (BCa, bias-corrected) improve coverage in small samples.

import numpy as np

def bootstrap_ci(
    data: np.ndarray,
    statistic: callable,
    n_bootstrap: int = 10_000,
    confidence: float = 0.95,
    random_state: int = 42,
) -> tuple[float, float, float]:
    """Compute bootstrap confidence interval for a statistic.

    Args:
        data: Observed data array.
        statistic: Function that computes the statistic from data.
        n_bootstrap: Number of bootstrap resamples.
        confidence: Confidence level.
        random_state: Random seed for reproducibility.

    Returns:
        Tuple of (point_estimate, lower_bound, upper_bound).
    """
    rng = np.random.RandomState(random_state)
    n = len(data)
    point_estimate = statistic(data)

    bootstrap_stats = np.array([
        statistic(data[rng.choice(n, size=n, replace=True)])
        for _ in range(n_bootstrap)
    ])

    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    return point_estimate, lower, upper

# StreamRec: Bootstrap CI for median session length and 90th percentile
np.random.seed(42)
session_lengths = np.random.exponential(scale=8.0, size=500)

# Median session length
median_est, median_lo, median_hi = bootstrap_ci(
    session_lengths, statistic=np.median
)
print(f"Median session length: {median_est:.2f} min")
print(f"95% Bootstrap CI: [{median_lo:.2f}, {median_hi:.2f}]")

# 90th percentile (how long are "power user" sessions?)
p90_est, p90_lo, p90_hi = bootstrap_ci(
    session_lengths, statistic=lambda x: np.percentile(x, 90)
)
print(f"\n90th percentile: {p90_est:.2f} min")
print(f"95% Bootstrap CI: [{p90_lo:.2f}, {p90_hi:.2f}]")

# Engagement rate (proportion)
clicks = np.random.binomial(1, 0.12, size=500)
ctr_est, ctr_lo, ctr_hi = bootstrap_ci(clicks, statistic=np.mean)
print(f"\nCTR: {ctr_est:.4f}")
print(f"95% Bootstrap CI: [{ctr_lo:.4f}, {ctr_hi:.4f}]")

Implementation Note: The bootstrap is distribution-free but computationally intensive. For large datasets, the "bag of little bootstraps" (Kleiner et al., 2014) combines subsampling with the bootstrap to scale to billions of records. In practice, $B = 10{,}000$ bootstrap samples is sufficient for stable 95% confidence intervals. For higher confidence levels (99.9%, as in some financial applications), increase to $B = 100{,}000$ or more.


3.13 Progressive Project: Bernoulli MLE and Bootstrap CIs for StreamRec

This section integrates the chapter's theory into the StreamRec recommendation system project.

Task 1: Derive and Implement the Bernoulli MLE

We model user engagement as a Bernoulli random variable: for each user-item pair $(u, i)$, the outcome $y_{ui} \in \{0, 1\}$ indicates whether the user engaged (clicked, completed, or rated) with the item.
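The derivation behind Task 1 is short. For i.i.d. outcomes $y_1, \ldots, y_n \sim \text{Bernoulli}(p)$, the log-likelihood is

$$\ell(p) = \sum_{i=1}^n \left[ y_i \log p + (1 - y_i) \log(1 - p) \right]$$

Setting the derivative to zero and solving:

$$\frac{d\ell}{dp} = \frac{\sum_i y_i}{p} - \frac{n - \sum_i y_i}{1 - p} = 0 \quad \Longrightarrow \quad \hat{p}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n y_i$$

The MLE is simply the empirical click-through rate, which the code below computes.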

import numpy as np
from scipy import stats

def simulate_streamrec_engagement(
    n_users: int = 5_000,
    n_items: int = 200,
    n_interactions: int = 50_000,
    base_ctr: float = 0.08,
    random_state: int = 42,
) -> dict[str, np.ndarray]:
    """Simulate StreamRec engagement data.

    Generates user-item interactions with heterogeneous engagement rates
    depending on user activity level and item popularity.

    Args:
        n_users: Number of users.
        n_items: Number of items.
        n_interactions: Total number of interactions to generate.
        base_ctr: Base click-through rate.
        random_state: Random seed.

    Returns:
        Dictionary with 'user_id', 'item_id', 'clicked' arrays.
    """
    rng = np.random.RandomState(random_state)

    # User activity: some users are more engaged than others
    user_activity = rng.beta(2, 5, size=n_users)  # right-skewed
    # Item popularity: some items are inherently more engaging
    item_popularity = rng.beta(2, 3, size=n_items)

    # Sample interactions
    user_ids = rng.choice(n_users, size=n_interactions, p=user_activity / user_activity.sum())
    item_ids = rng.choice(n_items, size=n_interactions)

    # Engagement probability = base_ctr * user_factor * item_factor
    p_engage = np.clip(
        base_ctr * (1 + user_activity[user_ids]) * (1 + item_popularity[item_ids]),
        0, 1
    )
    clicked = rng.binomial(1, p_engage)

    return {
        "user_id": user_ids,
        "item_id": item_ids,
        "clicked": clicked,
    }

# Generate data
data = simulate_streamrec_engagement()
print(f"Total interactions: {len(data['clicked']):,}")
print(f"Overall CTR (MLE): {data['clicked'].mean():.4f}")
print(f"This IS the Bernoulli MLE: p̂ = Σxᵢ / n = {data['clicked'].sum()} / {len(data['clicked'])}")

Task 2: Show MLE Equivalence to Cross-Entropy Minimization

import numpy as np

def negative_log_likelihood(p: float, data: np.ndarray) -> float:
    """Compute negative log-likelihood for Bernoulli model.

    Args:
        p: Bernoulli parameter (probability of success).
        data: Array of binary outcomes.

    Returns:
        Negative log-likelihood.
    """
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(data * np.log(p) + (1 - data) * np.log(1 - p))

def binary_cross_entropy(p: float, data: np.ndarray) -> float:
    """Compute binary cross-entropy (same formula, different name).

    Args:
        p: Predicted probability.
        data: True binary labels.

    Returns:
        Binary cross-entropy.
    """
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(data * np.log(p) + (1 - data) * np.log(1 - p))

# They are the same function
clicked = data["clicked"]
p_values = np.linspace(0.01, 0.99, 1000)
nll_values = [negative_log_likelihood(p, clicked) for p in p_values]
bce_values = [binary_cross_entropy(p, clicked) for p in p_values]

p_mle = clicked.mean()
p_min_nll = p_values[np.argmin(nll_values)]
p_min_bce = p_values[np.argmin(bce_values)]

print(f"MLE (analytical):              {p_mle:.6f}")
print(f"Minimizer of NLL (grid):       {p_min_nll:.6f}")
print(f"Minimizer of BCE (grid):       {p_min_bce:.6f}")
print(f"NLL == BCE everywhere:         {np.allclose(nll_values, bce_values)}")
print()
print("Conclusion: minimizing binary cross-entropy IS maximizing")
print("the Bernoulli log-likelihood. The loss function follows from")
print("the probabilistic model, not from arbitrary design choice.")

Task 3: Bootstrap Confidence Intervals for StreamRec Metrics

import numpy as np

def bootstrap_engagement_analysis(
    data: dict[str, np.ndarray],
    n_bootstrap: int = 10_000,
    confidence: float = 0.95,
    random_state: int = 42,
) -> None:
    """Compute bootstrap CIs for key StreamRec engagement metrics.

    Args:
        data: Dictionary with 'user_id', 'item_id', 'clicked' arrays.
        n_bootstrap: Number of bootstrap resamples.
        confidence: Confidence level.
        random_state: Random seed.
    """
    rng = np.random.RandomState(random_state)
    clicked = data["clicked"]
    n = len(clicked)
    alpha = 1 - confidence

    # Bootstrap distributions
    boot_ctr = np.zeros(n_bootstrap)
    boot_total_clicks = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        boot_sample = clicked[indices]
        boot_ctr[b] = boot_sample.mean()
        boot_total_clicks[b] = boot_sample.sum()

    # Results
    ctr = clicked.mean()
    ctr_lo, ctr_hi = np.percentile(boot_ctr, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    total = clicked.sum()
    total_lo, total_hi = np.percentile(boot_total_clicks, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    print(f"Overall CTR:        {ctr:.4f}  [{ctr_lo:.4f}, {ctr_hi:.4f}]")
    print(f"Total clicks:       {total:,.0f}  [{total_lo:,.0f}, {total_hi:,.0f}]")

    # Per-item CTR for top items
    unique_items = np.unique(data["item_id"])
    top_items = sorted(unique_items, key=lambda i: data["clicked"][data["item_id"] == i].mean(),
                       reverse=True)[:5]

    print(f"\nTop 5 items by CTR (with bootstrap 95% CIs):")
    for item_id in top_items:
        mask = data["item_id"] == item_id
        item_clicks = data["clicked"][mask]
        item_n = len(item_clicks)
        item_ctr = item_clicks.mean()

        # Bootstrap for this item
        boot_item_ctr = np.array([
            item_clicks[rng.choice(item_n, size=item_n, replace=True)].mean()
            for _ in range(n_bootstrap)
        ])
        lo, hi = np.percentile(boot_item_ctr, [2.5, 97.5])
        print(f"  Item {item_id:>3d}: CTR = {item_ctr:.4f} [{lo:.4f}, {hi:.4f}] (n={item_n})")

bootstrap_engagement_analysis(data)

3.14 Connecting to What Comes Next

This chapter established the probabilistic foundation that the rest of the book builds on. Every concept introduced here will reappear — often in deeper form.

Chapter 4 (Information Theory) will show that the cross-entropy loss, i.e. the negative log-likelihood, equals the KL divergence from the true distribution plus a constant. The chain from probability → MLE → loss function → information theory is the conceptual backbone of Part I.

Part II (Deep Learning) will treat neural networks as parameterized conditional distributions $p_\theta(y \mid x)$, trained by MLE (cross-entropy minimization). The distributions from Section 3.5 determine the output layer and loss function: Bernoulli → sigmoid + BCE, Categorical → softmax + CCE, Gaussian → identity + MSE.

Part III (Causal Inference) will extend conditional probability to causal reasoning. The conditional probability $P(Y \mid X = x)$ tells us about the association between $X$ and $Y$; the interventional distribution $P(Y \mid \text{do}(X = x))$ tells us about the causal effect. This distinction — previewed in the MediCore pharmaceutical example — is the central theme of Part III.

Part IV (Bayesian Methods) will develop the full Bayesian workflow: specifying priors, computing posteriors (via MCMC and variational inference), checking models via posterior predictive checks, and making decisions under uncertainty. The conjugate Beta-Binomial model from Section 3.4 is the simplest case; Part IV handles the general case where conjugacy is unavailable.


Summary

This chapter developed the mathematical language of uncertainty that underlies every ML algorithm. We began with the formal foundation of probability spaces and random variables, then built upward through joint and conditional distributions, Bayes' theorem, the exponential family, and maximum likelihood estimation. The centerpiece result — that common loss functions (MSE, BCE, CCE) are negative log-likelihoods of specific probability distributions — transforms loss function selection from arbitrary convention into principled modeling.

We distinguished frequentist and Bayesian inference not as competing philosophies but as complementary tools, each appropriate in different settings. We developed the asymptotic theory (CLT, LLN, Hoeffding's inequality) that governs how estimators behave and how many samples we need. And we introduced Monte Carlo methods — the computational bridge to problems where closed-form solutions are unavailable.

The probability theory in this chapter is older than computing itself. But it is not merely historical — it is the mathematical infrastructure that every modern ML algorithm runs on. Understanding this infrastructure deeply enough to derive it, rather than merely use it, is what enables the senior data scientist to design new methods, diagnose failures, and make principled decisions about modeling choices.

The loss function is not a design choice. It is a consequence of the probability model you have — implicitly or explicitly — assumed about your data. The only question is whether you made that assumption consciously.


Next: Chapter 4 — Information Theory for Data Science: Entropy, KL Divergence, and Why Your Loss Function Works