Learning Objectives

  • Apply the axioms of probability and rules of conditional probability to compute event likelihoods
  • Derive and interpret Bayes' theorem in classification and inference tasks
  • Identify and parameterize common probability distributions used in AI models
  • Estimate distribution parameters using maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation
  • Compute entropy, cross-entropy, and KL divergence to quantify information content and distribution differences
  • Explain mutual information and its role in feature selection and representation learning

Chapter 4: Probability, Statistics, and Information Theory

"Probability theory is nothing but common sense reduced to calculation." -- Pierre-Simon Laplace

Chapter Overview

Every time a language model predicts the next word, every time a classifier labels an email as spam, and every time a recommender system ranks items for a user, the machinery of probability theory is at work beneath the surface. If linear algebra (Chapter 2) provides the language for representing data and calculus (Chapter 3) provides the tools for optimization, then probability and statistics provide the framework for reasoning under uncertainty -- and uncertainty is the defining characteristic of every real-world AI problem.

This chapter builds the probabilistic and statistical foundations you will rely on throughout the rest of this book. We begin with the axioms of probability and the rules for combining and conditioning events. We then survey the probability distributions that appear most frequently in AI -- from the humble Bernoulli to the ubiquitous Gaussian. With distributions in hand, we turn to the twin pillars of statistical estimation: maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. Finally, we enter the territory of information theory, where concepts like entropy, cross-entropy, KL divergence, and mutual information provide powerful lenses for understanding learning, compression, and the flow of information through neural networks.

By the end of this chapter, you will not only understand the mathematical definitions but also see concretely how these concepts appear in loss functions, training objectives, generative models, and evaluation metrics throughout AI engineering.

In this chapter, you will learn to:

  • Manipulate probabilities using the sum rule, product rule, and Bayes' theorem
  • Work with discrete and continuous distributions in NumPy
  • Estimate parameters from data using MLE and MAP
  • Compute and interpret entropy, cross-entropy, and KL divergence
  • Apply mutual information to measure statistical dependence between variables


4.1 Foundations of Probability Theory

Before we can build probabilistic models, we need a rigorous yet practical foundation. Probability theory formalizes what it means to reason about uncertain events, and its axioms -- simple as they are -- give rise to all the sophisticated machinery we use in modern AI.

4.1.1 Sample Spaces, Events, and the Axioms of Probability

A sample space $\Omega$ is the set of all possible outcomes of a random experiment. An event $A$ is a subset of $\Omega$. A probability measure $P$ is a function that assigns a real number to each event, satisfying the three Kolmogorov axioms:

  1. Non-negativity: For any event $A$, $P(A) \geq 0$.
  2. Normalization: $P(\Omega) = 1$.
  3. Additivity: For any countable collection of mutually exclusive events $A_1, A_2, \ldots$,

$$ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) $$

From these three axioms, we can derive everything else: $P(\emptyset) = 0$, $P(A^c) = 1 - P(A)$, the inclusion-exclusion principle, and so on.

Intuition: Think of probability as distributing a total "budget" of 1.0 across all possible outcomes. The axioms simply ensure that budgets are non-negative, the total budget is exactly 1, and non-overlapping events get separate allocations that add up correctly.

Example. When rolling a fair six-sided die, the sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event "roll an even number" is $A = \{2, 4, 6\}$, and $P(A) = 3/6 = 0.5$.
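
A quick simulation reinforces the calculation. The sketch below (illustrative only) estimates $P(A)$ by sampling die rolls with NumPy and comparing the empirical frequency of even outcomes to the exact value 0.5.

import numpy as np

# Estimate P(even) for a fair die by simulation (illustrative sketch)
np.random.seed(42)
rolls = np.random.randint(1, 7, size=100_000)   # Outcomes 1..6
p_even_empirical = np.mean(rolls % 2 == 0)
print(f"Empirical P(even): {p_even_empirical:.4f}")  # ~0.5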

4.1.2 Conditional Probability and the Product Rule

The conditional probability of event $A$ given that event $B$ has occurred is:

$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0 $$

Rearranging gives us the product rule (also called the chain rule of probability):

$$ P(A \cap B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A) $$

This generalizes to any number of events:

$$ P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \prod_{i=2}^{n} P(A_i \mid A_1, \ldots, A_{i-1}) $$

The chain rule is fundamental to autoregressive language models, which factor the probability of a sentence as a product of conditional word probabilities -- a connection we will explore in depth in Chapter 21.
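
As a toy illustration, the sketch below (with invented conditional probabilities) scores a three-word "sentence" exactly as the chain rule prescribes, accumulating log-probabilities for numerical stability.

import numpy as np

# Toy chain-rule factorization (probabilities invented for illustration):
# p(w1, w2, w3) = p(w1) * p(w2 | w1) * p(w3 | w1, w2)
conditionals = np.array([0.2, 0.5, 0.1])     # p(w1), p(w2|w1), p(w3|w1,w2)
log_p = np.sum(np.log(conditionals))         # Sum of log-probabilities
print(f"p(sentence) = {np.exp(log_p):.4f}")  # 0.2 * 0.5 * 0.1 = 0.0100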

4.1.3 The Law of Total Probability and Marginalization

If $B_1, B_2, \ldots, B_n$ form a partition of the sample space (mutually exclusive and exhaustive), then for any event $A$:

$$ P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i) $$

This is the law of total probability, and it is the foundation of marginalization -- the process of summing (or integrating) over variables we do not observe. In AI, marginalization appears whenever we compute predictions that account for all possible values of latent variables, as in mixture models and variational autoencoders (Chapter 16).

4.1.4 Independence and Conditional Independence

Two events $A$ and $B$ are independent if:

$$ P(A \cap B) = P(A) \, P(B) $$

Equivalently, knowing $B$ tells us nothing about $A$: $P(A \mid B) = P(A)$.

Events $A$ and $B$ are conditionally independent given $C$ if:

$$ P(A \cap B \mid C) = P(A \mid C) \, P(B \mid C) $$

Conditional independence is the key simplifying assumption behind Naive Bayes classifiers (Case Study 1) and graphical models. It allows us to decompose complex joint distributions into manageable factors.

4.1.5 Bayes' Theorem

Combining the definition of conditional probability with the product rule gives us one of the most important results in all of AI:

$$ P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)} $$

where:

  • $P(H \mid D)$ is the posterior -- our updated belief about hypothesis $H$ after observing data $D$
  • $P(D \mid H)$ is the likelihood -- the probability of observing the data under the hypothesis
  • $P(H)$ is the prior -- our belief about $H$ before seeing data
  • $P(D)$ is the evidence (or marginal likelihood) -- a normalizing constant

Intuition: Bayes' theorem is a principled update rule. You start with a prior belief, observe evidence, and the theorem tells you exactly how to revise your belief. The stronger the evidence, the more your posterior departs from the prior.

Worked Example: Medical Diagnosis

Problem: A disease affects 1% of the population. A test has 95% sensitivity (true positive rate) and 90% specificity (true negative rate). A person tests positive. What is the probability they actually have the disease?

Given:

  • $P(\text{disease}) = 0.01$, so $P(\text{no disease}) = 0.99$
  • $P(\text{positive} \mid \text{disease}) = 0.95$
  • $P(\text{positive} \mid \text{no disease}) = 0.10$

Solution:

Step 1: Compute the total probability of testing positive using the law of total probability:

$$ P(\text{positive}) = P(\text{pos} \mid \text{disease}) \, P(\text{disease}) + P(\text{pos} \mid \text{no disease}) \, P(\text{no disease}) $$

$$ = 0.95 \times 0.01 + 0.10 \times 0.99 = 0.0095 + 0.099 = 0.1085 $$

Step 2: Apply Bayes' theorem:

$$ P(\text{disease} \mid \text{positive}) = \frac{0.95 \times 0.01}{0.1085} \approx 0.0876 $$

Answer: Despite the positive test result, there is only about an 8.8% chance the person has the disease.

Interpretation: The low prior probability of the disease (1%) dominates. This counterintuitive result -- known as the base rate fallacy when people ignore it -- is one of the most practically important lessons from probability theory. It directly motivates the prior term in Bayesian machine learning, which we will encounter again in Chapter 10.

import numpy as np

# Bayes' theorem: Medical diagnosis example
p_disease = 0.01
p_positive_given_disease = 0.95
p_positive_given_no_disease = 0.10

# Law of total probability
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_no_disease * (1 - p_disease))

# Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print(f"P(disease | positive test): {p_disease_given_positive:.4f}")  # 0.0876

# What if we test again and it is positive again (assuming independent tests)?
p_disease_after_two = (p_positive_given_disease * p_disease_given_positive) / (
    p_positive_given_disease * p_disease_given_positive +
    p_positive_given_no_disease * (1 - p_disease_given_positive)
)
print(f"P(disease | two positive tests): {p_disease_after_two:.4f}")  # ~0.4757

Notice how the posterior probability jumps from 8.8% to about 47.6% with a second positive test. This illustrates how Bayesian updating accumulates evidence: each new observation refines our belief.

4.1.6 Bayesian vs. Frequentist Perspectives

The meaning of "probability" itself has been debated for centuries, and the two dominant interpretations---frequentist and Bayesian---lead to fundamentally different approaches to statistical inference.

The Frequentist View. Probability is the long-run frequency of an event. The statement "$P(\text{heads}) = 0.5$" means that if you flip a coin infinitely many times, exactly half of the outcomes will be heads. In this framework, parameters are fixed (but unknown) constants, and probability statements can only be made about data given parameters: $p(\mathcal{D} \mid \theta)$.

The Bayesian View. Probability represents a degree of belief or state of knowledge. The statement "$P(\theta = 0.5) = 0.8$" means "I am 80% confident that $\theta$ is 0.5." In this framework, parameters themselves have probability distributions, and we can make statements like $p(\theta \mid \mathcal{D})$---the probability of parameter values given the observed data.

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed unknown constants | Random variables with distributions |
| Inference | Point estimates and confidence intervals | Posterior distributions and credible intervals |
| Prior knowledge | Not formally incorporated | Encoded via prior $p(\theta)$ |
| Key output | $p$-values, confidence intervals | Posterior $p(\theta \mid \mathcal{D})$ |
| Computational cost | Usually cheaper | Can be expensive (MCMC, variational inference) |
| With infinite data | Both agree | Both agree |

In modern AI, both perspectives are valuable. Training a neural network with cross-entropy loss is a frequentist procedure (MLE). Adding weight decay is a Bayesian procedure (MAP with Gaussian prior). Bayesian neural networks (Chapter 10) maintain full posterior distributions over weights, enabling uncertainty quantification. In practice, the most effective AI systems often blend elements of both paradigms.

Historical Note. The Bayesian-frequentist debate has raged since the 18th century. Thomas Bayes (1701--1761) and Pierre-Simon Laplace laid the foundations of Bayesian inference, while Ronald Fisher, Jerzy Neyman, and Egon Pearson developed the frequentist framework in the early 20th century. Modern computational advances---particularly Markov Chain Monte Carlo (MCMC) methods, developed in the 1950s for physics and widely adopted in statistics by the 1990s---have made Bayesian methods practical for complex models, reigniting interest in the Bayesian approach.


4.2 Random Variables and Probability Distributions

With the foundations in place, we now formalize how to describe uncertain quantities numerically. A random variable is a function that maps outcomes in the sample space to real numbers. This abstraction lets us apply all the tools of calculus and linear algebra to probabilistic reasoning.

4.2.1 Discrete Random Variables and PMFs

A discrete random variable $X$ takes values from a countable set. Its behavior is fully described by its probability mass function (PMF):

$$ p(x) = P(X = x), \quad \sum_{x} p(x) = 1 $$

4.2.2 Continuous Random Variables and PDFs

A continuous random variable $X$ takes values from an uncountable set (typically a subset of $\mathbb{R}$). Its behavior is described by a probability density function (PDF) $f(x)$ such that:

$$ P(a \leq X \leq b) = \int_a^b f(x) \, dx, \quad \int_{-\infty}^{\infty} f(x) \, dx = 1 $$

Note that $f(x)$ itself is not a probability -- it can exceed 1. Only the integral over an interval gives a probability.

4.2.3 Expectation, Variance, and Covariance

The expectation (mean) of a random variable $X$ is the probability-weighted average of its values:

$$ \mathbb{E}[X] = \begin{cases} \sum_x x \, p(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} x \, f(x) \, dx & \text{(continuous)} \end{cases} $$

Key properties of expectation:

  • Linearity: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ (always, even if $X$ and $Y$ are dependent)
  • Function of a random variable: $\mathbb{E}[g(X)] = \sum_x g(x) p(x)$ (discrete) or $\int g(x) f(x) \, dx$ (continuous)

The variance measures spread around the mean:

$$ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$

The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$, which has the same units as $X$.

The covariance between two random variables measures their linear co-movement:

$$ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] $$

If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$ (but the converse is not true in general). The correlation coefficient normalizes covariance to $[-1, 1]$:

$$ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

In Chapter 2, we represented collections of features as vectors. The covariance matrix $\Sigma$ extends the concept of variance to multivariate data, with $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. This matrix is symmetric and positive semi-definite -- properties we studied in the context of eigendecomposition.
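
The sketch below (assuming nothing beyond standard NumPy) estimates a covariance matrix and correlation coefficient from simulated data; note that np.cov divides by $n-1$ by default.

import numpy as np

# Covariance and correlation from samples (illustrative sketch)
np.random.seed(42)
x = np.random.randn(10_000)
y = 0.8 * x + 0.6 * np.random.randn(10_000)   # y is correlated with x

data = np.stack([x, y])         # Shape (2, n): rows are variables
cov_matrix = np.cov(data)       # 2x2 covariance matrix (divides by n-1)
corr = cov_matrix[0, 1] / np.sqrt(cov_matrix[0, 0] * cov_matrix[1, 1])
print(f"Covariance matrix:\n{cov_matrix.round(3)}")
print(f"Correlation rho(x, y): {corr:.3f}")   # ~0.8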

4.2.4 Common Discrete Distributions

The following distributions appear throughout AI and machine learning:

Bernoulli Distribution. Models a single binary trial with success probability $p$:

$$ P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\} $$

  • $\mathbb{E}[X] = p$, $\text{Var}(X) = p(1-p)$
  • AI application: Binary classification outputs, coin flips, spam/not-spam labels

Binomial Distribution. The number of successes in $n$ independent Bernoulli trials:

$$ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} $$

  • $\mathbb{E}[X] = np$, $\text{Var}(X) = np(1-p)$
  • AI application: Counting successes in batch evaluations, A/B testing

Categorical Distribution. Generalizes Bernoulli to $K$ categories with probabilities $p_1, \ldots, p_K$ where $\sum_k p_k = 1$. A single draw produces a one-hot vector.

  • AI application: Multi-class classification, language model next-token prediction

Poisson Distribution. Models the number of events in a fixed interval:

$$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \{0, 1, 2, \ldots\} $$

  • $\mathbb{E}[X] = \lambda$, $\text{Var}(X) = \lambda$
  • AI application: Count data, event modeling, arrival processes

4.2.5 Common Continuous Distributions

Uniform Distribution. All values in $[a, b]$ are equally likely:

$$ f(x) = \frac{1}{b - a}, \quad a \leq x \leq b $$

  • $\mathbb{E}[X] = \frac{a+b}{2}$, $\text{Var}(X) = \frac{(b-a)^2}{12}$
  • AI application: Random initialization, baseline for entropy comparisons

Gaussian (Normal) Distribution. The most important continuous distribution in AI:

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$

  • $\mathbb{E}[X] = \mu$, $\text{Var}(X) = \sigma^2$
  • Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$
  • AI applications: Weight initialization, noise modeling, variational autoencoders, Gaussian processes

The multivariate Gaussian extends to vector-valued random variables $\mathbf{x} \in \mathbb{R}^d$:

$$ f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) $$

Here $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix. Notice how the matrix inverse and determinant from Chapter 2 appear naturally.

Real-World Application: The Gaussian distribution's dominance in AI is not merely conventional. The Central Limit Theorem guarantees that sums of many independent random variables converge to a Gaussian, regardless of the original distributions. Since many quantities in machine learning arise as aggregates of many small effects, the Gaussian is often a natural and well-justified modeling choice.
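
A small simulation (illustrative only) makes the Central Limit Theorem concrete: averages of uniform random variables, which are far from Gaussian individually, have exactly the mean and standard deviation the CLT predicts.

import numpy as np

# Central Limit Theorem sketch: averages of uniforms concentrate like a Gaussian
np.random.seed(42)
n_vars, n_trials = 30, 100_000
averages = np.random.uniform(0, 1, size=(n_trials, n_vars)).mean(axis=1)

# CLT prediction: mean 0.5, standard deviation sqrt((1/12) / n_vars)
print(f"Empirical mean: {averages.mean():.4f} (CLT predicts 0.5000)")
print(f"Empirical std:  {averages.std():.4f} (CLT predicts {np.sqrt(1 / 12 / n_vars):.4f})")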

Exponential Distribution. Models the time between events in a Poisson process:

$$ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 $$

  • $\mathbb{E}[X] = 1/\lambda$, $\text{Var}(X) = 1/\lambda^2$
  • AI application: Survival analysis, time-to-event modeling

Beta Distribution. A distribution over probabilities, parameterized by $\alpha, \beta > 0$:

$$ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \leq x \leq 1 $$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the Beta function.

  • $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$, $\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
  • AI application: Prior for Bernoulli parameters in Bayesian inference, Thompson sampling in bandit algorithms

The Beta distribution is particularly important because it is a conjugate prior for the Bernoulli and Binomial likelihoods (we will explain conjugacy in Section 4.3).

Gamma Distribution. A flexible distribution for positive-valued random variables, parameterized by shape $\alpha > 0$ and rate $\beta > 0$:

$$ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0 $$

  • $\mathbb{E}[X] = \alpha / \beta$, $\text{Var}(X) = \alpha / \beta^2$
  • The Exponential distribution is the special case $\alpha = 1$
  • AI application: Prior for precision parameters (inverse variance) in Bayesian models, modeling waiting times and rates

Dirichlet Distribution. The multivariate generalization of the Beta distribution, defined over the probability simplex. For a $K$-dimensional probability vector $\mathbf{p} = (p_1, \ldots, p_K)$ with $\sum_k p_k = 1$:

$$ f(\mathbf{p}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} $$

  • $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ are the concentration parameters
  • $\mathbb{E}[p_k] = \alpha_k / \sum_j \alpha_j$
  • AI application: Prior for Categorical distributions, topic models (Latent Dirichlet Allocation), Bayesian multi-class classification

import numpy as np

# Sampling from common distributions
np.random.seed(42)

# Bernoulli: binary outcomes
bernoulli_samples = np.random.binomial(1, p=0.7, size=1000)
print(f"Bernoulli mean (should be ~0.7): {bernoulli_samples.mean():.3f}")

# Gaussian: continuous, bell-shaped
gaussian_samples = np.random.normal(loc=5.0, scale=2.0, size=1000)
print(f"Gaussian mean (should be ~5.0): {gaussian_samples.mean():.3f}")
print(f"Gaussian std (should be ~2.0): {gaussian_samples.std():.3f}")

# Beta: distribution over probabilities
beta_samples = np.random.beta(a=2, b=5, size=1000)
print(f"Beta mean (should be ~{2/(2+5):.3f}): {beta_samples.mean():.3f}")

# Dirichlet: distribution over probability vectors
dirichlet_samples = np.random.dirichlet(alpha=[2, 3, 5], size=5)
print(f"Dirichlet samples (rows sum to 1):\n{dirichlet_samples.round(3)}")
print(f"Row sums: {dirichlet_samples.sum(axis=1).round(3)}")

| Distribution | Type | Parameters | Mean | Variance | AI Use Case |
|---|---|---|---|---|---|
| Bernoulli | Discrete | $p$ | $p$ | $p(1-p)$ | Binary classification |
| Binomial | Discrete | $n, p$ | $np$ | $np(1-p)$ | Batch success counting |
| Categorical | Discrete | $p_1, \ldots, p_K$ | -- | -- | Multi-class output |
| Poisson | Discrete | $\lambda$ | $\lambda$ | $\lambda$ | Count data |
| Uniform | Continuous | $a, b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | Initialization |
| Gaussian | Continuous | $\mu, \sigma^2$ | $\mu$ | $\sigma^2$ | Everywhere |
| Exponential | Continuous | $\lambda$ | $1/\lambda$ | $1/\lambda^2$ | Time-to-event |
| Beta | Continuous | $\alpha, \beta$ | $\frac{\alpha}{\alpha+\beta}$ | -- | Bayesian priors |
| Gamma | Continuous | $\alpha, \beta$ | $\alpha/\beta$ | $\alpha/\beta^2$ | Precision priors |
| Dirichlet | Continuous | $\boldsymbol{\alpha}$ | $\alpha_k / \sum_j \alpha_j$ | -- | Topic models |

4.3 Statistical Estimation

Probability distributions describe the data-generating process, but in practice we do not know the true distribution. We must estimate distribution parameters from observed data. This section covers the two most important estimation frameworks in AI: maximum likelihood estimation and maximum a posteriori estimation.

4.3.1 The Estimation Problem

Suppose we observe data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ drawn independently from a distribution $p(x \mid \theta)$ parameterized by $\theta$. Our goal is to find the value of $\theta$ that best explains the observed data.

The assumption that observations are independent and identically distributed (i.i.d.) is crucial. It lets us write the joint probability as a product:

$$ p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) $$

4.3.2 Maximum Likelihood Estimation (MLE)

The maximum likelihood estimator chooses the parameter value that maximizes the likelihood of the observed data:

$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \, p(\mathcal{D} \mid \theta) = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) $$

Because products are numerically unstable and analytically inconvenient, we almost always work with the log-likelihood:

$$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta) $$

The log transformation is monotonic, so maximizing the log-likelihood yields the same solution as maximizing the likelihood itself.

Intuition: MLE asks: "If I had to pick one set of parameter values, which values would make the data I actually observed as probable as possible?" It is a purely data-driven approach with no prior assumptions about $\theta$.

Worked Example: MLE for a Gaussian

Problem: Given observations $x_1, \ldots, x_n$ from a Gaussian $\mathcal{N}(\mu, \sigma^2)$, find the MLE for $\mu$ and $\sigma^2$.

Solution:

The log-likelihood is:

$$ \ell(\mu, \sigma^2) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 $$

Taking the derivative with respect to $\mu$ and setting it to zero (using the calculus from Chapter 3):

$$ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \quad \Rightarrow \quad \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} $$

Similarly, taking the derivative with respect to $\sigma^2$:

$$ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 $$

Interpretation: The MLE for the mean is just the sample mean, and the MLE for the variance is the (biased) sample variance. These are the "most natural" estimates, and MLE gives us a principled justification for using them.

Common Pitfall: The MLE variance estimator divides by $n$, not $n-1$. The $n-1$ version (Bessel's correction) is an unbiased estimator, but it is not the MLE. In practice, for large $n$ the difference is negligible.
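
The following sketch verifies these formulas numerically and highlights the $n$ vs. $n-1$ distinction from the pitfall above; np.var uses ddof=0 (the MLE) by default.

import numpy as np

# MLE for a Gaussian: sample mean and (biased) sample variance
np.random.seed(42)
x = np.random.normal(loc=3.0, scale=2.0, size=1_000)

mu_mle = x.mean()
var_mle = np.mean((x - mu_mle) ** 2)   # Divides by n (the MLE)
var_unbiased = x.var(ddof=1)           # Divides by n-1 (Bessel's correction)

print(f"MLE mean: {mu_mle:.4f} (true value 3.0)")
print(f"MLE variance: {var_mle:.4f} (true value 4.0)")
print(f"Unbiased variance: {var_unbiased:.4f}")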

4.3.3 MLE and Cross-Entropy Loss

There is a deep and practically important connection between MLE and the cross-entropy loss function used to train classifiers and language models. Consider a classification problem where the model outputs a predicted distribution $q(y \mid x; \theta)$ and the true label is $y$. The negative log-likelihood over the training set is:

$$ \text{NLL}(\theta) = -\sum_{i=1}^{n} \log q(y_i \mid x_i; \theta) $$

This is exactly the cross-entropy between the empirical data distribution and the model's predicted distribution (we will make this precise in Section 4.5). So when you train a neural network with cross-entropy loss, you are performing maximum likelihood estimation.

4.3.4 Maximum a Posteriori (MAP) Estimation

MLE treats all parameter values as equally plausible a priori. MAP estimation incorporates a prior distribution $p(\theta)$ over parameters, using Bayes' theorem:

$$ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid \mathcal{D}) = \arg\max_{\theta} \, p(\mathcal{D} \mid \theta) \, p(\theta) $$

Taking logarithms:

$$ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[\sum_{i=1}^{n} \log p(x_i \mid \theta) + \log p(\theta)\right] $$

The MAP estimate is the MLE objective plus a regularization term $\log p(\theta)$.

Connection to Regularization. If the prior is Gaussian, $p(\theta) = \mathcal{N}(0, \sigma_0^2 I)$, then:

$$ \log p(\theta) = -\frac{1}{2\sigma_0^2}\|\theta\|^2 + \text{const} $$

This is precisely L2 regularization (weight decay), which we encountered in Chapter 3's discussion of optimization and will revisit extensively in Chapter 13. Similarly, a Laplace prior yields L1 regularization (lasso).
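
To make the regularization connection tangible, here is a minimal sketch (assuming a Gaussian likelihood with known variance and a zero-mean Gaussian prior on the mean) in which the MAP estimate is the sample mean shrunk toward the prior mean of zero.

import numpy as np

# MAP estimate of a Gaussian mean with a zero-mean Gaussian prior (sketch)
# Assumes known likelihood variance sigma^2 and prior variance sigma0^2.
np.random.seed(42)
sigma, sigma0 = 1.0, 0.5
x = np.random.normal(loc=2.0, scale=sigma, size=20)   # Small dataset

n = len(x)
mle = x.mean()
# Closed-form MAP: precision-weighted combination of the data and the prior (at 0)
map_estimate = (n / sigma**2 * mle) / (n / sigma**2 + 1 / sigma0**2)

print(f"MLE: {mle:.3f}")
print(f"MAP: {map_estimate:.3f} (shrunk toward the prior mean 0)")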

| Aspect | MLE | MAP |
|---|---|---|
| Prior | No prior (or uniform prior) | Explicit prior $p(\theta)$ |
| Objective | Maximize likelihood | Maximize posterior |
| Regularization | None | Built-in via prior |
| Risk of overfitting | Higher (especially with small data) | Lower (prior acts as regularizer) |
| Limit as $n \to \infty$ | MLE = MAP (data overwhelms prior) | MLE = MAP |

Advanced: Both MLE and MAP produce point estimates. Full Bayesian inference maintains the entire posterior distribution $p(\theta \mid \mathcal{D})$, which captures uncertainty about the parameters. We will explore Bayesian methods in depth in Chapter 10.

4.3.5 Conjugate Priors

In Bayesian inference, computing the posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) p(\theta)$ can be analytically intractable for arbitrary choices of prior and likelihood. A conjugate prior is a prior distribution that, when combined with a particular likelihood function, yields a posterior in the same family as the prior. This makes the math clean and the computation efficient.

The intuition is elegant: the prior encodes "pseudo-observations." A Beta(3, 7) prior for a coin's bias is equivalent to having already observed 3 heads and 7 tails before seeing any data. When new data arrives, we simply update the counts.

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Bernoulli / Binomial | Beta($\alpha, \beta$) | Beta($\alpha + k, \beta + n - k$) |
| Categorical / Multinomial | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{c}$) |
| Poisson | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum x_i, \beta + n$) |
| Gaussian (known variance) | Gaussian($\mu_0, \sigma_0^2$) | Gaussian (updated) |

Worked Example: Beta-Binomial Conjugacy.

Suppose you are estimating the click-through rate (CTR) of an ad. Your prior belief is Beta(2, 8), meaning you expect roughly a 20% CTR but are not very certain. You then observe 15 clicks out of 100 impressions.

$$ p(\theta \mid \text{data}) = \text{Beta}(2 + 15, \, 8 + 85) = \text{Beta}(17, 93) $$

Notice that:

  • the posterior mean is $17 / (17 + 93) \approx 0.155$,
  • this is a compromise between the prior mean (0.20) and the observed rate (0.15),
  • with more data, the posterior concentrates more tightly and the prior matters less.

import numpy as np

# Beta-Binomial conjugacy example
alpha_prior, beta_prior = 2, 8  # Prior: Beta(2, 8)
clicks, impressions = 15, 100    # Observed data

# Posterior parameters
alpha_post = alpha_prior + clicks
beta_post = beta_prior + (impressions - clicks)

# Compare prior, MLE, and posterior means
prior_mean = alpha_prior / (alpha_prior + beta_prior)
mle = clicks / impressions
posterior_mean = alpha_post / (alpha_post + beta_post)

print(f"Prior mean: {prior_mean:.3f}")
print(f"MLE: {mle:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")

# Posterior is a compromise, pulled toward prior
# With more data, posterior converges to MLE
for n in [10, 100, 1000, 10000]:
    k = int(0.15 * n)  # 15% true rate
    post_mean = (alpha_prior + k) / (alpha_prior + beta_prior + n)
    print(f"  n={n:5d}: posterior mean = {post_mean:.4f}")

Conjugate priors are not just a mathematical convenience---they appear in practical systems. Thompson sampling for multi-armed bandits (used by companies like Google and Facebook for A/B testing) relies on Beta-Bernoulli conjugacy to efficiently explore and exploit options. Latent Dirichlet Allocation (LDA) for topic modeling relies on Dirichlet-Multinomial conjugacy.

4.3.6 Properties of Estimators

When comparing estimation methods, we evaluate them by several criteria:

  • Bias: An estimator $\hat{\theta}$ is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$. MLE is often biased for finite samples but asymptotically unbiased.
  • Consistency: $\hat{\theta}$ converges to the true $\theta$ as $n \to \infty$. Both MLE and MAP are consistent under mild conditions.
  • Efficiency: An estimator is efficient if it achieves the lowest possible variance among unbiased estimators. MLE is asymptotically efficient.

The bias-variance tradeoff is one of the central themes of machine learning: MAP estimation introduces bias (through the prior) but reduces variance, often leading to better generalization. This tradeoff is intimately connected to the regularization strategies we will discuss in Chapter 13.

4.3.7 Hypothesis Testing for Model Comparison

In AI engineering, you frequently need to answer questions like "Is model A significantly better than model B?" or "Did this change in the training pipeline actually improve performance?" Hypothesis testing provides a principled framework for answering such questions.

The basic procedure is:

  1. State a null hypothesis $H_0$ (e.g., "the two models have the same accuracy") and an alternative $H_1$ (e.g., "model A is more accurate").
  2. Choose a test statistic that measures how different the observed data is from what $H_0$ predicts.
  3. Compute the $p$-value: the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming $H_0$ is true.
  4. Reject $H_0$ if the $p$-value is below a significance level $\alpha$ (typically 0.05).

Worked Example: Comparing Two Models. Suppose model A achieves 85% accuracy and model B achieves 82% accuracy on a test set of 500 examples. Is this difference statistically significant?

import numpy as np
from scipy import stats

# Model comparison via paired test
np.random.seed(42)
n_test = 500

# Simulate: both models predict on same test set
# Model A correct with prob 0.85, Model B with prob 0.82
model_a_correct = np.random.binomial(1, 0.85, n_test)
model_b_correct = np.random.binomial(1, 0.82, n_test)

# McNemar's test (for paired binary outcomes)
# Count discordant pairs
a_right_b_wrong = np.sum((model_a_correct == 1) & (model_b_correct == 0))
a_wrong_b_right = np.sum((model_a_correct == 0) & (model_b_correct == 1))

# McNemar's chi-squared statistic
chi2 = (a_right_b_wrong - a_wrong_b_right) ** 2 / (a_right_b_wrong + a_wrong_b_right)
p_value = 1 - stats.chi2.cdf(chi2, df=1)

print(f"Model A accuracy: {model_a_correct.mean():.3f}")
print(f"Model B accuracy: {model_b_correct.mean():.3f}")
print(f"McNemar's chi2: {chi2:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Significant at alpha=0.05: {p_value < 0.05}")

Common Pitfall: Never evaluate statistical significance using the training set---this is circular reasoning. Always use a held-out test set, and ideally run multiple trials with different random seeds.

4.3.8 Bootstrapping

Bootstrapping is a powerful resampling technique for estimating the uncertainty of any statistic without making distributional assumptions. The idea is simple: if we do not know the true data distribution, we approximate it with the empirical distribution and resample from that.

The procedure is:

  1. Given a dataset of $n$ observations, create $B$ bootstrap samples by drawing $n$ observations with replacement from the original data.
  2. Compute the statistic of interest on each bootstrap sample.
  3. Use the distribution of bootstrap statistics to estimate confidence intervals, standard errors, or $p$-values.

The intuition is that resampling with replacement simulates the variability we would see if we could collect new datasets from the true distribution.

Worked Example: Bootstrapping Model Accuracy.

import numpy as np

def bootstrap_confidence_interval(data, statistic_fn, n_bootstrap=10000,
                                   confidence=0.95):
    """Compute a bootstrap confidence interval.

    Args:
        data: Array of observations.
        statistic_fn: Function that computes the statistic of interest.
        n_bootstrap: Number of bootstrap samples.
        confidence: Confidence level (e.g., 0.95 for 95% CI).

    Returns:
        Tuple of (lower_bound, upper_bound).
    """
    np.random.seed(42)
    n = len(data)
    bootstrap_stats = np.array([
        statistic_fn(data[np.random.randint(0, n, n)])
        for _ in range(n_bootstrap)
    ])
    alpha = 1 - confidence
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))
    return lower, upper

# Example: 95% CI for model accuracy
predictions_correct = np.array([1]*85 + [0]*15)  # 85% accuracy on 100 samples
lower, upper = bootstrap_confidence_interval(predictions_correct, np.mean)
print(f"Accuracy: {predictions_correct.mean():.2f}")
print(f"95% Bootstrap CI: [{lower:.3f}, {upper:.3f}]")

Bootstrapping is especially useful in AI because it makes no assumptions about the distribution of errors---it works for any metric (accuracy, F1, BLEU, etc.) and any model. It is the standard method for reporting confidence intervals in machine learning papers.

4.3.9 Concentration Inequalities: An Overview

While the law of large numbers tells us that sample averages converge to population averages, concentration inequalities tell us how fast this convergence happens. They provide finite-sample guarantees that are essential for understanding generalization in machine learning.

The most important concentration inequalities for AI are:

Markov's inequality: For any non-negative random variable $X$ and $t > 0$:

$$ P(X \geq t) \leq \frac{\mathbb{E}[X]}{t} $$

Chebyshev's inequality: For any random variable $X$ with finite variance and $t > 0$:

$$ P(|X - \mathbb{E}[X]| \geq t) \leq \frac{\text{Var}(X)}{t^2} $$

Hoeffding's inequality: For $n$ independent bounded random variables $X_i \in [a_i, b_i]$ and their average $\bar{X} = \frac{1}{n}\sum X_i$:

$$ P(|\bar{X} - \mathbb{E}[\bar{X}]| \geq t) \leq 2\exp\left(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right) $$

The intuition is that averages of bounded random variables concentrate around their expected value exponentially fast in $n$. This is why test set evaluation is reliable: with 10,000 test samples, the observed accuracy is extremely unlikely to differ from the true accuracy by more than a few percentage points.
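
The sketch below (illustrative only) compares the Hoeffding bound to the actual spread of accuracies across simulated test sets of 10,000 examples; the bound is loose, but it captures the exponential concentration.

import numpy as np

# Hoeffding's inequality vs. empirical deviations of test accuracy (sketch)
np.random.seed(42)
true_acc, n_test, n_sim = 0.9, 10_000, 1_000

# Simulate many test sets and record the observed accuracy on each
observed = np.random.binomial(n_test, true_acc, size=n_sim) / n_test

t = 0.01  # Deviation of one percentage point
empirical = np.mean(np.abs(observed - true_acc) >= t)
hoeffding_bound = 2 * np.exp(-2 * n_test * t**2)

print(f"Empirical P(|acc_hat - acc| >= {t}): {empirical:.4f}")
print(f"Hoeffding bound:                     {hoeffding_bound:.4f}")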

Hoeffding's inequality is a building block of PAC (Probably Approximately Correct) learning theory, which provides theoretical guarantees on how many samples a learning algorithm needs to achieve a desired accuracy. We will encounter these ideas again in Chapter 13 when we discuss generalization bounds.


4.4 Joint Distributions, Marginals, and the Multivariate World

Real-world AI problems involve multiple random variables. Understanding how they interact is essential for building models that capture the structure of data.

4.4.1 Joint and Marginal Distributions

For two discrete random variables $X$ and $Y$, the joint PMF is $p(x, y) = P(X = x, Y = y)$. The marginal PMF of $X$ is obtained by summing over $Y$:

$$ p(x) = \sum_y p(x, y) $$

For continuous variables, summation becomes integration:

$$ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy $$

4.4.2 Conditional Distributions

The conditional distribution of $Y$ given $X = x$ is:

$$ p(y \mid x) = \frac{p(x, y)}{p(x)} $$

This is the foundation of discriminative models in machine learning: we model $p(y \mid x)$ directly, predicting labels $y$ given features $x$. In contrast, generative models model the joint distribution $p(x, y)$ and use Bayes' theorem to derive predictions.
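
A concrete sketch with a small, made-up joint PMF shows marginalization as a row or column sum and conditioning as a renormalized slice, following the formulas above.

import numpy as np

# Joint PMF over (X, Y) as a table; the numbers are invented for illustration
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])        # Rows index x, columns index y

p_x = joint.sum(axis=1)                 # Marginal of X: sum over y
p_y = joint.sum(axis=0)                 # Marginal of Y: sum over x
p_y_given_x0 = joint[0] / p_x[0]        # Conditional p(y | X = x_0)

print(f"p(x): {p_x}")                   # [0.3 0.7]
print(f"p(y): {p_y}")                   # [0.4 0.6]
print(f"p(y | x_0): {p_y_given_x0}")    # [0.333... 0.666...]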

4.4.3 The Gaussian Mixture Model (Preview)

A powerful way to model complex distributions is as a mixture of simpler ones. The Gaussian Mixture Model (GMM) defines:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) $$

where $\pi_k$ are mixing weights with $\sum_k \pi_k = 1$. Each component $k$ is a Gaussian with its own mean $\mu_k$ and covariance $\Sigma_k$. GMMs illustrate how simple distributions combine to model complex, multimodal data. We will study GMMs in detail in Chapter 7 when we cover unsupervised learning.
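
A minimal sketch of sampling from a two-component GMM (with invented parameters) shows how a mixture generates multimodal data: first draw a component from the Categorical mixing weights, then draw from that component's Gaussian.

import numpy as np

# Sampling from a 2-component Gaussian mixture (invented parameters)
np.random.seed(42)
pi = np.array([0.3, 0.7])      # Mixing weights
mu = np.array([-2.0, 3.0])     # Component means
sigma = np.array([0.5, 1.0])   # Component standard deviations

n = 10_000
components = np.random.choice(2, size=n, p=pi)   # Component index for each sample
samples = np.random.normal(mu[components], sigma[components])

print(f"Fraction from component 0: {(components == 0).mean():.3f}")  # ~0.3
print(f"Overall mean: {samples.mean():.3f} (theory: {pi @ mu:.3f})")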

4.4.4 Sampling and Monte Carlo Methods

When we cannot compute expectations or marginals analytically, we turn to Monte Carlo methods: generate samples from a distribution and use sample averages to approximate expectations:

$$ \mathbb{E}[g(X)] \approx \frac{1}{N}\sum_{i=1}^{N} g(x_i), \quad x_i \sim p(x) $$

By the law of large numbers, this approximation converges to the true expectation as $N \to \infty$. Monte Carlo estimation is the backbone of many modern AI techniques, from dropout training to variational inference to reinforcement learning (Chapter 36).

import numpy as np

# Monte Carlo estimation of E[X^2] where X ~ N(0, 1)
# Analytical answer: Var(X) + E[X]^2 = 1 + 0 = 1
np.random.seed(42)
samples = np.random.randn(100_000)
mc_estimate = np.mean(samples ** 2)
print(f"Monte Carlo estimate of E[X^2]: {mc_estimate:.4f}")  # ~1.0000

4.5 Information Theory

Information theory, founded by Claude Shannon in 1948, provides a mathematical framework for quantifying information, uncertainty, and the difference between probability distributions. These concepts have become central to modern AI, appearing in loss functions, generative models, representation learning, and evaluation metrics.

4.5.1 Entropy

The entropy of a discrete random variable $X$ with PMF $p(x)$ measures the average amount of "surprise" or "uncertainty" in the distribution:

$$ H(X) = -\sum_x p(x) \log p(x) $$

where the convention is $0 \log 0 = 0$ (by continuity).

Intuition: Entropy measures how unpredictable a random variable is. A fair coin has maximum entropy among binary outcomes ($H = 1$ bit when using $\log_2$). A biased coin with $p = 0.99$ has low entropy -- you can almost always predict the outcome.

Key properties of entropy:

  • $H(X) \geq 0$ (entropy is non-negative)
  • $H(X) = 0$ if and only if $X$ is deterministic
  • For $K$ outcomes, $H(X) \leq \log K$, with equality when $X$ is uniform
  • $H(X, Y) \leq H(X) + H(Y)$, with equality when $X$ and $Y$ are independent

Worked Example: Computing Entropy.

import numpy as np

def entropy(p):
    """Compute entropy of a discrete distribution.

    Args:
        p: Array of probabilities (must sum to 1).

    Returns:
        Entropy in nats.
    """
    p = p[p > 0]  # Remove zeros to avoid log(0)
    return -np.sum(p * np.log(p))

# Fair coin: maximum entropy for binary outcomes
p_fair = np.array([0.5, 0.5])
print(f"Fair coin entropy: {entropy(p_fair):.4f} nats ({entropy(p_fair)/np.log(2):.4f} bits)")

# Biased coin: lower entropy
p_biased = np.array([0.9, 0.1])
print(f"Biased coin entropy: {entropy(p_biased):.4f} nats")

# Uniform over 6 outcomes (fair die): maximum entropy for K=6
p_die = np.ones(6) / 6
print(f"Fair die entropy: {entropy(p_die):.4f} nats (max = ln(6) = {np.log(6):.4f})")

# A language model's next-word distribution
# Peaked distribution = confident prediction = low entropy
p_confident = np.array([0.8, 0.1, 0.05, 0.03, 0.02])
p_uncertain = np.ones(5) / 5
print(f"\nConfident prediction entropy: {entropy(p_confident):.4f}")
print(f"Uncertain prediction entropy: {entropy(p_uncertain):.4f}")

This example illustrates why entropy is called a measure of "surprise": a confident prediction (low entropy) means most of the probability mass is concentrated on one outcome, so the actual outcome is rarely surprising. A uniform prediction (maximum entropy) means every outcome is equally surprising.

The connection to optimal coding. Shannon's source coding theorem tells us that the entropy $H(X)$ is the minimum average number of bits needed to encode outcomes of $X$. This is why entropy is measured in bits (base-2 log) or nats (natural log). In AI, we typically use natural logarithms.

For a continuous random variable with PDF $f(x)$, we define the differential entropy:

$$ h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx $$

Unlike discrete entropy, differential entropy can be negative. For a Gaussian $\mathcal{N}(\mu, \sigma^2)$:

$$ h(X) = \frac{1}{2}\log(2\pi e \sigma^2) $$

This is the maximum differential entropy among all distributions with the same variance -- another reason the Gaussian is so fundamental.

4.5.2 Cross-Entropy

The cross-entropy between a true distribution $p$ and a model distribution $q$ is:

$$ H(p, q) = -\sum_x p(x) \log q(x) $$

Cross-entropy measures the average number of bits needed to encode data from distribution $p$ using a code optimized for distribution $q$. It satisfies:

$$ H(p, q) \geq H(p) $$

with equality if and only if $q = p$.

Real-World Application: Cross-entropy is the standard loss function for classification in neural networks. Given a one-hot true label $y$ and predicted probabilities $\hat{y}$, the cross-entropy loss for a single example is $-\sum_k y_k \log \hat{y}_k = -\log \hat{y}_c$ where $c$ is the true class. Minimizing cross-entropy is equivalent to maximum likelihood estimation (as we saw in Section 4.3.3).

Binary cross-entropy for a single Bernoulli trial with true probability $p$ and predicted probability $q$:

$$ H(p, q) = -[p \log q + (1-p) \log(1-q)] $$

This is the loss function used for binary classification and is a building block of the loss in logistic regression (Chapter 6).
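
A short sketch evaluates this loss for a positive label at several predicted probabilities, illustrating how the penalty grows rapidly as the prediction becomes confidently wrong.

import numpy as np

# Binary cross-entropy for a positive label (y = 1) at several predictions
def binary_cross_entropy(y, q):
    q = np.clip(q, 1e-12, 1 - 1e-12)   # Avoid log(0)
    return -(y * np.log(q) + (1 - y) * np.log(1 - q))

for q in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"q = {q:4.2f}  ->  BCE = {binary_cross_entropy(1, q):.3f}")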

4.5.3 Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence from distribution $q$ to distribution $p$ measures how much information is lost when $q$ is used to approximate $p$:

$$ D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p) $$

Key properties:

  • $D_{\text{KL}}(p \| q) \geq 0$ (Gibbs' inequality), with equality iff $p = q$
  • $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general -- KL divergence is not symmetric
  • It is not a true distance metric (it violates symmetry and the triangle inequality)

Intuition: KL divergence is the "extra cost" of encoding data from $p$ using a code designed for $q$, beyond the optimal cost $H(p)$. Since the true distribution $p$ has fixed entropy, minimizing cross-entropy $H(p, q)$ with respect to $q$ is equivalent to minimizing KL divergence.

Forward vs. Reverse KL. The asymmetry of KL divergence leads to important practical differences:

  • Forward KL $D_{\text{KL}}(p \| q)$: Often called the "moment-matching" direction. It forces $q$ to cover all modes of $p$ (mass-covering behavior). Used when fitting models to data via MLE.
  • Reverse KL $D_{\text{KL}}(q \| p)$: Often called the "mode-seeking" direction. It encourages $q$ to concentrate on a single mode of $p$ (mode-seeking behavior). Used in variational inference (Chapter 16).

| Property | Forward KL $D_{\text{KL}}(p \Vert q)$ | Reverse KL $D_{\text{KL}}(q \Vert p)$ |
|---|---|---|
| Optimizes | $q$ to cover all of $p$ | $q$ to match a mode of $p$ |
| Behavior | Mean-seeking / mass-covering | Mode-seeking |
| Penalty | High when $q(x) \approx 0$ but $p(x) > 0$ | High when $p(x) \approx 0$ but $q(x) > 0$ |
| Use case | MLE, training with data | Variational inference |
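
A discrete toy example (with invented distributions) makes the asymmetry concrete: for a bimodal $p$, the forward KL prefers a broad, mass-covering $q$, while the reverse KL prefers a $q$ that locks onto a single mode.

import numpy as np

# Forward vs. reverse KL on a discrete toy example (illustrative sketch)
p  = np.array([0.45, 0.05, 0.05, 0.45])   # Bimodal "true" distribution
q1 = np.array([0.25, 0.25, 0.25, 0.25])   # Mass-covering candidate
q2 = np.array([0.88, 0.04, 0.04, 0.04])   # Mode-seeking candidate

def kl(a, b):
    return np.sum(a * np.log(a / b))

print(f"Forward KL  D(p||q1) = {kl(p, q1):.3f}   D(p||q2) = {kl(p, q2):.3f}")
print(f"Reverse KL  D(q1||p) = {kl(q1, p):.3f}   D(q2||p) = {kl(q2, p):.3f}")
# Forward KL favors the broad q1; reverse KL favors the single-mode q2.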

4.5.4 Mutual Information

Mutual information quantifies how much knowing one variable tells us about another:

$$ I(X; Y) = D_{\text{KL}}(p(x, y) \| p(x)p(y)) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} $$

Equivalently:

$$ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) $$

where $H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y)$ is the conditional entropy.

Key properties:

  • $I(X; Y) \geq 0$, with equality iff $X$ and $Y$ are independent
  • $I(X; Y) = I(Y; X)$ (symmetric, unlike KL divergence)
  • $I(X; X) = H(X)$ (a variable is maximally informative about itself)

Real-World Application: Mutual information is used in feature selection to identify the most informative features for a prediction task. It captures both linear and nonlinear dependencies, making it more general than correlation. In representation learning, maximizing mutual information between inputs and learned representations is the principle behind methods like InfoNCE and contrastive learning (Chapter 16).

Worked Example: Mutual Information for Feature Selection.

import numpy as np

def estimate_mutual_information(x, y, n_bins=10):
    """Estimate mutual information between two variables using binning.

    Args:
        x: First variable (1D array).
        y: Second variable (1D array).
        n_bins: Number of bins for discretization.

    Returns:
        Estimated mutual information in nats.
    """
    # Discretize continuous variables
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), n_bins))
    y_binned = np.digitize(y, np.linspace(y.min(), y.max(), n_bins))

    # Compute joint and marginal distributions
    joint_hist = np.histogram2d(x_binned, y_binned, bins=n_bins)[0]
    joint_prob = joint_hist / joint_hist.sum()

    # Marginals
    px = joint_prob.sum(axis=1)
    py = joint_prob.sum(axis=0)

    # Mutual information
    mi = 0.0
    for i in range(n_bins):
        for j in range(n_bins):
            if joint_prob[i, j] > 0 and px[i] > 0 and py[j] > 0:
                mi += joint_prob[i, j] * np.log(
                    joint_prob[i, j] / (px[i] * py[j])
                )
    return mi

np.random.seed(42)
n = 5000

# Feature 1: strongly related to target
x1 = np.random.randn(n)
y = 2 * x1 + 0.5 * np.random.randn(n)  # Linear relationship

# Feature 2: nonlinearly related to target
x2 = np.random.randn(n)
y_nl = np.sin(3 * x2) + 0.3 * np.random.randn(n)

# Feature 3: independent of target
x3 = np.random.randn(n)

print(f"MI(x1, y) [linear]:     {estimate_mutual_information(x1, y):.3f}")
print(f"MI(x2, y_nl) [nonlinear]: {estimate_mutual_information(x2, y_nl):.3f}")
print(f"MI(x3, y) [independent]: {estimate_mutual_information(x3, y):.3f}")

Unlike correlation, mutual information detects the nonlinear relationship between $x_2$ and $y_{\text{nl}}$, making it a more general measure of dependence. This is why MI-based feature selection can outperform correlation-based methods when the relationships in your data are nonlinear.

4.5.5 Channel Capacity and the Limits of Communication

Shannon's original motivation for information theory was communication: how much information can be reliably transmitted over a noisy channel? The answer---the channel capacity---provides a fundamental limit that connects to AI in surprising ways.

For a discrete memoryless channel with input $X$ and output $Y$, the channel capacity is:

$$ C = \max_{p(x)} I(X; Y) $$

where:

  • the maximization is over all possible input distributions $p(x)$,
  • $I(X; Y)$ is the mutual information between input and output.

The intuition is that channel capacity measures the maximum rate at which information can be reliably communicated through the channel. Shannon's noisy channel coding theorem guarantees that communication at rates below $C$ can be made virtually error-free, while rates above $C$ inevitably introduce errors.
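
A standard special case is the binary symmetric channel, which flips each bit with probability $\epsilon$ and has capacity $C = 1 - H_2(\epsilon)$ bits per use, where $H_2$ is the binary entropy function. The sketch below computes this with plain NumPy.

import numpy as np

# Capacity of a binary symmetric channel: C = 1 - H2(epsilon) bits per use
def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # Avoid log(0) at the endpoints
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for eps in [0.0, 0.05, 0.11, 0.5]:
    capacity = 1 - binary_entropy(eps)
    print(f"epsilon = {eps:.2f}  ->  capacity = {capacity:.3f} bits per use")
# A noiseless channel (eps = 0) carries 1 bit; a completely noisy one (eps = 0.5) carries 0.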

Why does this matter for AI? The information bottleneck theory of deep learning (Tishby et al., 2000) views each layer of a neural network as a communication channel. Information about the input $X$ flows through successive layers, and each layer can discard irrelevant information while preserving task-relevant information. The optimal representation at any layer maximizes $I(\text{representation}; Y)$ (informativeness about the label) while minimizing $I(\text{representation}; X)$ (complexity). This trade-off is precisely the information bottleneck objective, which we will explore in Chapter 16.

4.5.6 Information-Theoretic Relationships

The relationships between entropy, conditional entropy, joint entropy, and mutual information are elegantly summarized by an information diagram (Venn diagram):

    ┌───────────── H(X,Y) ─────────────┐
    │                                  │
    │  ┌─── H(X) ───┐  ┌─── H(Y) ───┐  │
    │  │            │  │            │  │
    │  │  H(X|Y)    │  │   H(Y|X)   │  │
    │  │        ┌───┴──┴───┐        │  │
    │  │        │  I(X;Y)  │        │  │
    │  │        └───┬──┬───┘        │  │
    │  │            │  │            │  │
    │  └────────────┘  └────────────┘  │
    └──────────────────────────────────┘

These relationships give us:

  • $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
  • $I(X; Y) = H(X) + H(Y) - H(X, Y)$
  • $H(X \mid Y) = H(X) - I(X; Y)$

4.5.7 The Data Processing Inequality

An important consequence of information theory for AI is the data processing inequality: for any Markov chain $X \to Y \to Z$ (meaning $Z$ depends on $X$ only through $Y$):

$$ I(X; Z) \leq I(X; Y) $$

This tells us that processing data cannot create information -- every transformation can only lose or preserve information. This principle constrains what neural networks can achieve: if early layers discard information about the input, later layers cannot recover it. The information bottleneck theory of deep learning builds directly on this idea.

Practical consequence: This is why data preprocessing matters so much. If you discard a feature during preprocessing, no amount of model complexity can recover the information it contained. It is also why lossy compression of training images (e.g., aggressive JPEG compression) can hurt model accuracy---the compression step irreversibly destroys information that might have been useful for the prediction task.

4.5.9 KL Divergence and Cross-Entropy in Practice

Let us solidify the connection between KL divergence, cross-entropy, and model training with a concrete example.

Worked Example: Training a Classifier as KL Minimization.

Suppose we have a simple 3-class classification problem. The true distribution over classes for a particular input is $p = [0.7, 0.2, 0.1]$. Two models make different predictions:

import numpy as np

p = np.array([0.7, 0.2, 0.1])  # True distribution
q1 = np.array([0.6, 0.3, 0.1])  # Model 1 (decent)
q2 = np.array([0.3, 0.4, 0.3])  # Model 2 (poor)

def cross_entropy(p, q):
    return -np.sum(p * np.log(q + 1e-12))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / (q + 1e-12)))

h_p = -np.sum(p * np.log(p))  # Entropy of true distribution

print(f"H(p) = {h_p:.4f} (entropy of true distribution)")
print(f"\nModel 1:")
print(f"  H(p, q1) = {cross_entropy(p, q1):.4f} (cross-entropy)")
print(f"  D_KL(p||q1) = {kl_divergence(p, q1):.4f} (KL divergence)")
print(f"  H(p, q1) = H(p) + D_KL: {h_p + kl_divergence(p, q1):.4f}")
print(f"\nModel 2:")
print(f"  H(p, q2) = {cross_entropy(p, q2):.4f} (cross-entropy)")
print(f"  D_KL(p||q2) = {kl_divergence(p, q2):.4f} (KL divergence)")

We see that Model 1 has lower cross-entropy and lower KL divergence---it is a better approximation of the true distribution. Since $H(p)$ is fixed (it depends only on the true distribution), minimizing cross-entropy $H(p, q)$ is equivalent to minimizing KL divergence $D_{\text{KL}}(p \| q)$. This is exactly what happens when we train a neural network classifier with cross-entropy loss.


4.6 Putting It All Together: Probability in AI Systems

Having established the theoretical foundations, let us now see how these concepts weave together in the context of AI systems. This section serves as a bridge to the machine learning chapters that follow.

4.6.1 The Probabilistic View of Machine Learning

Nearly every machine learning problem can be framed in probabilistic terms:

  • Supervised learning seeks to model $p(y \mid x; \theta)$ -- the conditional distribution of labels given inputs
  • Unsupervised learning seeks to model $p(x; \theta)$ -- the distribution of the data itself
  • Generative models learn $p(x)$ or $p(x \mid z)$ to generate new data
  • Reinforcement learning deals with $p(s' \mid s, a)$ -- the transition dynamics of environments

Training in each case involves choosing parameters $\theta$ to optimize a probabilistic objective -- typically maximizing a (log-)likelihood or minimizing a KL divergence.

4.6.2 Loss Functions as Information-Theoretic Quantities

The most common loss functions in deep learning have information-theoretic interpretations:

| Loss Function | Formula | Information-Theoretic Interpretation |
|---|---|---|
| Cross-entropy | $-\sum_k y_k \log \hat{y}_k$ | Cross-entropy between true and predicted distributions |
| Binary CE | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Binary cross-entropy |
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | MLE under a Gaussian noise assumption |
| KL loss (VAEs) | $D_{\text{KL}}(q(z \mid x) \Vert p(z))$ | Regularizer in the variational lower bound |

When you train a neural network to minimize cross-entropy loss, you are simultaneously:

  1. Performing maximum likelihood estimation
  2. Minimizing the KL divergence between the true data distribution and your model
  3. Finding the model whose "code" most efficiently compresses the training data

4.6.3 From Softmax to Probabilities

The softmax function converts a vector of real-valued scores (logits) $z_1, \ldots, z_K$ into a valid probability distribution:

$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$

Properties:

  • Outputs are in $(0, 1)$ and sum to 1
  • Preserves the ordering of the logits
  • A temperature parameter $T$ controls the "sharpness": $\text{softmax}(z_i / T)$. As $T \to 0$, the distribution concentrates on the maximum; as $T \to \infty$, it approaches uniform.

The softmax function is the natural link between the raw outputs of a neural network and the probabilistic framework. We first encountered it implicitly in Chapter 3 when discussing optimization, and it will be a constant companion in the classification chapters ahead (Chapter 6), transformer architectures (Chapter 19), and language model decoding (Chapter 21).
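
A minimal implementation sketch (the function name and example logits are ours) shows both the basic transformation and the effect of temperature.

import numpy as np

def softmax(z, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                 # Subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in [0.5, 1.0, 5.0]:
    print(f"T = {T:3.1f}: {softmax(logits, T).round(3)}")
# Lower temperature sharpens the distribution; higher temperature flattens it.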

4.6.4 Calibration: When Probabilities Should Mean What They Say

A classifier's predicted probabilities should ideally match the true frequencies of correct predictions. When a model says "90% confident," it should be correct 90% of the time. A model with this property is called well-calibrated.

Formally, a model is perfectly calibrated if:

$$ P(Y = y \mid \hat{p}(Y = y) = q) = q \quad \text{for all } q \in [0, 1] $$

In practice, modern neural networks tend to be overconfident---they assign high probabilities to their predictions more often than is warranted. Temperature scaling (dividing logits by a learned temperature $T > 1$ before softmax) is the simplest and most effective post-hoc calibration method.

import numpy as np

# Simulate an overconfident model
np.random.seed(42)
n = 1000
true_labels = np.random.binomial(1, 0.5, n)

# Model predicts with overconfident probabilities
raw_probs = np.where(true_labels == 1,
                      np.random.beta(5, 1, n),   # High prob when correct
                      np.random.beta(1, 5, n))    # Low prob when incorrect

# Check calibration: bin predictions and compare to actual frequency
n_bins = 10
bin_edges = np.linspace(0, 1, n_bins + 1)
for i in range(n_bins):
    mask = (raw_probs >= bin_edges[i]) & (raw_probs < bin_edges[i + 1])
    if mask.sum() > 0:
        predicted = raw_probs[mask].mean()
        actual = true_labels[mask].mean()
        print(f"Bin [{bin_edges[i]:.1f}, {bin_edges[i+1]:.1f}): "
              f"predicted={predicted:.3f}, actual={actual:.3f}, "
              f"n={mask.sum()}")

Calibration is essential in high-stakes applications like medical diagnosis and autonomous driving, where downstream decisions depend on the magnitude of the predicted probability, not just the ranking. We will revisit calibration in Chapter 13 when we discuss model evaluation.

4.6.5 The Reparameterization Trick (Preview)

In variational autoencoders (VAEs) and other generative models, we need to take gradients through sampling operations. The reparameterization trick rewrites a sample from $\mathcal{N}(\mu, \sigma^2)$ as:

$$ x = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) $$

This moves the randomness into a fixed distribution $\epsilon$, making the sampling operation differentiable with respect to $\mu$ and $\sigma$. It combines the probability theory from this chapter with the automatic differentiation from Chapter 3. We will explore this technique fully in Chapter 16.
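
A minimal sketch of the trick, with invented values for $\mu$ and $\sigma$: the randomness comes from a fixed standard normal, and the sample depends on $\mu$ and $\sigma$ only through a differentiable affine transformation.

import numpy as np

# Reparameterized sampling from N(mu, sigma^2) (illustrative values)
np.random.seed(42)
mu, sigma = 1.5, 0.4

eps = np.random.randn(10_000)   # Randomness lives in a fixed N(0, 1)
samples = mu + sigma * eps      # Deterministic, differentiable in mu and sigma

print(f"Sample mean: {samples.mean():.3f} (target {mu})")
print(f"Sample std:  {samples.std():.3f} (target {sigma})")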

4.6.6 Practical Numerical Considerations

Working with probabilities on a computer requires care to avoid numerical issues:

Log-space computation. Products of many small probabilities underflow to zero. Always work in log-space:

import numpy as np

# BAD: Underflows for large n
probs = np.array([0.001] * 1000)
product = np.prod(probs)  # 0.0 (underflow!)

# GOOD: Stable in log-space
log_probs = np.log(probs)
log_product = np.sum(log_probs)  # -6907.8 (correct)

The log-sum-exp trick. Computing $\log \sum_i \exp(a_i)$ naively can overflow. Instead:

$$ \log \sum_i \exp(a_i) = a_{\max} + \log \sum_i \exp(a_i - a_{\max}) $$

def log_sum_exp(a: np.ndarray) -> float:
    """Compute log(sum(exp(a))) in a numerically stable way.

    Args:
        a: Array of log values.

    Returns:
        The log of the sum of the exponentials.
    """
    a_max = np.max(a)
    return a_max + np.log(np.sum(np.exp(a - a_max)))

NumPy provides np.logaddexp for pairs and scipy.special.logsumexp for arrays, but understanding the underlying trick is valuable for debugging numerical issues in your models.


4.7 Chapter Summary

This chapter has built the probabilistic, statistical, and information-theoretic foundations that permeate every aspect of modern AI:

Key Concepts

  1. Probability axioms provide a rigorous foundation for reasoning under uncertainty. The sum rule, product rule, and Bayes' theorem are the basic tools.
  2. Common distributions -- Bernoulli, Categorical, Gaussian, and others -- are the building blocks for modeling data in AI systems.
  3. Expectation and variance summarize distributions and appear in loss functions, evaluation metrics, and optimization.
  4. Maximum likelihood estimation finds parameters that maximize the probability of observed data and is equivalent to minimizing cross-entropy loss.
  5. MAP estimation incorporates prior knowledge and corresponds to regularized MLE.
  6. Entropy quantifies uncertainty and sets the theoretical limit for data compression.
  7. Cross-entropy is the standard classification loss and measures the cost of using an imperfect model.
  8. KL divergence measures the "distance" between distributions and connects training objectives to information-theoretic optimality.
  9. Mutual information captures all statistical dependencies between variables and underpins feature selection and representation learning.

Key Formulas

| Name | Formula | When to Use |
|---|---|---|
| Bayes' theorem | $P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)}$ | Inference, classification |
| Gaussian PDF | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ | Continuous modeling |
| MLE | $\hat{\theta} = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$ | Parameter estimation |
| MAP | $\hat{\theta} = \arg\max_\theta [\sum_i \log p(x_i \mid \theta) + \log p(\theta)]$ | Regularized estimation |
| Entropy | $H(X) = -\sum_x p(x) \log p(x)$ | Measure uncertainty |
| Cross-entropy | $H(p, q) = -\sum_x p(x) \log q(x)$ | Classification loss |
| KL divergence | $D_{\text{KL}}(p \Vert q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ | Distribution comparison |
| Mutual information | $I(X;Y) = H(X) - H(X \mid Y)$ | Feature selection, dependence |

Key Code Patterns

import numpy as np

# Pattern 1: Stable log-probability computation
probs = np.array([0.2, 0.5, 0.3])   # Example distribution
log_probs = np.log(probs + 1e-12)   # Add epsilon to avoid log(0)
total_log_prob = np.sum(log_probs)

# Pattern 2: Cross-entropy loss
def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    return -np.sum(p * np.log(q + 1e-12))

# Pattern 3: KL divergence
def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    return np.sum(p * np.log(p / (q + 1e-12) + 1e-12))

Decision Framework

Which estimation method should you use?

├── Do you have prior knowledge about parameters?
│   ├── Yes → Consider MAP estimation
│   │   ├── Want L2 regularization? → Gaussian prior
│   │   └── Want L1 regularization? → Laplace prior
│   └── No → Use MLE
├── Do you want uncertainty in parameter estimates?
│   ├── Yes → Full Bayesian inference (Chapter 10)
│   └── No → MLE or MAP (point estimate)
├── Which loss function should you use?
│   ├── Classification → Cross-entropy loss (= negative log-likelihood)
│   ├── Regression → MSE (= MLE under Gaussian noise)
│   └── Generative model → ELBO with KL term (Chapter 16)

Looking Ahead

The probability and information theory from this chapter will be constant companions as we proceed. In Chapter 5, we will strengthen our Python skills for implementing these ideas efficiently. In Part II, every supervised and unsupervised learning algorithm will be framed as a probabilistic model trained by MLE or MAP. In Part III, the cross-entropy loss will be the default training objective for neural networks. And in Part IV, language models will be understood as autoregressive models computing $p(x_t \mid x_1, \ldots, x_{t-1})$ -- a direct application of the chain rule of probability.

The mathematical toolkit from Part I is now nearly complete. With linear algebra, calculus, and probability in hand, you are ready to tackle the algorithms and architectures that make modern AI work.