39 min read

> "The theory of probabilities is at bottom nothing but common sense reduced to calculus."

Prerequisites

  • 16

Learning Objectives

  • Model an experiment as a probability space — a sample space of outcomes with a probability assignment — and identify events as subsets.
  • Compute the probability of an event under the equally-likely (Laplace) model by counting favorable and total outcomes, and apply the basic probability axioms and their consequences.
  • Define a random variable as a function on the sample space and read off its distribution.
  • Compute expected value, and use linearity of expectation to find expectations that would be hopeless by brute force.
  • Compute variance and standard deviation, and explain what they measure that expectation does not.
  • Apply the probabilistic method to prove that an object exists by showing a random object has the desired property with positive probability — and verify probabilistic claims by Monte-Carlo simulation.

Chapter 20: Discrete Probability

"The theory of probabilities is at bottom nothing but common sense reduced to calculus." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)

Overview

Your code is full of randomness whether you invited it or not. A hash table spreads keys across buckets and you want to know how full the fullest bucket gets. A load balancer sprays requests across servers and you want to know whether one server is about to fall over. Quicksort picks a random pivot and you want to know whether it will run in $n\log n$ time or blow up to $n^2$. A randomized algorithm flips coins internally and you want to know whether its answer is probably right — and how probable "probably" is. A bloom filter, a skip list, a Monte-Carlo integration, a retry-with-backoff loop, a machine-learning model trained on a random sample: every one of these is a program whose behavior you can only describe with probability.

Chapter 16 taught you to count outcomes. Probability is what you get when you take those counts and ask not "how many?" but "how likely?" The bridge is almost embarrassingly direct: when every outcome is equally likely, the probability of an event is just the count of favorable outcomes divided by the count of all outcomes — two numbers you already know how to compute. From that single idea we will build the machinery that working computer scientists actually use: random variables (numbers that depend on chance), expected value (the long-run average, and the most useful single number in all of applied probability), variance (how wildly the value swings around that average), and the probabilistic method — a startling proof technique that establishes an object exists by showing a random object has the property you want with probability greater than zero.

That last technique is worth pausing on, because it captures the spirit of this entire book. It uses probability to prove a statement that has nothing to do with chance — a purely combinatorial fact about, say, the existence of a good error-correcting code or a balanced graph coloring. It is a proof, so it guarantees the object exists, yet it never constructs one. That is theme four of this book — computation and proof are complementary — turned all the way up: we will use simulation to estimate probabilities on a million trials (cheap, approximate, fallible) and proof to settle them exactly (expensive, certain), and you will learn when each is the right tool.

In this chapter, you will learn to:

  • Build a probability space — a sample space of outcomes with probabilities that sum to 1 — and treat events as subsets you can union, intersect, and complement.
  • Compute event probabilities by counting (the equally-likely model) and by applying the probability axioms and their consequences (complement, union, monotonicity).
  • Define a random variable as a function from outcomes to numbers, and tabulate its distribution.
  • Compute expected value, and wield linearity of expectation — the most powerful labor-saving device in the subject — to compute averages that direct enumeration could never reach.
  • Compute variance and standard deviation, and say precisely what they measure that the mean does not.
  • Use the probabilistic method to prove existence, and use Monte-Carlo simulation to check a probability empirically before (or instead of) computing it exactly.

Learning Paths

🏃 Fast Track: If basic probability is review, skim 20.1–20.2 for our notation, then go straight to 20.4 (expected value and linearity — the section with the most leverage for CS) and 20.6 (the probabilistic method and simulation, almost certainly new). Do the ⭐⭐ and ⭐⭐⭐ exercises.

📖 Standard Path: Read in order. The conceptual hinge is 20.3–20.4: a random variable is a function, and its expectation is a weighted average. Work every 🔄 Check Your Understanding before moving on, and try each worked example's setup yourself before reading the resolution.

🔬 Deep Dive: After the chapter, prove linearity of expectation for the general (non-independent) case from the definition, then use it to derive the expected number of fixed points of a random permutation and the expected length of the longest run in $n$ coin flips. Exercise set, Part E.


20.1 Sample spaces and events

Start with the simplest random thing there is: a single roll of a fair six-sided die. Before you roll, you cannot say what will happen, but you can write down everything that could happen — the set of possible outcomes is $\{1, 2, 3, 4, 5, 6\}$. That set, the exhaustive list of mutually exclusive things that could occur, is the foundation everything else is built on.

Definition (sample space, outcome). The sample space of an experiment is the set $S$ of all possible outcomes, where the outcomes are mutually exclusive (exactly one occurs) and exhaustive (at least one occurs). In this chapter $S$ is always finite or countable — hence discrete probability.

For one die roll, $S = \{1,2,3,4,5,6\}$, with $\lvert S \rvert = 6$. For a single coin flip, $S = \{H, T\}$. For two coin flips, an outcome must record both flips in order, so $S = \{HH, HT, TH, TT\}$ — this is exactly the Cartesian product $\{H,T\} \times \{H,T\}$ from Chapter 8, and counting it is the product rule from Chapter 15: $2 \times 2 = 4$ outcomes. The recurring discipline of this chapter is choose the sample space deliberately, because almost every probability mistake is really a sample-space mistake.

Usually we do not care about a single outcome but about a collection of them: "the die shows an even number," "at least one of the two coins is heads," "the request hashed to bucket 3." Each such statement picks out a subset of $S$.

Definition (event). An event is a subset $E \subseteq S$ of the sample space. We say the event $E$ occurs when the experiment's outcome is an element of $E$. The whole sample space $S$ is the certain event; the empty set $\emptyset$ is the impossible event.

Because events are sets, all of the set operations from Chapter 8 are operations on events, and they mean exactly what English says:

Set operation As an event Reads as
$E \cup F$ union "$E$ or $F$ occurs"
$E \cap F$ intersection "$E$ and $F$ both occur"
$\overline{E} = S \setminus E$ complement "$E$ does not occur"
$E \cap F = \emptyset$ disjoint events $E$ and $F$ are mutually exclusive

🔗 Connection. This is not an analogy — events are sets, and the logic of "and / or / not" from Chapter 1 is the algebra of $\cap, \cup, \overline{\phantom{E}}$ from Chapter 8. When you later read if hashed_bucket == 3 and not table_is_full, you are naming an event $E \cap \overline{F}$ over the sample space of possible program states. Probability is just a measure laid on top of set theory you already know.

For the die, the event "even" is $E = \{2,4,6\}$, the event "greater than 4" is $F = \{5,6\}$, "even or greater than 4" is $E \cup F = \{2,4,5,6\}$, and "even and greater than 4" is $E \cap F = \{6\}$. The complement of "even" is "odd," $\overline{E} = \{1,3,5\}$. Nothing here is new mathematics — it is Chapter 8 wearing a new hat — but getting fluent at translating an English question into the right subset is the whole game.

🔄 Check Your Understanding 1. You draw one card from a standard 52-card deck. Write the sample space's size, and the event "the card is a heart" as a set size. 2. For the two-coin experiment $S = \{HH, HT, TH, TT\}$, write the event "exactly one head" and the event "at least one head" as subsets. 3. Express "neither $E$ nor $F$ occurs" using set operations on $E$ and $F$.

Answers

  1. $\lvert S \rvert = 52$; "heart" is a 13-element event. 2. "Exactly one head" $= \{HT, TH\}$; "at least one head" $= \{HH, HT, TH\}$ (everything except $TT$). 3. "Neither" $= \overline{E} \cap \overline{F} = \overline{E \cup F}$ — De Morgan's law from Chapter 1, now an identity about events.

20.2 Probability axioms and the equally-likely model

We have outcomes and events; now we assign numbers. A probability function $P$ gives each event a number in $[0,1]$ measuring how likely it is, subject to three rules that any sensible notion of "chance" must obey. These are the axioms Kolmogorov distilled in 1933, and everything in probability is a logical consequence of them.

The probability axioms. A probability function on a sample space $S$ assigns to each event $E \subseteq S$ a real number $P(E)$ such that

  1. Non-negativity: $P(E) \ge 0$ for every event $E$.
  2. Normalization: $P(S) = 1$ (some outcome certainly occurs).
  3. Additivity: if $E$ and $F$ are disjoint ($E \cap F = \emptyset$), then $P(E \cup F) = P(E) + P(F)$. (More generally this holds for any countable collection of pairwise disjoint events.)

For a discrete sample space, a probability function is determined by what it does to the individual outcomes: assign each outcome $s \in S$ a weight $p(s) \ge 0$ with $\sum_{s \in S} p(s) = 1$, and then the probability of any event is the sum of the weights of its outcomes: $$P(E) = \sum_{s \in E} p(s).$$ The most important special case — and the one where Chapter 16's counting pays off immediately — is when every outcome is equally likely.

Definition (equally-likely / Laplace model). When a finite sample space $S$ has all outcomes equally likely, each outcome has weight $1/\lvert S \rvert$, and for every event $E$, $$P(E) = \frac{\lvert E \rvert}{\lvert S \rvert} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}.$$

This is the formula your intuition already uses, and it converts every equally-likely probability question into two counting problems — exactly the skills of Chapters 15–16. Probability of rolling an even number: favorable $= \lvert \{2,4,6\} \rvert = 3$, total $= 6$, so $P = 3/6 = 1/2$. Probability that a random 5-card poker hand is a flush (all one suit): favorable $= 4\binom{13}{5}$ (choose the suit, then 5 of its 13 cards), total $= \binom{52}{5} = 2{,}598{,}960$ (Chapter 16), giving $P = \frac{4 \cdot 1287}{2598960} = \frac{5148}{2598960} \approx 0.00198$.

A few consequences of the axioms get used so often they deserve names; each follows from the three axioms in a line or two.

Consequences of the axioms. For events $E, F$ in a sample space $S$: - Complement rule: $P(\overline{E}) = 1 - P(E)$. - Impossible event: $P(\emptyset) = 0$. - Monotonicity: if $E \subseteq F$ then $P(E) \le P(F)$. - Inclusion–exclusion (general union): $P(E \cup F) = P(E) + P(F) - P(E \cap F)$.

Let us prove the two that matter most. The complement rule is the single most useful probability identity in practice — whenever an event is awkward to count directly ("at least one…"), its complement is usually easy.

The strategy first. $E$ and $\overline{E}$ are disjoint by construction, and together they make up all of $S$. So additivity (axiom 3) splits $P(S)$ into the two pieces, and normalization (axiom 2) says that total is 1. Solve for $P(\overline{E})$.

Proof (complement rule). The events $E$ and $\overline{E}$ are disjoint and their union is $S$. By additivity, $P(E) + P(\overline{E}) = P(E \cup \overline{E}) = P(S)$. By normalization $P(S) = 1$, so $P(E) + P(\overline{E}) = 1$, hence $P(\overline{E}) = 1 - P(E)$. $\blacksquare$

Now the union formula, which is just Chapter 15's inclusion–exclusion lifted from counts to probabilities.

The strategy first. Additivity only applies to disjoint events, but $E$ and $F$ may overlap. The fix is to carve $E \cup F$ into disjoint pieces, add those with axiom 3, and notice that adding $P(E)$ and $P(F)$ naively double-counts the overlap $E \cap F$ exactly once.

Proof (inclusion–exclusion). Write $E \cup F$ as the disjoint union of three pieces: $E \setminus F$, $F \setminus E$, and $E \cap F$. By additivity, $$P(E \cup F) = P(E \setminus F) + P(F \setminus E) + P(E \cap F).$$ Separately, $E$ is the disjoint union of $E \setminus F$ and $E \cap F$, so $P(E) = P(E \setminus F) + P(E \cap F)$; likewise $P(F) = P(F \setminus E) + P(E \cap F)$. Adding these two, $P(E) + P(F) = P(E\setminus F) + P(F \setminus E) + 2\,P(E \cap F)$. Subtracting one copy of $P(E \cap F)$ recovers the previous display, so $P(E \cup F) = P(E) + P(F) - P(E \cap F)$. $\blacksquare$

Here is the workhorse example for the complement rule, and the first place this chapter touches a real program.

Worked example (the birthday surprise). In a room of $n$ people, what is the probability that at least two share a birthday? Assume 365 equally likely birthdays and independent people. "At least two share" is a messy event — two might match, or three, or two separate pairs. Its complement, "all $n$ birthdays are distinct," is clean. Order the people and count: the sample space of birthday-tuples has $365^n$ outcomes (product rule), and the favorable-to-the-complement outcomes — all distinct — number $365 \cdot 364 \cdots (365 - n + 1) = P(365, n)$ (a falling product, Chapter 16). So $$P(\text{some shared birthday}) = 1 - \frac{P(365, n)}{365^n} = 1 - \frac{365 \cdot 364 \cdots (365-n+1)}{365^n}.$$ The shock is how fast this climbs: it crosses $1/2$ at just $n = 23$ and exceeds $0.999$ by $n = 70$.

def birthday_collision_prob(n, days=365):
    """P(at least two of n people share a birthday), via the complement."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days   # k people already placed
    return 1 - p_all_distinct

print(round(birthday_collision_prob(23), 4))
print(round(birthday_collision_prob(50), 4))
# Expected output:
# 0.5073
# 0.9704

🔗 Connection. This is not a party trick — it is the mathematics of hash collisions. With a hash table of $m$ buckets and $n$ keys hashed uniformly, "two keys collide" is the birthday problem with $\text{days} = m$. The same falling product says collisions become likely once $n$ grows to about $\sqrt{m}$ — far sooner than the table is full. Chapter 17's pigeonhole principle tells you collisions are guaranteed once $n > m$; the birthday calculation tells you they are probable much earlier. We return to this when we build hash functions in Chapter 26.

⚠️ Common Pitfall: the equally-likely assumption is an assumption. The formula $P(E) = \lvert E\rvert / \lvert S\rvert$ holds only when outcomes are equally likely. The classic trap: for two coin flips, a student declares the sample space to be $\{$ two heads, one head, no heads $\}$ and concludes $P(\text{two heads}) = 1/3$. But those three outcomes are not equally likely. The correct equally-likely space is $\{HH, HT, TH, TT\}$, in which "exactly one head" is $\{HT, TH\}$, so $P(\text{one head}) = 2/4 = 1/2$ and $P(\text{two heads}) = 1/4$. Always build a sample space whose outcomes are genuinely equally likely before you divide.

🔄 Check Your Understanding 1. State the complement rule and use it to find $P(\text{at least one head})$ in three coin flips. 2. A fair die is rolled. Using inclusion–exclusion, find $P(\text{even or} > 4)$ from $P(\text{even}) = 3/6$, $P(>4) = 2/6$, and $P(\text{even and} >4) = 1/6$. 3. Why is $P(\emptyset) = 0$ forced by the axioms? (One line.)

Answers

  1. $P(\overline{E}) = 1 - P(E)$. "At least one head" is the complement of "all tails," and all-tails has probability $1/8$, so the answer is $1 - 1/8 = 7/8$. 2. $P = 3/6 + 2/6 - 1/6 = 4/6 = 2/3$, matching $\lvert\{2,4,5,6\}\rvert/6$. 3. $S$ and $\emptyset$ are disjoint with union $S$, so $P(S) = P(S) + P(\emptyset)$; subtracting $P(S)=1$ gives $P(\emptyset) = 0$.

20.3 Random variables

Often the outcome itself is not the number you care about — you care about some quantity computed from the outcome. Roll two dice; you usually want the sum, not the pair. Run a hash insert; you want the bucket's resulting load, not the full table state. Flip a coin $n$ times; you want the number of heads, not the exact head/tail string. Each of these is a rule that turns an outcome into a number — and "a rule that turns an input into an output" is, from Chapter 9, exactly a function.

Definition (random variable). A random variable on a sample space $S$ is a function $X \colon S \to \mathbb{R}$ that assigns a real number to each outcome. We write $X = a$ for the event $\{s \in S : X(s) = a\}$ — the set of outcomes that $X$ sends to $a$ — and $P(X = a)$ for its probability. The list of values $X$ can take together with their probabilities is the distribution of $X$.

A random variable is not random and not a variable — it is a deterministic function. The randomness lives in which outcome occurs; $X$ merely reads a number off that outcome. The name is historical, and unfortunate, but universal.

💡 Intuition: Think of $X$ as a sensor attached to the experiment. The experiment produces an outcome; the sensor displays a number. $P(X = a)$ asks: across all the equally-uncertain outcomes, what fraction make the sensor read $a$? Because $\{X = a\}$ is just an event (a subset of $S$), computing $P(X = a)$ is the counting you already do — group the outcomes by the value $X$ assigns them.

Worked example (sum of two dice). Let $S = \{1,\dots,6\}^2$, all $36$ equally likely ordered pairs, and let $X$ be the sum of the two dice, $X((i,j)) = i + j$. Then $X$ takes values $2$ through $12$. The event $X = 5$ is $\{(1,4),(2,3),(3,2),(4,1)\}$, four outcomes, so $P(X=5) = 4/36 = 1/9$. Tabulating $P(X = a)$ for every $a$ by counting the pairs on each anti-diagonal gives the familiar triangular distribution:

$a$ 2 3 4 5 6 7 8 9 10 11 12
$P(X=a)$ $\tfrac{1}{36}$ $\tfrac{2}{36}$ $\tfrac{3}{36}$ $\tfrac{4}{36}$ $\tfrac{5}{36}$ $\tfrac{6}{36}$ $\tfrac{5}{36}$ $\tfrac{4}{36}$ $\tfrac{3}{36}$ $\tfrac{2}{36}$ $\tfrac{1}{36}$

The probabilities sum to $\frac{1+2+3+4+5+6+5+4+3+2+1}{36} = \frac{36}{36} = 1$, as they must — the events $\{X = a\}$ for different $a$ partition $S$, so by additivity their probabilities sum to $P(S)=1$. That "they must sum to 1" check is the first thing to verify about any distribution you write down.

One random variable deserves its own name because it appears everywhere, especially in CS: the indicator of an event.

Definition (indicator random variable). For an event $A$, the indicator random variable $\mathbf{1}_A$ (also written $I_A$) is $$\mathbf{1}_A(s) = \begin{cases} 1 & \text{if } s \in A,\\ 0 & \text{if } s \notin A.\end{cases}$$ It "switches on" exactly when $A$ occurs.

Indicators look trivial, but they are the secret behind the most powerful technique in the next section. Here they are in code, computing the distribution of the dice sum directly from the function $X$:

from itertools import product
from fractions import Fraction
from collections import Counter

space = list(product(range(1, 7), repeat=2))   # 36 equally likely (i, j)
X = lambda outcome: outcome[0] + outcome[1]     # the random variable "sum"

counts = Counter(X(s) for s in space)           # group outcomes by X-value
dist = {a: Fraction(c, len(space)) for a, c in sorted(counts.items())}
print(dist[5], dist[7])                          # P(X=5), P(X=7)
print(sum(dist.values()))                        # must be 1
# Expected output:
# 1/9 1/6
# 1

The code mirrors the definition exactly: build the equally-likely space, apply the function $X$ to every outcome, and group by value. Counter does the grouping; Fraction keeps the probabilities exact so we can see that they sum to 1.

🔄 Check Your Understanding 1. For the two-dice sum $X$, compute $P(X = 7)$ and $P(X \ge 11)$ by counting outcomes. 2. Let $Y$ be the number of heads in two coin flips, on $S = \{HH,HT,TH,TT\}$. Write the distribution of $Y$. 3. If $A$ is the event "the die shows an even number," what are the possible values of $\mathbf{1}_A$, and what is $P(\mathbf{1}_A = 1)$?

Answers

  1. $P(X=7) = 6/36 = 1/6$ (the six pairs summing to 7); $P(X \ge 11) = P(X=11)+P(X=12) = 2/36 + 1/36 = 3/36 = 1/12$. 2. $Y$ takes $0,1,2$ with $P(Y=0)=1/4$, $P(Y=1)=2/4=1/2$, $P(Y=2)=1/4$. 3. $\mathbf{1}_A$ takes the values $0$ and $1$; $P(\mathbf{1}_A = 1) = P(A) = 3/6 = 1/2$.

20.4 Expected value and linearity

If you could summarize a random variable with a single number, what would it be? For most purposes the answer is its long-run average — the value you would converge to if you repeated the experiment forever and averaged $X$. That number is the expected value, and it is the most-used quantity in applied probability, statistics, algorithm analysis, and machine learning.

Definition (expected value). The expected value (or expectation, or mean) of a random variable $X$ on a sample space $S$ is the probability-weighted average of its values: $$E[X] = \sum_{s \in S} X(s)\, p(s) = \sum_{a} a \cdot P(X = a),$$ where the second sum runs over the distinct values $a$ that $X$ takes. (The two sums are equal: group the outcome-by-outcome sum by the value $X$ assigns.)

For one fair die, $X$ is the face shown, each value $1,\dots,6$ has probability $1/6$, so $$E[X] = 1\cdot\tfrac16 + 2\cdot\tfrac16 + \cdots + 6\cdot\tfrac16 = \frac{1+2+\cdots+6}{6} = \frac{21}{6} = 3.5.$$ Note the expected value $3.5$ is not a value the die can ever show — expectation is an average, not a prediction of a single roll. For the two-dice sum, weighting each value $2,\dots,12$ by its probability from the table in 20.3 gives $E[X] = 7$ (the symmetric distribution balances at its center).

💡 Intuition: $E[X]$ is the balance point of the distribution. Picture the probabilities as weights placed on a number line at the positions $X$ can take; $E[X]$ is the point where the line balances. For a symmetric distribution it sits at the center of symmetry — which is why the dice sum, symmetric about 7, has expectation exactly 7 without any arithmetic.

A small but constantly used fact: the expectation of an indicator is just the probability of its event.

Indicator expectation. For any event $A$, $\;E[\mathbf{1}_A] = P(A).$

Proof. $\mathbf{1}_A$ takes the value $1$ with probability $P(A)$ and $0$ with probability $1 - P(A)$, so $E[\mathbf{1}_A] = 1 \cdot P(A) + 0 \cdot (1 - P(A)) = P(A)$. $\blacksquare$

This one-liner is the hinge of the whole subject, because of the theorem it unlocks.

Linearity of expectation

Here is the result that makes expectation worth its weight in gold. Expectation respects addition and scaling — always, with no independence assumption whatsoever.

Theorem (linearity of expectation). For random variables $X$ and $Y$ on the same sample space and any constants $a, b \in \mathbb{R}$, $$E[aX + bY] = a\,E[X] + b\,E[Y].$$ More generally, for any random variables $X_1, \dots, X_n$ (independent or not), $$E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n].$$

The strategy first. Go back to the outcome-by-outcome definition of expectation, $E[Z] = \sum_{s} Z(s)\,p(s)$. The function $(aX + bY)$ evaluated at an outcome $s$ is just $aX(s) + bY(s)$ — ordinary addition of numbers. So the claim reduces to the fact that a sum can be split apart and constants pulled out, which is the algebra of summation from Chapter 11. There is no probability subtlety at all; that is exactly why it needs no independence.

Proof. Let $Z = aX + bY$, so $Z(s) = aX(s) + bY(s)$ for every outcome $s$. Then $$E[Z] = \sum_{s \in S} Z(s)\,p(s) = \sum_{s \in S} \bigl(aX(s) + bY(s)\bigr)p(s).$$ Distribute and split the sum (Chapter 11's summation rules), then factor the constants out of each piece: $$= \sum_{s \in S} aX(s)p(s) + \sum_{s \in S} bY(s)p(s) = a\sum_{s \in S} X(s)p(s) + b\sum_{s \in S} Y(s)p(s) = a\,E[X] + b\,E[Y].$$ The general $n$-variable statement follows by induction on $n$ (Chapter 6), using this two-variable case as the inductive step. $\blacksquare$

🚪 Threshold Concept: linearity holds even when the variables are tangled together. The shocking part is the "independent or not." Most probability identities require independence; linearity does not. This is what makes it a superpower: to find the expected value of a complicated quantity, break it into a sum of simple indicator variables, take the expectation of each one separately (each is just a probability, by the indicator lemma), and add. You never have to understand how the pieces interact — and they usually interact in horrible ways. Once this clicks, a whole class of "impossible" expected values becomes a few lines of arithmetic. This is the single most important technique in the chapter; internalize the pattern "write it as a sum of indicators, then add up the probabilities."

Watch the technique demolish a problem that is genuinely painful by brute force.

Worked example (expected number of fixed points / the hat-check problem). $n$ people check their hats; the hats are returned in a uniformly random order (a random permutation). What is the expected number of people who get their own hat back? Computing the full distribution of "number of correct returns" is a famous mess (it involves derangements, Chapter 17). Linearity makes it trivial.

Let $X$ be the number of people who get their own hat. Define an indicator for each person $i$: $$X_i = \mathbf{1}[\text{person } i \text{ gets their own hat}], \qquad X = X_1 + X_2 + \cdots + X_n.$$ By the indicator lemma, $E[X_i] = P(\text{person } i \text{ gets their own hat})$. By symmetry, person $i$'s hat is equally likely to land in any of the $n$ positions, so this probability is $1/n$. Now apply linearity: $$E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \frac{1}{n} = n \cdot \frac{1}{n} = 1.$$ On average, exactly one person gets their own hat back — independent of $n$, whether the party has 5 people or 5 million. Notice we never needed the (complicated, highly dependent) joint behavior of the $X_i$: if person 1 gets their hat, that changes the odds for everyone else, but linearity does not care.

from itertools import permutations
from fractions import Fraction

def expected_fixed_points(n):
    """E[# of i with perm[i]==i] over all n! permutations of 0..n-1, exactly."""
    perms = list(permutations(range(n)))
    total_fixed = sum(sum(1 for i, v in enumerate(p) if v == i) for p in perms)
    return Fraction(total_fixed, len(perms))

print(expected_fixed_points(4))   # brute force over all 4! = 24 permutations
print(expected_fixed_points(6))
# Expected output:
# 1
# 1

The brute force confirms what linearity proved in two lines: the average over all $n!$ permutations is exactly $1$. The proof settles it for all $n$ at once; the code checks it for $n = 4$ and $6$ — theme four of this book in one example.

Worked example (expected number of empty buckets / hashing). Throw $n$ keys independently and uniformly into $m$ hash buckets. What is the expected number of empty buckets? Again the joint behavior is nasty, and again linearity sidesteps it. Let $X_j = \mathbf{1}[\text{bucket } j \text{ is empty}]$ and $X = \sum_{j=1}^{m} X_j$. Bucket $j$ is empty iff all $n$ keys miss it; each key misses with probability $\frac{m-1}{m}$, and the keys are independent, so $P(\text{bucket } j \text{ empty}) = \left(\frac{m-1}{m}\right)^n = \left(1 - \frac1m\right)^n$. Hence $$E[\text{empty buckets}] = \sum_{j=1}^{m} E[X_j] = m\left(1 - \frac1m\right)^n.$$ For $n = m$ (as many keys as buckets), this is $m(1 - 1/m)^m \approx m/e$ — about 37% of buckets stay empty, a fact every hash-table designer should feel in their bones.

⚠️ Common Pitfall: linearity is for sums, not products. $E[X + Y] = E[X] + E[Y]$ always, but $E[XY] = E[X]\,E[Y]$ requires $X$ and $Y$ to be independent (a notion Chapter 21 makes precise). Likewise $E[g(X)] \ne g(E[X])$ in general for a nonlinear $g$ — for instance $E[X^2] \ne (E[X])^2$, a gap that is precisely what variance measures (next section). When you reach for linearity, make sure the thing you are taking the expectation of is genuinely a sum (or a constant times a variable), not a product or a square.

🔄 Check Your Understanding 1. State linearity of expectation, and the one hypothesis it does not require. 2. A fair coin is flipped 100 times. Using indicators and linearity, find the expected number of heads. (Each flip is heads with probability $1/2$.) 3. In the hat-check problem, why is $E[X_i] = 1/n$ for every person $i$, regardless of how the $X_i$ depend on one another?

Answers

  1. $E[\sum_i X_i] = \sum_i E[X_i]$ (and $E[aX+bY]=aE[X]+bE[Y]$); it does not require the variables to be independent. 2. Let $X_k = \mathbf{1}[\text{flip } k \text{ is heads}]$; $E[X_k] = 1/2$, so the expected number of heads is $\sum_{k=1}^{100} 1/2 = 50$. 3. By symmetry person $i$'s hat is equally likely to go to any of the $n$ positions, so it lands on its owner with probability $1/n$; linearity lets us add these per-person expectations without knowing the joint distribution.

20.5 Variance and standard deviation

Expectation tells you the center of a distribution; it says nothing about the spread. Two random variables can share an expectation yet behave completely differently. Consider a server whose response time is always exactly 100 ms versus one that is 0 ms half the time and 200 ms the other half. Both average $E[X] = 100$ ms — but you would never confuse them in production: the second is a tail-latency nightmare. The number that distinguishes them is the variance.

Definition (variance, standard deviation). The variance of a random variable $X$ with mean $\mu = E[X]$ is the expected squared deviation from the mean, $$\operatorname{Var}(X) = E\bigl[(X - \mu)^2\bigr] = \sum_{a} (a - \mu)^2\, P(X = a).$$ Its non-negative square root $\sigma = \sqrt{\operatorname{Var}(X)}$ is the standard deviation.

Variance measures spread by averaging the squared distance of $X$ from its mean. Squaring serves two purposes: it makes every deviation non-negative (so positive and negative deviations do not cancel), and it punishes large deviations disproportionately (a value twice as far out contributes four times as much). The standard deviation $\sigma$ undoes the squaring to return to the original units (ms, not ms²), which is why it is the spread you usually report.

For the two servers: the constant one has $\operatorname{Var} = 0$ (no spread, ever), while the bimodal one has $\operatorname{Var} = \frac12(0-100)^2 + \frac12(200-100)^2 = \frac12(10000) + \frac12(10000) = 10000$, so $\sigma = 100$ ms. Same mean, wildly different variance — exactly the distinction we needed.

In practice the definition is awkward to compute because it needs $\mu$ before you start. A short algebraic identity fixes that.

Computational formula for variance. $\;\operatorname{Var}(X) = E[X^2] - (E[X])^2.$

The strategy first. Expand the square $(X - \mu)^2 = X^2 - 2\mu X + \mu^2$ inside the expectation, then push the expectation through the sum using linearity (20.4), remembering that $\mu$ is a constant so $E[\mu X] = \mu E[X]$ and $E[\mu^2] = \mu^2$. The middle terms collapse.

Proof. Let $\mu = E[X]$. Using linearity of expectation on the expanded square, $$\operatorname{Var}(X) = E[(X-\mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu\,E[X] + \mu^2.$$ Since $E[X] = \mu$, the last two terms are $-2\mu \cdot \mu + \mu^2 = -2\mu^2 + \mu^2 = -\mu^2$. Therefore $\operatorname{Var}(X) = E[X^2] - \mu^2 = E[X^2] - (E[X])^2$. $\blacksquare$

This is the form you actually use: compute $E[X]$ and $E[X^2]$ in one pass over the distribution, then subtract. For one fair die, $E[X] = 3.5$ and $E[X^2] = \frac{1^2 + 2^2 + \cdots + 6^2}{6} = \frac{91}{6} \approx 15.17$, so $\operatorname{Var}(X) = \frac{91}{6} - \left(\frac72\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.92$, and $\sigma \approx 1.71$.

from fractions import Fraction

def mean_and_variance(dist):
    """dist: {value: probability}. Returns (E[X], Var(X)) exactly."""
    mu = sum(v * p for v, p in dist.items())
    ex2 = sum(v * v * p for v, p in dist.items())
    return mu, ex2 - mu * mu                      # Var = E[X^2] - (E[X])^2

die = {v: Fraction(1, 6) for v in range(1, 7)}
mu, var = mean_and_variance(die)
print(mu, var)
# Expected output:
# 7/2 35/12

The exact arithmetic confirms the hand computation: $E[X] = 7/2$ and $\operatorname{Var}(X) = 35/12$.

💡 Intuition: why we report $\sigma$, not $\operatorname{Var}$. Variance lives in squared units — for response times measured in milliseconds, variance is in ms², which is meaningless to a human. The standard deviation $\sigma = \sqrt{\operatorname{Var}}$ returns to milliseconds and answers the question you actually have: "typically, how far from average is a single observation?" A rough but handy reading: most of the probability of many distributions sits within a couple of standard deviations of the mean.

🔗 Connection. Variance is the lever behind concentration: results like Chebyshev's inequality and the Chernoff bounds use small variance to prove that a random variable is very unlikely to stray far from its mean. That is exactly how we argue a randomized algorithm "almost always" finishes on time or a hash table "almost never" has a badly overloaded bucket. We will not prove those bounds here, but the expectation-and-variance toolkit you are building is precisely their starting point — and Chapter 21 uses these ideas to analyze randomized algorithms like quicksort.

🔄 Check Your Understanding 1. State the computational formula for variance and explain why it is usually easier than the definition. 2. A random variable $X$ has $E[X] = 2$ and $E[X^2] = 6$. Find $\operatorname{Var}(X)$ and $\sigma$. 3. Why can variance never be negative? (Two reasons.)

Answers

  1. $\operatorname{Var}(X) = E[X^2] - (E[X])^2$; it is easier because you accumulate $E[X]$ and $E[X^2]$ in a single pass and subtract, instead of needing $\mu$ first and then a second pass over $(a-\mu)^2$.
  2. $\operatorname{Var}(X) = 6 - 2^2 = 2$, so $\sigma = \sqrt{2} \approx 1.41$. 3. It is the expectation of $(X-\mu)^2$, a sum of non-negative terms weighted by probabilities — a non-negative quantity. (And a squared real number is never negative.)

20.6 The probabilistic method; simulation vs. proof

We close with two ways to get a probability you can trust: an empirical one (simulate the experiment many times and measure) and a theoretical one (prove the value exactly). Then we turn the whole subject inside out with a proof technique that uses probability to establish facts that have nothing to do with chance.

Monte-Carlo simulation: estimating a probability by experiment

If you can run an experiment, you can estimate any of its probabilities: repeat it $k$ times, count how often the event occurs, and divide. The estimate $\hat{p} = (\text{# of times } E \text{ occurred}) / k$ approaches the true $P(E)$ as $k$ grows — this is the law of large numbers, the formal promise that long-run frequency converges to probability. This is Monte-Carlo simulation, and it is how you sanity-check a probability calculation (or estimate one too hard to compute).

import random

def estimate_prob(trial_fn, k):
    """Run trial_fn() k times; return the fraction of trials that returned True."""
    hits = sum(1 for _ in range(k) if trial_fn())
    return hits / k

random.seed(0)
two_dice_sum_is_7 = lambda: random.randint(1, 6) + random.randint(1, 6) == 7
print(round(estimate_prob(two_dice_sum_is_7, 100_000), 3))
# Expected output (true value is 1/6 ≈ 0.167; simulation is close, not exact):
# 0.167

The estimate hovers near the exact answer $1/6 \approx 0.1667$ but will not equal it — and that gap is the point. Simulation gives you evidence, fast and cheap, for any probability you can sample. It does not give you certainty: a different seed gives a slightly different number, and no finite $k$ pins down the exact value. (The error shrinks like $1/\sqrt{k}$, so squeezing one more decimal of accuracy costs about $100\times$ more trials — a steep price that is exactly why proof, when available, wins.)

⚠️ Common Pitfall: simulation suggests, it does not prove. Reporting "I ran it a million times and got 0.167, so the probability is $1/6$" is evidence, not a proof — the same way passing a million test cases is not a correctness proof (theme two of this book). Use simulation to find and check the answer; use the axioms, counting, and linearity to prove it. The two are complementary, and a careful analyst uses both: simulate first to catch gross errors, then derive the exact value.

The probabilistic method: proving existence with probability

Here is one of the most beautiful ideas in discrete mathematics, pioneered by Paul Erdős. Suppose you want to prove that some object with a desired property exists — a graph coloring with few conflicts, a code with large minimum distance, a tournament with a certain structure. Constructing one explicitly can be hard. The probabilistic method sidesteps construction entirely with a one-line logical move.

The probabilistic method (basic form). To prove that an object with property $Q$ exists, choose an object at random from some sample space and show that $P(\text{the random object has property } Q) > 0$. If a randomly chosen object has $Q$ with *positive* probability, then at least one object with $Q$ must exist — because if no object had $Q$, that probability would be exactly $0$.

The strategy first. This is proof by contradiction (Chapter 5) in disguise. If the set of objects with property $Q$ were empty, then the event "the random object has $Q$" would be the impossible event, with probability $0$. So a probability strictly greater than $0$ is logically incompatible with "no such object exists." We never build the object; we only show the alternative is impossible.

Let's apply it to a genuine combinatorics theorem about two-coloring.

Theorem. Consider $n$ items and a collection of $m$ "blocks," each block being a subset of exactly $k$ of the items. If $m < 2^{k-1}$, then it is possible to color every item red or blue so that no block is monochromatic (no block has all $k$ of its items the same color).

The strategy first. Color each item red or blue independently by a fair coin flip. For one fixed block, compute the probability it comes out all-one-color — it is small, $2^{1-k}$. There are $m$ blocks; by the union bound the probability that some block is monochromatic is at most $m \cdot 2^{1-k}$. If $m < 2^{k-1}$, that bound is below $1$, so the probability of a "bad" coloring is less than 1 — which means the probability of a good (no-monochromatic-block) coloring is positive. A good coloring therefore exists.

Proof. Color each of the $n$ items red or blue independently, each color with probability $1/2$. Fix one block $B$ (a set of $k$ items). The block is monochromatic exactly when all $k$ items get red (one specific outcome of $k$ independent fair flips, probability $2^{-k}$) or all get blue (another, probability $2^{-k}$). These are disjoint, so by additivity $$P(B \text{ is monochromatic}) = 2^{-k} + 2^{-k} = 2 \cdot 2^{-k} = 2^{1-k}.$$ Let $A$ be the event "at least one block is monochromatic," i.e. $A = \bigcup_{B} {B \text{ is monochromatic}}$, a union over the $m$ blocks. The probability of a union is at most the sum of the probabilities (this is the union bound, which follows from inclusion–exclusion by dropping the non-negative intersection terms — or from monotonicity and additivity). Hence $$P(A) \le \sum_{B} P(B \text{ is monochromatic}) = m \cdot 2^{1-k}.$$ By hypothesis $m < 2^{k-1}$, so $m \cdot 2^{1-k} < 2^{k-1} \cdot 2^{1-k} = 2^{0} = 1$. Therefore $P(A) < 1$, and so $$P(\text{no block is monochromatic}) = 1 - P(A) > 0.$$ Since a uniformly random coloring is "good" with positive probability, at least one good coloring exists. $\blacksquare$

Read that proof again and notice what it did not do: it never exhibited a single valid coloring. It proved one exists by showing a random coloring works with positive probability, which is logically enough. This is the probabilistic method's signature — and it scales to results, like the existence of good error-correcting codes (Chapter 38) and Ramsey-type bounds (Chapter 40), where no efficient explicit construction is known at all.

🚪 Threshold Concept: probability can prove non-probabilistic theorems. The statement above is pure combinatorics — there is no randomness in "a coloring with no monochromatic block exists." Yet the cleanest proof introduces randomness as a tool and then discards it. This inverts the usual relationship: instead of using math to reason about chance, we use chance to reason about math. Once you see that "expected value $\le c$ implies some outcome is $\le c$" and "positive probability implies existence," you hold two of the most leverage-dense proof techniques in the discipline. They recur in algorithm design, coding theory, and extremal combinatorics for the rest of your career.

🧩 Productive Struggle. Here is the expectation-flavored cousin of the method, worth attempting before you read Chapter 40. If a random variable $X$ has expected value $E[X] = \mu$, prove there is at least one outcome with $X \ge \mu$ and at least one with $X \le \mu$. (Hint: if every outcome had $X < \mu$, what could you say about the weighted average $E[X]$? Contradiction.) This "an outcome at least as good as average must exist" idea is the averaging form of the probabilistic method.

🔄 Check Your Understanding 1. In one sentence, what does the probabilistic method conclude from "$P(\text{random object has } Q) > 0$," and which Chapter 5 technique is it secretly using? 2. Why is "I simulated it 10 million times and never saw a failure" not a proof that failure is impossible? 3. In the two-coloring proof, where exactly did the hypothesis $m < 2^{k-1}$ get used?

Answers

  1. It concludes that an object with property $Q$ exists (because a probability of 0 would be forced if none did); it is proof by contradiction. 2. Simulation samples finitely many cases; "never observed" is consistent with a small-but-positive failure probability — absence of evidence is not a proof of impossibility (you would need the axioms/counting to show the probability is exactly 0).
  2. At the final step: $m < 2^{k-1}$ is exactly what makes the union bound $m \cdot 2^{1-k}$ strictly less than $1$, which is what forces $P(\text{good coloring}) > 0$.

Project Checkpoint: a probability engine for the Toolkit

This chapter opens dmtoolkit/probability.py, the module the style bible reserves for Chapters 20–21. We add the two pillars of this chapter — a Monte-Carlo simulator and an exact expected-value calculator — which together embody theme four: simulate estimates a probability empirically, while expected_value computes an expectation exactly over a finite space. (Chapter 21 will add bayes for conditional reasoning; we leave that slot open here.)

Create dmtoolkit/probability.py and add:

from fractions import Fraction
import random

def simulate(trial_fn, k):
    """Estimate P(event) by running trial_fn() k times.
    trial_fn() returns True when the event occurs. Returns the empirical fraction.
    This ESTIMATES a probability; it does not prove it (see Chapter 20)."""
    return sum(1 for _ in range(k) if trial_fn()) / k

def expected_value(rv, space, weight=None):
    """Exact E[rv] over a finite sample space.
    rv: outcome -> number.  space: iterable of outcomes.
    weight: outcome -> probability (defaults to uniform = equally likely)."""
    space = list(space)
    if weight is None:                       # equally-likely (Laplace) model
        n = len(space)
        return sum(Fraction(rv(s), n) for s in space)
    return sum(rv(s) * weight(s) for s in space)

The two functions are the chapter in code. simulate is the law of large numbers made executable — point it at any samplable experiment and it returns an approximate probability. expected_value implements the definition $E[X] = \sum_s X(s)\,p(s)$ directly, defaulting to the equally-likely model so that a uniform expectation is just the average of $X$ over the outcomes — and using Fraction so the exact case stays exact. Here they are confirming each other on the two-dice sum, whose true expectation we computed as $7$:

from itertools import product
import random
from dmtoolkit.probability import simulate, expected_value

space = list(product(range(1, 7), repeat=2))      # 36 equally likely (i, j)
dice_sum = lambda s: s[0] + s[1]

random.seed(1)
print(expected_value(dice_sum, space))             # exact
print(round(simulate(lambda: (lambda s: dice_sum(s) == 7)
                     ((random.randint(1,6), random.randint(1,6))), 100_000), 3))  # P(sum==7)
# Expected output:
# 7
# 0.167

expected_value returns the exact mean $7$; simulate returns an estimate of $P(\text{sum} = 7) \approx 0.167 \approx 1/6$. The exact value is certain; the simulated one is close but seed-dependent — the proof-vs-computation duality, now a unit test.

Toward the capstone. With simulate and expected_value, probability.py can already analyze the randomized algorithms and hashing scenarios that recur through the rest of the book: estimate a load factor, average a running time, or check a tricky probability before you trust a derivation. Chapter 21 completes the module with conditional probability and bayes, the engine behind the spam-filter and naive-Bayes examples — and the probabilistic reasoning here underpins the randomized analysis you will do on graph and number-theory algorithms later.

Toolkit so far: logic.py (Chapters 1–3), sets.py (Chapter 8), relations.py (Chapters 12–13), combinatorics.py (Chapters 15–17), recurrences.py (Chapters 6, 18–19), and now the start of probability.py (Chapter 20: simulate, expected_value).


Summary

Discrete probability lays a measure on top of the set theory and counting you already know. Reference-grade recap:

The model.

Object Definition
Sample space $S$ set of all mutually-exclusive, exhaustive outcomes
Event $E \subseteq S$ a subset of outcomes; occurs when the outcome is in $E$
Probability function $P$ $P(E) = \sum_{s \in E} p(s)$, with $p(s) \ge 0$ and $\sum_s p(s) = 1$
Equally-likely (Laplace) $P(E) = \lvert E\rvert / \lvert S\rvert$ — only when outcomes are equally likely

Axioms and consequences.

Rule Statement
Non-negativity / normalization $P(E) \ge 0$; $\;P(S) = 1$
Additivity (disjoint $E,F$) $P(E \cup F) = P(E) + P(F)$
Complement $P(\overline{E}) = 1 - P(E)$
Monotonicity $E \subseteq F \Rightarrow P(E) \le P(F)$
Union (inclusion–exclusion) $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
Union bound $P(\bigcup_i E_i) \le \sum_i P(E_i)$

Random variables and their summaries.

Quantity Formula Notes
Random variable $X \colon S \to \mathbb{R}$ a function on outcomes; $\{X=a\}$ is an event
Indicator $\mathbf{1}_A(s) = 1$ if $s\in A$ else $0$ $E[\mathbf{1}_A] = P(A)$
Expected value $E[X] = \sum_s X(s)p(s) = \sum_a a\,P(X=a)$ the balance point / long-run average
Linearity $E[aX+bY] = aE[X]+bE[Y]$ holds with or without independence
Variance $\operatorname{Var}(X) = E[(X-\mu)^2] = E[X^2]-(E[X])^2$ spread; squared units
Standard deviation $\sigma = \sqrt{\operatorname{Var}(X)}$ spread in original units

Method notes.

  • Complement first. For "at least one…" events, compute $1 - P(\text{none})$ (birthday / collision).
  • Write it as a sum of indicators. To get a hard expectation, set $X = \sum_i \mathbf{1}_{A_i}$, then $E[X] = \sum_i P(A_i)$ by linearity — no independence needed (hat-check $\Rightarrow 1$; empty buckets $\Rightarrow m(1-1/m)^n$).
  • Variance: use $E[X^2]-(E[X])^2$ (one pass), and report $\sigma$ (original units).
  • Probabilistic method: $P(\text{random object has } Q) > 0 \Rightarrow$ such an object exists (proof by contradiction); the union bound bounds the "bad" probability.
  • Simulation vs. proof: simulate to estimate/check (error $\sim 1/\sqrt{k}$); prove to settle.

Spaced Review

Retrieval keeps knowledge available. Answer from memory before checking.

  1. (Ch. 16) The probability that a random 5-card hand is a flush uses $\binom{52}{5}$ in the denominator. What does $\binom{52}{5}$ count, and what counting rule gives the numerator $4\binom{13}{5}$?
  2. (Ch. 16) In the birthday problem, the number of ways for $n$ people to have distinct birthdays is $P(365, n)$. Write $P(365, n)$ as a falling product and say why "order matters" here.
  3. (Ch. 15) The equally-likely formula $P(E) = \lvert E\rvert/\lvert S\rvert$ relies on which two Chapter 15 rules to compute $\lvert E\rvert$ and $\lvert S\rvert$ for the two-dice sample space?
  4. (Ch. 15) Why is the two-coin sample space $\{HH,HT,TH,TT\}$ (size 4) and not $\{$0 heads, 1 head, 2 heads$\}$ (size 3)? Tie your answer to a counting principle.

Answers

  1. $\binom{52}{5}$ counts the number of 5-card subsets (unordered hands) of a 52-card deck, $2{,}598{,}960$. The numerator uses the product rule: choose the suit (4 ways) and then choose 5 of that suit's 13 cards ($\binom{13}{5}$ ways), giving $4\binom{13}{5}$. 2. $P(365,n) = 365 \cdot 364 \cdots (365-n+1)$, the top $n$ falling factors of $365!$. Order matters because we build the sample space as ordered birthday-tuples (person 1's birthday, person 2's, …), so distinct assignments are distinct outcomes. 3. The product rule ($\lvert S\rvert = 6 \times 6 = 36$ ordered pairs) and, for an event like "sum $= 5$," counting its outcomes (a small enumeration / sum rule over the pairs). 4. Because the four ordered outcomes $HH,HT,TH,TT$ are equally likely (product rule: $2\times 2$), whereas "1 head" bundles two of them and is therefore twice as likely as "0 heads" — the 3-element set is not an equally-likely space, so $P = \lvert E\rvert/\lvert S\rvert$ would be wrong on it.

What's Next

You can now build a probability space, compute event probabilities by counting, summarize a random variable with its expectation and variance, and even prove things exist by making them random. But every probability so far has been unconditional — computed before any information arrives. The moment you learn something ("the first die shows an even number," "the email contains the word free," "the test came back positive"), the probabilities should update. That updating is conditional probability, and the rule that governs it — Bayes' theorem — is the mathematical heart of spam filters, medical-test interpretation, and the randomized algorithms whose analysis we have only previewed here. Chapter 21 develops conditional probability and independence, exposes the notorious base-rate fallacy, builds a naive-Bayes classifier, and finishes the probability.py module with bayes.