> "When the facts change, I change my mind. What do you do, sir?"
Prerequisites
- 20
Learning Objectives
- Compute conditional probabilities from a sample space and from a joint table, and explain what conditioning does to the sample space.
- Test two events for independence using the product rule, and distinguish independence from mutual exclusivity.
- Apply Bayes' theorem to update a prior into a posterior, including the version that uses the law of total probability for the denominator.
- Diagnose the base-rate fallacy in a medical-test or spam-filter scenario and quantify how a rare condition deflates a positive test's reliability.
- Derive and implement a naive Bayes classifier, and state precisely which independence assumption makes it 'naive'.
- Analyze a randomized algorithm's error or expected running time using conditioning and linearity of expectation.
In This Chapter
Chapter 21: Conditional Probability and Bayes' Theorem
"When the facts change, I change my mind. What do you do, sir?" — commonly attributed to John Maynard Keynes
Overview
Your spam filter just flagged an email. Your fraud-detection model just scored a transaction as suspicious. Your A/B test just reported that variant B converts better. Each of these is a claim about the world made after seeing some evidence — and each is worthless unless you can answer the next question: given what I just observed, how should I revise what I believe? That question — updating a probability in light of new information — is the single most useful idea in this chapter, and arguably in applied probability as a whole. It is what separates a model that reacts to data from a model that merely reports it.
In Chapter 20 you learned to build a probability space and compute the probability of an event before any evidence arrives. This chapter is about after. The central object is conditional probability: the probability of one event given that another has occurred. From it we get independence (when does evidence tell you nothing?), Bayes' theorem (the formula for flipping a conditional probability around), and a cluster of CS applications that are nothing but Bayes in disguise: spam classification, medical-test interpretation, and the analysis of algorithms that flip coins to run faster.
There is a recurring shock in this chapter, and it is worth naming up front: human intuition about conditional probability is systematically wrong, and the errors are not random — they lean in a predictable direction. Doctors, jurors, and engineers all routinely over-trust a positive test for a rare condition. By the end of section 21.4 you will be able to compute, on a napkin, exactly how over-confident that instinct is, and the gap will surprise you every time.
In this chapter you will learn to:
- Compute conditional probabilities directly and recognize conditioning as shrinking the sample space.
- Decide whether two events are independent, and never again confuse "independent" with "disjoint".
- Apply Bayes' theorem to turn a prior belief plus evidence into a posterior belief.
- Detect and quantify the base-rate fallacy in tests, filters, and detectors.
- Build a naive Bayes classifier from scratch and say exactly what the "naive" assumption costs.
- Analyze a randomized algorithm — why Monte Carlo error shrinks with repetition, and why randomized quicksort is fast in expectation.
Learning Paths
🏃 Fast Track: If conditional probability and independence are already comfortable, skim 21.1–21.2, then read 21.3 (Bayes) and 21.4 (base-rate fallacy) carefully — those are the ideas everyone gets wrong — and do the build in 21.5. Attempt the ⭐⭐ and ⭐⭐⭐ exercises.
📖 Standard Path: Read in order. Work every
🔄 Check Your Understandingfrom memory before reading on, and re-derive Bayes' theorem yourself (it is two lines) before looking at ours.🔬 Deep Dive: After the chapter, prove that pairwise independence does not imply mutual independence (a three-event counterexample is in the exercises), and work the full expected-comparison count for randomized quicksort using indicator variables and linearity of expectation.
21.1 Conditional probability
Start with a concrete deck of cards. You draw one card from a well-shuffled standard 52-card deck. The probability it is the ace of spades is $\frac{1}{52}$. Now a friend who peeked tells you, "It's a spade." Now what is the probability it is the ace of spades? It is no longer $\frac{1}{52}$ — there are only 13 spades, each equally likely given the new information, so the answer is $\frac{1}{13}$. The evidence "it's a spade" did not change the card you drew; it changed what you know, and therefore the probability you should assign.
Here is the key mental move. Before the hint, your sample space was all 52 cards. After the hint, you throw away every outcome that contradicts the evidence — every non-spade — and renormalize the probability over what remains. Conditioning shrinks the sample space to the conditioning event and rescales so the probabilities again sum to 1.
💡 Intuition: Conditioning on $B$ means "zoom in on the world where $B$ happened, and treat that world as your new everything." The probability of $A$ given $B$ is the fraction of that zoomed-in world in which $A$ also happens. You are not changing reality — you are changing which slice of it you are measuring.
Let's count it directly to see the formula appear on its own. With $B$ = "the card is a spade" and $A$ = "the card is the ace of spades," the outcomes consistent with $B$ that are also in $A$ are just the ace of spades — one outcome. So $$P(A \mid B) = \frac{\text{outcomes in both } A \text{ and } B}{\text{outcomes in } B} = \frac{\lvert A \cap B \rvert}{\lvert B \rvert} = \frac{1}{13}.$$ Divide top and bottom by $\lvert S \rvert = 52$ and the counts become probabilities: $$P(A \mid B) = \frac{\lvert A \cap B \rvert / \lvert S \rvert}{\lvert B \rvert / \lvert S \rvert} = \frac{P(A \cap B)}{P(B)}.$$ That last form is the definition, and it works even when outcomes are not equally likely (when the counting shortcut from Chapter 20 no longer applies).
Definition (conditional probability). Let $A$ and $B$ be events in a probability space with $P(B) > 0$. The conditional probability of $A$ given $B$ is $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$ If $P(B) = 0$, the conditional probability is undefined — you cannot condition on something that cannot happen.
🔗 Connection. The numerator $P(A \cap B)$ is the probability of the joint event "both $A$ and $B$", built from the intersection of events you met in Chapter 20 (which in turn is the set intersection $\cap$ from set theory). Conditioning is set intersection plus renormalization — nothing more exotic.
Rearranging the definition gives a form you will use constantly — the multiplication rule: $$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A).$$ Both products equal $P(A \cap B)$, which is why they equal each other; that symmetry is the seed of Bayes' theorem in 21.3.
Worked example: a joint probability table
Conditional probabilities are easiest to compute when the joint probabilities are laid out in a table. Suppose a hypothetical SaaS company classifies each signup by two attributes: whether the user came from a paid ad ($A$) or organic search ($\overline{A}$), and whether they converted to a paid plan ($C$) or not ($\overline{C}$). Out of all signups:
| Converted $C$ | Not converted $\overline{C}$ | Row total | |
|---|---|---|---|
| Paid $A$ | 0.10 | 0.30 | 0.40 |
| Organic $\overline{A}$ | 0.15 | 0.45 | 0.60 |
| Col total | 0.25 | 0.70 | 1.00 |
Wait — the column totals are $0.25$ and $0.70$, which sum to $0.95$, not $1$. That is a deliberate error to make a point: every joint table's entries must sum to 1. Let's fix the bottom-right cell. The four cells $0.10, 0.30, 0.15, \overline{C}\cap\overline{A}$ must total $1$, and $0.10+0.30+0.15 = 0.55$, so the missing cell is $0.45$, the column total for $\overline{C}$ is $0.30+0.45 = 0.75$, and everything reconciles:
| Converted $C$ | Not converted $\overline{C}$ | Row total | |
|---|---|---|---|
| Paid $A$ | 0.10 | 0.30 | 0.40 |
| Organic $\overline{A}$ | 0.15 | 0.45 | 0.60 |
| Col total | 0.25 | 0.75 | 1.00 |
Now read conditional probabilities straight off the table. "Given a user came from a paid ad, what is the probability they converted?" Condition on the Paid row, whose total is $0.40$: $$P(C \mid A) = \frac{P(C \cap A)}{P(A)} = \frac{0.10}{0.40} = 0.25.$$ "Given a user converted, what is the probability they came from a paid ad?" Now condition on the Converted column, whose total is $0.25$: $$P(A \mid C) = \frac{P(A \cap C)}{P(C)} = \frac{0.10}{0.25} = 0.40.$$ Notice $P(C \mid A) = 0.25 \ne 0.40 = P(A \mid C)$. A conditional probability and its reverse are different numbers. Confusing them is the engine of the base-rate fallacy (21.4), so fix the distinction now: $P(C \mid A)$ divides the corner cell by the row total; $P(A \mid C)$ divides the same corner cell by the column total.
⚠️ Common Pitfall: $P(A \mid B)$ and $P(B \mid A)$ are not interchangeable. "The probability a test is positive given you're sick" is usually high; "the probability you're sick given a positive test" can be low. They share the numerator $P(A \cap B)$ but divide by different denominators. Swapping them is called the prosecutor's fallacy in court and the base-rate fallacy in medicine.
Here is the table computation in code. We represent the joint distribution as a dictionary keyed by the pair of outcomes and compute a conditional by summing the relevant cells.
joint = {
("paid", "convert"): 0.10, ("paid", "no"): 0.30,
("organic", "convert"): 0.15, ("organic", "no"): 0.45,
}
def conditional(joint, target, given):
"""P(target_attr=target_val | given_attr=given_val) from a joint table.
target and given are (index, value) pairs into the dict keys."""
ti, tv = target
gi, gv = given
p_given = sum(p for k, p in joint.items() if k[gi] == gv)
p_both = sum(p for k, p in joint.items() if k[gi] == gv and k[ti] == tv)
return p_both / p_given
# P(convert | paid): target is index 1 == "convert", given index 0 == "paid"
print(round(conditional(joint, (1, "convert"), (0, "paid")), 4))
# Expected output:
# 0.25
The denominator p_given sums the whole Paid row ($0.10 + 0.30 = 0.40$); the numerator picks out the
single Paid ∩ Convert cell ($0.10$); the quotient is $0.25$, matching the hand calculation.
🔄 Check Your Understanding 1. In one sentence, what does conditioning on $B$ do to the sample space? 2. From the corrected table, compute $P(\overline{C} \mid \overline{A})$ — the probability an organic user does not convert. 3. True or false: if $P(A \mid B) = 0.9$ then $P(B \mid A) = 0.9$.
Answers
- It shrinks (restricts) the sample space to the outcomes in $B$ and renormalizes the probabilities so they sum to 1 over that smaller space. 2. $P(\overline{C} \mid \overline{A}) = \frac{0.45}{0.60} = 0.75$. 3. **False** — $P(A\mid B)$ and $P(B\mid A)$ are generally different; they are equal only in special cases (e.g., when $P(A)=P(B)$).
21.2 Independence
Sometimes evidence tells you nothing. If you flip a fair coin twice, learning that the first flip was heads does not change the probability that the second is heads — it is still $\frac{1}{2}$. The two flips are independent. This is the formal counterpart of an idea you already use whenever you assume two random number generator calls, two network packet losses, or two users' actions don't influence each other.
Independence has a clean definition in terms of conditioning: $A$ and $B$ are independent when conditioning on $B$ leaves $A$'s probability unchanged, $P(A \mid B) = P(A)$. Substituting the definition of conditional probability, $\frac{P(A \cap B)}{P(B)} = P(A)$, and clearing the denominator gives the symmetric form that we take as the official definition (it has the advantage of not requiring $P(B) > 0$):
Definition (independence). Two events $A$ and $B$ are independent if $$P(A \cap B) = P(A)\,P(B).$$ Equivalently, when $P(B) > 0$, $P(A \mid B) = P(A)$ (and when $P(A) > 0$, $P(B \mid A) = P(B)$): the occurrence of one event does not change the probability of the other.
The product form is the one to compute with; the conditional form is the one to understand with.
⚠️ Common Pitfall: independent ≠ mutually exclusive. These are nearly opposite ideas, yet they are constantly confused. Mutually exclusive (disjoint) events satisfy $P(A \cap B) = 0$ — they cannot both happen. Independent events satisfy $P(A \cap B) = P(A)P(B)$ — knowing one happened tells you nothing about the other. If $A$ and $B$ are mutually exclusive and both have positive probability, they are about as far from independent as possible: learning $A$ occurred tells you $B$ definitely did not, which is a huge amount of information. Disjoint-with-positive-probability $\Rightarrow$ dependent.
Worked example: testing independence from a table
Return to the SaaS table. Are "paid ad" ($A$) and "converted" ($C$) independent? Compute both sides of the product test. We have $P(A) = 0.40$, $P(C) = 0.25$, and $P(A \cap C) = 0.10$. Then $$P(A)\,P(C) = 0.40 \times 0.25 = 0.10 = P(A \cap C).$$ They match exactly, so in this table paid and converted are independent — the channel a user comes through carries no information about whether they convert. (Equivalently, $P(C \mid A) = 0.25 = P(C)$, which we computed in 21.1.) That is a strong, slightly surprising claim about the business: it says the paid channel is converting at exactly the baseline rate, so it is buying traffic, not quality.
Now a counterexample. Suppose instead the company observed this table:
| Converted $C$ | Not converted $\overline{C}$ | |
|---|---|---|
| Paid $A$ | 0.20 | 0.20 |
| Organic $\overline{A}$ | 0.05 | 0.55 |
Here $P(A) = 0.40$, $P(C) = 0.25$, but $P(A \cap C) = 0.20 \ne 0.10 = P(A)P(C)$. Paid and converted are now dependent: $P(C \mid A) = \frac{0.20}{0.40} = 0.50$, double the baseline $0.25$. Coming from a paid ad doubles the conversion probability. Same marginal totals, completely different story — which is exactly why you must check the joint cell, never just the row and column totals.
def independent(p_a, p_b, p_a_and_b, tol=1e-9):
"""Return True iff events A, B are independent: P(A and B) == P(A)*P(B)."""
return abs(p_a_and_b - p_a * p_b) < tol
print(independent(0.40, 0.25, 0.10)) # the first table
print(independent(0.40, 0.25, 0.20)) # the second table
# Expected output:
# True
# False
We use a tolerance because floating-point products rarely land on an exact decimal; $0.40 \times 0.25$
is representable here, but in general comparing floats with == is a bug waiting to happen.
🚪 Threshold Concept: independence is an assumption you import, not a property you observe. In real data, two events are almost never exactly independent. When a model "assumes independence" — as naive Bayes will in 21.5, as the analysis of hashing does, as nearly every tractable probabilistic model does — it is making a deliberate simplification that trades fidelity for the ability to multiply probabilities instead of estimating an exponential number of joint entries. Recognizing where an independence assumption has been imported, and what it costs when it is false, is one of the most valuable habits a probabilistic modeler can develop. Watch for it for the rest of this chapter and the rest of your career.
A quick note on more than two events. Events $A_1, \dots, A_n$ are mutually independent if every subset multiplies: for any sub-collection, the probability of the intersection equals the product of the probabilities. This is strictly stronger than pairwise independence (every pair multiplies). It is possible for three events to be pairwise independent yet not mutually independent — a classic counterexample appears in the exercises. For the rest of this chapter, when we multiply probabilities of many events (as in naive Bayes), we are invoking mutual independence.
🔄 Check Your Understanding 1. State the product test for independence and the equivalent conditional test. 2. Two events are disjoint and each has probability $0.3$. Are they independent? Justify with the product test. 3. Why can't you decide whether two attributes are independent from the row and column totals alone?
Answers
- Product: $P(A\cap B)=P(A)P(B)$. Conditional: $P(A\mid B)=P(A)$ (equivalently $P(B\mid A)=P(B)$), when the conditioning event has positive probability. 2. No. Disjoint means $P(A\cap B)=0$, but $P(A)P(B)=0.3\times0.3=0.09\ne 0$, so the product test fails — they are dependent. 3. The totals fix the marginals $P(A)$ and $P(C)$ but not the joint cell $P(A\cap C)$; many different joint tables share the same margins (we saw two), and independence is a statement about the joint cell relative to the product of margins.
21.3 Bayes' theorem
Here is the problem Bayes' theorem solves, stated as a programmer would meet it. You can easily measure $P(\text{evidence} \mid \text{cause})$ — for a disease test, the lab tells you $P(\text{positive} \mid \text{sick})$, the test's detection rate. But what you actually want, when a specific patient tests positive, is the reverse: $P(\text{sick} \mid \text{positive})$. You have the conditional pointing one way and you need it pointing the other way. Bayes' theorem is the formula that flips it.
The derivation is two lines, and you have already seen both. Recall the multiplication rule from 21.1, written two ways because $P(A \cap B)$ doesn't care about order: $$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A).$$ Take the outer two expressions and divide both sides by $P(B)$:
Bayes' Theorem. For events $A$ and $B$ with $P(B) > 0$, $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
That is the whole theorem. There is no deep machinery — it is the definition of conditional probability applied twice and divided. What makes it profound is how you read it, not how you prove it. The three ingredients have names that turn the formula into a story about learning:
- $P(A)$ is the prior — what you believed about $A$ before seeing the evidence $B$.
- $P(B \mid A)$ is the likelihood — how probable the evidence is if $A$ is true.
- $P(A \mid B)$ is the posterior — your updated belief about $A$ after seeing the evidence.
Definition (prior and posterior). In a Bayesian update, the prior probability $P(A)$ is the probability assigned to a hypothesis $A$ before observing evidence; the posterior probability $P(A \mid B)$ is the probability assigned to $A$ after observing evidence $B$. Bayes' theorem is the rule that maps prior to posterior.
💡 Intuition: Read Bayes as $\text{posterior} \propto \text{likelihood} \times \text{prior}$. The evidence reweights your prior: hypotheses that predicted the evidence well (high likelihood) get boosted; hypotheses that didn't get suppressed. The denominator $P(B)$ is just the normalizer that makes the posterior probabilities sum to 1 across all hypotheses.
The denominator: the law of total probability
Often you cannot read $P(B)$ off the problem directly — but you can compute it from the priors and likelihoods you already have, using the law of total probability. The idea: $B$ happens either with $A$ or without $A$, and those two routes are disjoint and cover everything, so add them: $$P(B) = P(B \mid A)\,P(A) + P(B \mid \overline{A})\,P(\overline{A}).$$ Substituting into Bayes gives the form you will actually use: $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \overline{A})\,P(\overline{A})}.$$ The numerator is one term of the denominator. That structural fact — numerator is the matching slice of the denominator — is the single most reliable way to remember the formula under pressure.
The strategy first. We will prove the law of total probability for the two-case split, because it is the workhorse behind every Bayes computation in this chapter. The trick is to slice the event $B$ into the part inside $A$ and the part outside $A$; those slices are disjoint, so their probabilities add (additivity, from the Chapter 20 axioms), and each slice is exactly a multiplication-rule product.
Theorem (law of total probability, two-case form). For any events $A$ and $B$ with $0 < P(A) < 1$, $$P(B) = P(B \mid A)\,P(A) + P(B \mid \overline{A})\,P(\overline{A}).$$
Proof. The events $A$ and $\overline{A}$ partition the sample space: every outcome is in exactly one of them. Intersecting with $B$, the events $B \cap A$ and $B \cap \overline{A}$ are disjoint, and their union is $B$ (each outcome of $B$ is either in $A$ or not). By the additivity axiom from Chapter 20, for disjoint events probabilities add: $$P(B) = P(B \cap A) + P(B \cap \overline{A}).$$ Now apply the multiplication rule (21.1) to each term: $P(B \cap A) = P(B \mid A)\,P(A)$ and $P(B \cap \overline{A}) = P(B \mid \overline{A})\,P(\overline{A})$ (both conditionals are defined because $0 < P(A) < 1$ forces $P(A) > 0$ and $P(\overline{A}) > 0$). Substituting, $$P(B) = P(B \mid A)\,P(A) + P(B \mid \overline{A})\,P(\overline{A}). \qquad \blacksquare$$
The same argument with a partition into $n$ cases $A_1, \dots, A_n$ (pairwise disjoint, covering everything) gives the general law $P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$, and the general Bayes' theorem $P(A_j \mid B) = \frac{P(B \mid A_j)P(A_j)}{\sum_i P(B \mid A_i)P(A_i)}$. We will use the two-case version throughout 21.4 and the $n$-case version implicitly in 21.5.
Worked example: which bag?
You have two indistinguishable bags. Bag 1 holds 7 red and 3 blue marbles; Bag 2 holds 2 red and 8 blue. You pick a bag uniformly at random, draw one marble, and it is red. What is the probability you drew from Bag 1?
Set $A$ = "chose Bag 1" and $B$ = "drew red." The prior is $P(A) = P(\overline{A}) = \frac{1}{2}$. The likelihoods come from the bag compositions: $P(B \mid A) = \frac{7}{10}$ and $P(B \mid \overline{A}) = \frac{2}{10}$. By the law of total probability, $$P(B) = \tfrac{7}{10}\cdot\tfrac{1}{2} + \tfrac{2}{10}\cdot\tfrac{1}{2} = \tfrac{7}{20} + \tfrac{2}{20} = \tfrac{9}{20}.$$ Then Bayes: $$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} = \frac{\frac{7}{10}\cdot\frac{1}{2}}{\frac{9}{20}} = \frac{7/20}{9/20} = \frac{7}{9} \approx 0.78.$$ The red draw shifted your belief from a prior of $\frac{1}{2}$ to a posterior of $\frac{7}{9}$ — sensible, since Bag 1 is the red-heavy bag and you drew red. The evidence pulled belief toward the hypothesis that predicted it.
def bayes(p_a, p_b_given_a, p_b):
"""Posterior P(A | B) from prior P(A), likelihood P(B|A), evidence P(B)."""
return (p_b_given_a * p_a) / p_b
p_a = 0.5
p_b_given_a = 0.7
p_b_given_not_a = 0.2
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a) # law of total probability
print(round(bayes(p_a, p_b_given_a, p_b), 4))
# Expected output:
# 0.7778
We compute $P(B) = 0.7(0.5) + 0.2(0.5) = 0.45$ first, then bayes(0.5, 0.7, 0.45) $= 0.35/0.45 =
0.7\overline{7}$, which rounds to $0.7778$. This bayes function is exactly the Toolkit increment for
this chapter — see the Project Checkpoint.
🔄 Check Your Understanding 1. Write Bayes' theorem and label each of the four pieces (prior, likelihood, evidence, posterior). 2. In the bag example, recompute the posterior $P(\text{Bag 1} \mid \text{blue})$ if the draw had been blue instead of red. 3. Why is the numerator of the expanded Bayes formula always one of the terms in its denominator?
Answers
- $P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$: prior $P(A)$, likelihood $P(B\mid A)$, evidence $P(B)$, posterior $P(A\mid B)$. 2. $P(\text{blue}\mid \text{Bag 1})=\frac{3}{10}$, $P(\text{blue}\mid\text{Bag 2})=\frac{8}{10}$, so $P(\text{blue})=\frac{3}{20}+\frac{8}{20}=\frac{11}{20}$ and the posterior is $\frac{3/20}{11/20}=\frac{3}{11}\approx0.27$ — blue evidence pulls belief away from the red-heavy Bag 1. 3. Because the denominator $P(B)$ is computed by the law of total probability as a sum over all hypotheses, and the numerator is the single term corresponding to the hypothesis $A$ in question.
21.4 The base-rate fallacy (medical tests and spam)
Now we cash in. This is the section where Bayes' theorem stops being algebra and starts overturning intuition. Consider a disease that affects 1 in 1,000 people. There is a test that is "99% accurate," by which we mean: if you have the disease, it is positive 99% of the time (sensitivity $0.99$), and if you don't, it is negative 99% of the time (so it has a 1% false-positive rate). You test positive. What is the probability you actually have the disease?
Almost everyone — including, in published studies, a majority of physicians — answers around 99%, or at least "very high." The correct answer is about 9%. Let's see why, two ways.
Way 1: Bayes directly. Let $D$ = "has disease," $T$ = "tests positive." The prior, also called the base rate, is $P(D) = 0.001$. The likelihoods are $P(T \mid D) = 0.99$ (true-positive rate) and $P(T \mid \overline{D}) = 0.01$ (false-positive rate). The law of total probability gives the evidence: $$P(T) = P(T \mid D)P(D) + P(T \mid \overline{D})P(\overline{D}) = (0.99)(0.001) + (0.01)(0.999).$$ Compute the two terms: $(0.99)(0.001) = 0.00099$ and $(0.01)(0.999) = 0.00999$, so $P(T) = 0.01098$. Then $$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)} = \frac{0.00099}{0.01098} \approx 0.0902.$$ About 9%, not 99%.
Definition (base rate). The base rate of an event is its unconditional (prior) probability in the population — $P(D)$ — before any test or evidence is applied. The base-rate fallacy is the error of neglecting this prior when interpreting evidence, typically by treating $P(D \mid T)$ as if it equaled the test accuracy $P(T \mid D)$.
Way 2: natural frequencies (the intuition pump). Forget the formula; imagine 100,000 people.
- $0.1\%$ have the disease: that is $100$ sick people. The test catches $99\%$ of them: $99$ true positives.
- $99{,}900$ are healthy. The test wrongly flags $1\%$ of them: $999$ false positives.
- Total positives: $99 + 999 = 1{,}098$. Of those, only $99$ are actually sick.
So $P(\text{sick} \mid \text{positive}) = \frac{99}{1098} \approx 9\%$ — the same answer, now obvious. The false positives swamp the true positives, not because the test is bad, but because the healthy group is so much larger that even a tiny 1% error rate produces ten times more false alarms than there are real cases.
💡 Intuition: When the condition is rare, the denominator of Bayes is dominated by false positives from the huge healthy majority. A test's accuracy is a likelihood, $P(T \mid D)$; the thing you care about is a posterior, $P(D \mid T)$; and the base rate is precisely the bridge between them. Ignore the base rate and you confuse a 99%-accurate test with a 99%-reliable diagnosis. They are not the same — they differ here by a factor of more than ten.
This is not a quirk of medicine — it is the structural reason rare-event detection is hard, and it is everywhere in CS:
🔗 Connection: spam filters, fraud, and intrusion detection all live here. A fraud detector that is "99.9% accurate" sounds great until you remember fraudulent transactions might be 0.1% of all transactions; the same arithmetic shows most flagged transactions are legitimate. Intrusion-detection systems drown analysts in false positives for exactly this reason. The general lesson — a detector for a rare event must have an extraordinarily low false-positive rate to be useful — is a base-rate argument, and it shapes how every real classifier's threshold is tuned (the precision/recall tradeoff you may have met in ML is this same Bayes computation wearing different notation).
def disease_posterior(base_rate, sensitivity, false_pos_rate):
"""P(disease | positive) via Bayes + law of total probability."""
p_pos = sensitivity * base_rate + false_pos_rate * (1 - base_rate)
return (sensitivity * base_rate) / p_pos
print(round(disease_posterior(0.001, 0.99, 0.01), 4)) # rare disease
print(round(disease_posterior(0.10, 0.99, 0.01), 4)) # 10% base rate instead
# Expected output:
# 0.0902
# 0.9167
The second line is the punchline: hold the test fixed at 99% and raise the base rate from $0.1\%$ to $10\%$, and the posterior jumps from $9\%$ to about $92\%$. The test didn't change — the prior did. This is also why doctors don't run rare-disease screens on the general population; they first raise the base rate by testing only patients with symptoms, which is itself a conditioning step.
⚠️ Common Pitfall: the prosecutor's fallacy. In court, "the probability of this DNA match by chance is 1 in a million" ($P(\text{match} \mid \text{innocent})$) is routinely misreported as "the probability the defendant is innocent is 1 in a million" ($P(\text{innocent} \mid \text{match})$). These differ by the base rate of guilt in the relevant suspect pool — exactly the disease computation with different labels. The fallacy has led to real wrongful convictions. Conditional probability is not an academic nicety.
🔄 Check Your Understanding 1. Using natural frequencies, why does the rare-disease test produce ~9% even though it is 99% accurate? 2. Which probability is the "base rate" in the disease example, and which is the test's "sensitivity"? 3. If you retest and get a second independent positive, your new prior is the old posterior (~0.09). Without computing exactly, will the second posterior be higher or lower than 0.09?
Answers
- Among 100,000 people there are only ~100 sick (giving ~99 true positives) but ~99,900 healthy (giving ~999 false positives at a 1% error rate); the false positives outnumber the true positives ~10:1, so only ~99/1098 ≈ 9% of positives are real. 2. Base rate = $P(D) = 0.001$; sensitivity = $P(T\mid D) = 0.99$. 3. Higher — the second positive is more evidence; starting from a prior of 0.09 (much larger than 0.001) and applying the same likelihoods pushes the posterior well above 0.09 (it works out to about 0.91). Repeated independent positives compound.
21.5 Naive Bayes classification
Bayes' theorem with one piece of evidence is a calculator trick. Bayes' theorem with many pieces of evidence — many features — is a machine-learning classifier. The canonical CS application is the spam filter, and it is worth building because it shows both the power of Bayes and the precise cost of an independence assumption.
The task: classify an email as spam ($S$) or ham ($\overline{S}$, legitimate) from the words it contains. Let the email's words be $w_1, w_2, \dots, w_n$. We want the posterior probability of spam given the words: $$P(S \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid S)\,P(S)}{P(w_1, \dots, w_n)}.$$ The prior $P(S)$ is easy (the fraction of training mail that was spam). The problem is the likelihood $P(w_1, \dots, w_n \mid S)$ — the probability of seeing that exact combination of words in a spam message. There are astronomically many word combinations; we could never estimate the joint probability of every one from any finite training set. (This is the combinatorial explosion you met when counting: the number of possible word-sets is exponential in the vocabulary size.)
Here is where we import an independence assumption — the "naive" in naive Bayes. We pretend the words appear independently of one another given the class. Then the joint likelihood factors into a product of per-word likelihoods: $$P(w_1, \dots, w_n \mid S) \approx \prod_{i=1}^{n} P(w_i \mid S).$$
Definition (naive Bayes). A naive Bayes classifier predicts the class $c$ maximizing the posterior $P(c \mid \text{features})$, computed via Bayes' theorem under the naive assumption that the features are mutually conditionally independent given the class: $P(f_1, \dots, f_n \mid c) = \prod_i P(f_i \mid c)$. The predicted class is $\arg\max_c\ P(c)\prod_i P(f_i \mid c)$.
The assumption is false — "free" and "money" are obviously correlated in spam, not independent. Yet the classifier works remarkably well in practice, for a subtle reason: we only need the ranking of the two posteriors to be correct (is spam more probable than ham?), and the independence error often distorts both class scores in similar ways, leaving the comparison intact. This is the recurring lesson made concrete: an imported independence assumption can be wrong about the probabilities yet right about the decision.
💡 Intuition: "Naive" Bayes treats each word as casting an independent vote. The word "viagra" votes loudly for spam (high $P(w \mid S)$ relative to $P(w \mid \overline{S})$); the word "meeting" votes for ham. The classifier multiplies all the votes (weighted by the prior) and picks the class with the bigger product. It ignores that "viagra" and "pills" tend to co-occur — that is the naivety, and the price we pay for being able to estimate one word at a time.
A worked classification
Because we only need to compare spam and ham, the shared denominator $P(w_1, \dots, w_n)$ cancels, and we compare the unnormalized scores $P(c)\prod_i P(w_i \mid c)$ for each class. Suppose training gives priors $P(S) = 0.4$, $P(\overline{S}) = 0.6$, and these word likelihoods:
| word | $P(w \mid S)$ | $P(w \mid \overline{S})$ |
|---|---|---|
| free | 0.30 | 0.05 |
| money | 0.20 | 0.04 |
An email contains exactly "free" and "money". Score each class: $$\text{score}(S) = P(S)\,P(\text{free}\mid S)\,P(\text{money}\mid S) = 0.4 \times 0.30 \times 0.20 = 0.0240,$$ $$\text{score}(\overline{S}) = P(\overline{S})\,P(\text{free}\mid \overline{S})\,P(\text{money}\mid \overline{S}) = 0.6 \times 0.05 \times 0.04 = 0.0012.$$ Since $0.0240 > 0.0012$, classify as spam. If you want a calibrated probability, normalize: $P(S \mid \text{words}) = \frac{0.0240}{0.0240 + 0.0012} = \frac{0.0240}{0.0252} \approx 0.952$ — about 95% spam.
def naive_bayes_score(prior, likelihoods, words):
"""Unnormalized class score: prior * product of P(word | class)."""
score = prior
for w in words:
score *= likelihoods[w]
return score
p_word_spam = {"free": 0.30, "money": 0.20}
p_word_ham = {"free": 0.05, "money": 0.04}
email = ["free", "money"]
s = naive_bayes_score(0.4, p_word_spam, email)
h = naive_bayes_score(0.6, p_word_ham, email)
print(round(s, 4), round(h, 4), "spam" if s > h else "ham")
# Expected output:
# 0.024 0.0012 spam
The score for spam, $0.4 \times 0.30 \times 0.20 = 0.024$, beats the score for ham, $0.6 \times 0.05
\times 0.04 = 0.0012$, so the classifier returns spam.
Two production wrinkles worth knowing (both are in the exercises). First, the zero-frequency problem:
if a word never appeared in spam during training, its $P(w \mid S)$ is $0$, and that single zero
annihilates the entire product — one unseen word vetoes everything. The fix is Laplace smoothing:
add a small count to every word so no probability is ever exactly zero. Second, floating-point
underflow: multiplying hundreds of small probabilities drives the product to $0$ in floating point, so
real implementations sum log-probabilities instead, $\log P(c) + \sum_i \log P(w_i \mid c)$, turning
the product into a numerically safe sum (and argmax is unchanged because $\log$ is increasing).
🔗 Connection. Naive Bayes was the workhorse of the first effective spam filters in the early 2000s and is still a strong, fast baseline for text classification. It is the simplest member of a family of probabilistic graphical models; the independence assumptions get progressively relaxed as the models grow more expressive (and more expensive). Knowing exactly which independence you assumed — and that you assumed it conditional on the class — is what lets you reason about when the model will fail.
🔄 Check Your Understanding 1. What exactly is the "naive" assumption, and on what is the independence conditioned? 2. Why can we ignore the denominator $P(w_1, \dots, w_n)$ when classifying? 3. A test email contains a word never seen in training spam. What happens to the spam score, and what fixes it?
Answers
- That the features (words) are mutually independent given the class — i.e., $P(w_1,\dots,w_n\mid c) = \prod_i P(w_i\mid c)$, conditioned on the class label $c$ (spam or ham). 2. It is the same positive constant for both classes, so it does not affect which class has the larger posterior; we only need to compare unnormalized scores (and can renormalize at the end if we want an actual probability). 3. Its $P(w\mid S)=0$ makes the entire spam product $0$, vetoing the class; Laplace (add-one) smoothing ensures every probability is strictly positive.
21.6 Randomized algorithms (and why they work)
So far probability has described uncertain inputs. Now we turn it inward: algorithms that deliberately flip coins themselves. A randomized algorithm makes random choices during execution, so its behavior — running time, or even correctness — is a random variable even on a fixed input. Why would you ever want that? Because randomness can be dramatically simpler and faster than the best deterministic alternative, and conditional probability is exactly the tool for proving it works. There are two flavors:
- A Las Vegas algorithm is always correct but has a random running time (it is fast in expectation). Randomized quicksort is the classic example.
- A Monte Carlo algorithm runs in bounded time but is occasionally wrong with some bounded probability — and you can drive that probability as low as you like by repeating.
Monte Carlo: cheap repetition crushes error
Suppose you have a Monte Carlo test that answers a yes/no question and is wrong with probability at most $p = \frac{1}{2}$ on a "yes" instance (it might miss), but never wrong on a "no" instance (this one-sided setup is common — think of a primality test that never calls a prime "composite"). Run it $k$ times independently and answer "yes" if any run says yes. The combined test is wrong only if every one of the $k$ independent runs missed: $$P(\text{all } k \text{ runs wrong}) = \prod_{i=1}^{k} P(\text{run } i \text{ wrong}) \le \left(\tfrac{1}{2}\right)^{k}.$$ The first equality is independence — the runs use fresh randomness, so they are mutually independent — and the bound is the per-run error. With $k = 20$ runs the error is at most $2^{-20} < 10^{-6}$: less likely than a hardware glitch. Error that halves with every cheap repetition is why Monte Carlo algorithms are practical; this exact argument underlies the Miller–Rabin primality test that real cryptographic libraries use to find the large primes for RSA.
🔗 Connection. This is theme four of the book in algorithmic form. A Monte Carlo algorithm tests — it can be wrong, but you bound the error probability and shrink it by repetition; a proof settles. The primality tests behind the keys in Part IV are Monte Carlo: they don't prove a number prime, they make the probability of error smaller than you could ever measure, which for engineering purposes is enough.
Las Vegas: randomized quicksort is fast in expectation
Quicksort partitions an array around a pivot. If you always pick the first element as pivot, adversarial input (an already-sorted array) forces the worst case: every partition is maximally unbalanced, giving $\Theta(n^2)$ comparisons. Randomized quicksort picks the pivot uniformly at random, which means no fixed input can be its worst case — the worst case now depends on the coin flips, not the input, and a bad sequence of flips is wildly unlikely. The headline result:
Theorem. Randomized quicksort makes $O(n \log n)$ comparisons in expectation on any input of $n$ distinct elements.
The strategy first. We will not re-derive the full $O(n \log n)$ here, but we will prove the core fact that makes it work: the expected total number of comparisons is a sum, over all pairs of elements, of the probability that the pair is ever compared. The two tools are indicator random variables and linearity of expectation from Chapter 20 — linearity is what lets us add up these pairwise probabilities even though the comparison events are highly dependent. This "sum of pairwise probabilities" reframing is the signature move of randomized-algorithm analysis.
Setup. Rename the elements by their sorted order as $z_1 < z_2 < \dots < z_n$. For $i < j$, define
the indicator random variable
$$X_{ij} = \begin{cases} 1 & \text{if } z_i \text{ and } z_j \text{ are ever compared,} \\ 0 &
\text{otherwise.}\end{cases}$$
Every comparison in quicksort is between some pair, and (this is the crucial structural fact) any
particular pair is compared at most once — comparisons only ever happen against a pivot, and a chosen
pivot is removed from future partitions. So the total number of comparisons is exactly $X = \sum_{i<j}
X_{ij}$, and by **linearity of expectation** (Chapter 20 — it holds *even though the $X_{ij}$ are not
independent*),
$$E[X] = \sum_{i The pairwise probability. Consider the elements $z_i, z_{i+1}, \dots, z_j$ — there are $j - i + 1$ of
them. Here is the key observation: $z_i$ and $z_j$ are compared if and only if one of them is chosen
as a pivot before any element strictly between them is. Why? As long as the chosen pivots all fall
outside the range $[z_i, z_j]$, the whole block stays together in one partition. The first pivot chosen
from within the block $\{z_i, \dots, z_j\}$ decides their fate: if it is $z_i$ or $z_j$, those two get
compared (pivot against everything in the block); if it is some $z_m$ with $i < m < j$, then $z_i$ and
$z_j$ land in different partitions and are never compared afterward. By symmetry every element of the
$(j-i+1)$-element block is equally likely to be the first pivot chosen from it, so
$$P(z_i \text{ and } z_j \text{ compared}) = \frac{2}{j - i + 1}$$
(the favorable cases being "$z_i$ first" or "$z_j$ first" — two out of $j - i + 1$). Summing,
$$E[X] = \sum_{i 💡 Intuition: Randomization doesn't make quicksort fast on average over inputs — it makes it fast
in expectation over its own coin flips, on every input. That distinction matters: an adversary who
sees your code cannot craft a worst-case input, because the worst case now lives in your random
number generator, which the adversary cannot control. Randomness buys you robustness against
adversarial inputs — a recurring reason to reach for it. We can confirm the spirit of the analysis by simulation, using the On the sorted input that would force $\Theta(n^2) = 2500$ comparisons for a fixed first-element pivot,
the random-pivot average is far below $2500$ (empirically a few hundred, consistent with $n \log_2 n
\approx 50 \times 5.6 \approx 282$); the `True` confirms the average is well under the $n^2$ worst case.
The simulation tests the theorem; the indicator-variable argument proves it. 🔗 Connection (Fibonacci, lightly). Randomized data structures lean on the same expectation
machinery: a skip list flips a coin at each insertion to decide an element's height, and its expected
search cost is $O(\log n)$ by an argument with the same shape as the one above. Even the golden ratio
behind Fibonacci (Chapter 19) reappears in randomized settings — Fibonacci hashing multiplies keys by
the fractional part of the golden ratio to scatter them uniformly, a deterministic trick that mimics
what randomization buys. Probability and the book's running threads keep meeting. 🔄 Check Your Understanding
1. Distinguish a Las Vegas algorithm from a Monte Carlo algorithm.
2. In the quicksort analysis, why is linearity of expectation essential — what would go wrong if we
needed the $X_{ij}$ to be independent?
3. Why does random pivoting defeat an adversary who knows your source code? This chapter completes the Add to We hand-trace the call: How this builds toward the capstone. Toolkit so far: Conditional probability is the mathematics of updating belief in light of evidence. The reference facts: Core definitions Key distinctions to keep straight Applications built Toolkit additions ( Answer from memory before expanding. These revisit Chapters 20 and 16. We have spent two chapters quantifying uncertainty; from here the book turns to a domain where
certainty is the whole point. Part IV opens with number theory — divisibility, primes, and the
Euclidean algorithm — and builds, chapter by chapter, toward a working implementation of RSA, the
public-key cryptosystem that secures the internet. The connection back to this chapter is closer than it
looks: RSA's key generation depends on finding enormous prime numbers, and the only practical way to do
that at scale is a Monte Carlo primality test — the randomized, occasionally-wrong-but-overwhelmingly-likely-right
algorithm whose error analysis you just learned. Probability does not vanish when we reach cryptography;
it quietly makes the keys. Chapter 22 begins with the most basic relation between integers: divisibility.
simulate helper from Chapter 20's
probability.py (we count comparisons over many random pivot sequences and average):import random
def quicksort_comparisons(arr):
"""Sort by random-pivot quicksort; return the comparison count."""
if len(arr) <= 1:
return 0
pivot = arr[random.randrange(len(arr))]
less = [x for x in arr if x < pivot]
equal = [x for x in arr if x == pivot]
greater = [x for x in arr if x > pivot]
here = len(arr) - 1 # this partition compares pivot to the other n-1 elements
return here + quicksort_comparisons(less) + quicksort_comparisons(greater)
# Average over many runs on the SAME sorted input (the deterministic worst case):
data = list(range(50))
avg = sum(quicksort_comparisons(data[:]) for _ in range(1000)) / 1000
print(50 * 50, avg < 50 * 50) # n^2 = 2500 vs. the average; expect avg far below
# Expected output:
# 2500 True
Answers
Project Checkpoint:
bayes() for the Toolkitprobability.py module (begun in Chapter 20 with simulate and
expected_value) by adding the Bayesian update at the heart of every classifier and detector you
built above. We add bayes(pa, pb_given_a, pb) with the canonical signature, plus a convenience that
computes the evidence term via the law of total probability so callers don't have to.dmtoolkit/probability.py:def bayes(pa, pb_given_a, pb):
"""Posterior P(A | B) = P(B|A) * P(A) / P(B).
pa = prior P(A); pb_given_a = likelihood P(B|A); pb = evidence P(B)."""
if pb == 0:
raise ValueError("P(B) must be positive to condition on B")
return pb_given_a * pa / pb
def total_probability(pa, pb_given_a, pb_given_not_a):
"""Evidence P(B) = P(B|A)P(A) + P(B|~A)P(~A), the Bayes denominator."""
return pb_given_a * pa + pb_given_not_a * (1 - pa)
def bayes_update(pa, pb_given_a, pb_given_not_a):
"""Posterior P(A|B) computing the denominator for you (two-hypothesis case)."""
pb = total_probability(pa, pb_given_a, pb_given_not_a)
return bayes(pa, pb_given_a, pb)
# The rare-disease test from 21.4: base rate 0.1%, 99% sensitivity, 1% false-positive rate
print(round(bayes_update(0.001, 0.99, 0.01), 4))
# Expected output:
# 0.0902
total_probability(0.001, 0.99, 0.01) $= 0.99(0.001) + 0.01(0.999) = 0.00099 +
0.00999 = 0.01098$; then `bayes(0.001, 0.99, 0.01098)` $= 0.99 \times 0.001 / 0.01098 = 0.00099 /
0.01098 \approx 0.0902$. The guard against pb == 0 enforces the definition's requirement that you
cannot condition on an impossible event.bayes and bayes_update are the inference core of any
probabilistic component in your capstone — a spam classifier (Track-style text analysis), a randomized
primality check feeding the RSA key generator of Part IV, or any detector whose reliability you must
honestly report (the base-rate computation tells a user what a positive result actually means).
Combined with simulate from Chapter 20, your probability.py can now both estimate a probability by
Monte Carlo and update it with Bayes — the two halves of practical probabilistic reasoning.
logic.py (Ch. 1–3), sets.py (Ch. 8), relations.py (Ch. 12–13),
combinatorics.py (Ch. 15–17), recurrences.py (Ch. 6, 18–19), and now a complete probability.py
(Ch. 20–21) with simulation, expectation, and Bayesian updating.
Summary
Concept
Definition
Read it as
Conditional probability
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$, $P(B)>0$
"shrink the sample space to $B$, renormalize"
Multiplication rule
$P(A \cap B) = P(A\mid B)P(B) = P(B\mid A)P(A)$
the definition, rearranged
Independence
$P(A \cap B) = P(A)P(B)$
evidence carries no information: $P(A\mid B)=P(A)$
Law of total probability
$P(B) = P(B\mid A)P(A) + P(B\mid \overline{A})P(\overline{A})$
sum over a partition of cases
Bayes' theorem
$P(A \mid B) = \dfrac{P(B\mid A)P(A)}{P(B)}$
$\text{posterior} \propto \text{likelihood}\times\text{prior}$
probability.py): bayes(pa, pb_given_a, pb),
total_probability(pa, pb_given_a, pb_given_not_a), bayes_update(pa, pb_given_a, pb_given_not_a).
Spaced Review
Answers
What's Next