Exercises: Conditional Probability and Bayes' Theorem
These build from mechanical warm-ups to full derivations, code, and modeling. Difficulty: ⭐ foundational,
⭐⭐ intermediate, ⭐⭐⭐ challenging. Worked solutions to the starred-with-a-dagger (†) and odd-numbered
problems are in the appendix answers-to-selected.md; do them before you peek. Throughout, keep the
chapter's three confusions in mind: $P(A \mid B) \ne P(B \mid A)$, independent $\ne$ mutually exclusive,
and accuracy (a likelihood) $\ne$ diagnosis (a posterior). For every Bayes problem, the most reliable habit
is to write down the prior, the likelihood, and the reverse likelihood before you reach for a formula.
Part A — Warm-ups (⭐)
21.1 A standard 52-card deck is well shuffled and one card is drawn. Compute $P(\text{king} \mid \text{face card})$, where the face cards are the jacks, queens, and kings. Show the shrunken sample space you are conditioning on, and contrast the answer with the unconditional $P(\text{king}) = \frac{4}{52}$.
21.2 † Use the corrected SaaS joint table from §21.1:
| Converted $C$ | Not converted $\overline{C}$ | Row total | |
|---|---|---|---|
| Paid $A$ | 0.10 | 0.30 | 0.40 |
| Organic $\overline{A}$ | 0.15 | 0.45 | 0.60 |
Compute (a) $P(C \mid \overline{A})$, (b) $P(\overline{A} \mid C)$, and (c) confirm in one sentence why your two answers differ even though both involve organic users who converted.
21.3 State the multiplication rule in both of its forms, and explain in one line why the two products are equal. Then use it to compute $P(A \cap B)$ given $P(A) = 0.3$ and $P(B \mid A) = 0.5$.
21.4 † Two events satisfy $P(A) = 0.5$, $P(B) = 0.4$, and $P(A \cap B) = 0.2$. Are $A$ and $B$ independent? Show the product test, then also verify your answer using the conditional form $P(A \mid B) \stackrel{?}{=} P(A)$.
21.5 A fair six-sided die is rolled once. Let $A = \{\text{even}\}$ and $B = \{2, 3\}$. Compute $P(A \mid B)$ and $P(B \mid A)$ directly from the sample space, verify they are different, and decide whether $A$ and $B$ are independent.
21.6 † In the rare-disease example of §21.4, name which of these three numbers is the base rate, which is the sensitivity, and which is the posterior: $0.001$, $0.99$, $0.0902$. In one sentence, say which number a patient who just tested positive actually cares about, and why it is not the test's accuracy.
21.7 A bag has 3 red and 5 blue marbles. You draw two marbles without replacement. Using the multiplication rule, compute $P(\text{both red}) = P(\text{first red})\,P(\text{second red} \mid \text{first red})$. Then explain in one line why "without replacement" makes the two draws dependent.
Part B — Prove it (⭐⭐)
21.8 † (Scaffolded — fill in the missing steps.) Prove the multiplication rule for three events: for events with $P(A \cap B) > 0$, $$P(A \cap B \cap C) = P(C \mid A \cap B)\,P(B \mid A)\,P(A).$$ Fill in each blank, justifying with the definition of conditional probability. - Start from the definition applied to the event $A \cap B$ as the conditioning set: $P(C \mid A \cap B) = \dfrac{P(A \cap B \cap C)}{\underline{\hphantom{xxxxxx}}}$, so $P(A \cap B \cap C) = P(C \mid A \cap B)\cdot \underline{\hphantom{xxxxxx}}$. - Now expand that last factor with the two-event multiplication rule: $P(A \cap B) = \underline{\hphantom{xxxxxx}}\cdot P(A)$. - Substitute to obtain $P(A \cap B \cap C) = P(C \mid A \cap B)\,P(B \mid A)\,P(A)$, as claimed. $\blacksquare$
21.9 Prove that if $A$ and $B$ are independent, then $A$ and $\overline{B}$ are also independent. (Start from $P(A \cap \overline{B}) = P(A) - P(A \cap B)$ and use the product form.)
21.10 † Prove that if $P(A \mid B) > P(A)$ (the evidence $B$ raises your belief in $A$), then $P(B \mid A) > P(B)$ as well — "relevance is symmetric." Use the multiplication rule.
21.11 Prove that two events $A$ and $B$ with $0 < P(A) < 1$ and $0 < P(B) < 1$ cannot be both mutually exclusive and independent. (Assume both and derive a contradiction.)
21.12 † Prove the law of total probability for three cases: if $A_1, A_2, A_3$ partition the sample space (pairwise disjoint, union is everything) and each has positive probability, then $$P(B) = P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + P(B \mid A_3)P(A_3).$$ Mirror the two-case proof from §21.3: slice $B$ along the partition, use additivity, then the multiplication rule on each slice.
21.13 Derive the general $n$-case Bayes' theorem $P(A_j \mid B) = \dfrac{P(B \mid A_j)P(A_j)}{\sum_{i=1}^{n} P(B \mid A_i)P(A_i)}$ from the definition of conditional probability and the law of total probability you proved in 21.12.
21.14 † Prove that conditioning preserves the axioms: fix an event $B$ with $P(B) > 0$ and define $Q(A) = P(A \mid B)$. Show that (i) $Q(A) \ge 0$ for all $A$, (ii) $Q(S) = 1$ where $S$ is the whole sample space, and (iii) $Q(A_1 \cup A_2) = Q(A_1) + Q(A_2)$ for disjoint $A_1, A_2$. (This is why "the world where $B$ happened" is itself a legitimate probability space — §21.1's "zoom in" picture made rigorous.)
Part C — Implement it (⭐⭐)
Write Python for each. Do not run it on the grader's machine — hand-trace and include an
# Expected output: comment, matching the book's convention. Reuse the chapter's bayes,
total_probability, and bayes_update signatures where relevant.
21.15 † Write posterior_after_two_tests(base_rate, sensitivity, false_pos_rate) that returns the
probability of disease after two independent positive tests, by computing the first posterior with
bayes_update, then feeding it back in as the new prior for a second update. Hand-trace it on the
rare-disease numbers $(0.001, 0.99, 0.01)$ and state whether the second positive pushes belief above or
below the first posterior.
21.16 Implement is_independent_table(joint) that takes a joint distribution as a dictionary keyed by
(row_value, col_value) pairs (like the SaaS table) and returns True iff the two attributes are
independent — i.e., every joint cell equals the product of its row and column marginals (use a tolerance).
Test it on the §21.2 independent table and the dependent counterexample.
21.17 † Implement naive_bayes_predict(priors, likelihoods, features) that returns the predicted
class label, where priors maps each class to $P(c)$ and likelihoods maps each class to a dict of
$P(f \mid c)$. Use the unnormalized score $P(c)\prod_i P(f_i \mid c)$ and max(..., key=...). Trace it on
the two-word spam/ham example from §21.5.
21.18 Implement monte_carlo_error(per_run_error, k) returning the worst-case combined error
$p^{k}$ of running a one-sided Monte Carlo test $k$ times independently. Then write three lines that find
the smallest $k$ making the error below $10^{-6}$ when $p = \tfrac{1}{2}$, and report that $k$.
21.19 † Implement conditional_from_counts(counts, given, target) that computes a conditional
probability from a contingency table of raw counts (a dict keyed by (row, col) mapping to integer
counts, not probabilities). It should return (count of rows matching both given and target) / (count
of rows matching given). Test it on a small $2\times2$ count table you write out, and confirm it agrees
with dividing the corresponding probability cells.
Part D — Find the error (⭐⭐)
Each "argument" below is wrong. State precisely which step fails and why.
21.20 Claim: "Our spam test is 99% accurate, so if an email is flagged, there is a 99% chance it is spam." "Reasoning": the test's accuracy is $P(\text{flag} \mid \text{spam}) = 0.99$, and the email was flagged, so $P(\text{spam} \mid \text{flag}) = 0.99$. — Identify the fallacy by name and explain what extra quantity is needed to actually compute $P(\text{spam} \mid \text{flag})$.
21.21 † Claim: "Events $A$ and $B$ are disjoint, and disjoint events don't influence each other, so they're independent." — Find the error and give the one-line product-test counterexample using $P(A) = P(B) = 0.3$.
21.22 "Proof" that any positive test means disease. "By Bayes, $P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)}$. Since $P(T \mid D) = 1$ for a perfect-sensitivity test, $P(D \mid T) = \frac{P(D)}{P(T)}$, and because every diseased person tests positive, $P(T) = P(D)$, so $P(D \mid T) = 1$." — Find the false step. (Hint: is $P(T) = P(D)$ correct when healthy people can also test positive?)
21.23 † Buggy naive Bayes. A student's classifier computes the spam score as
prior + sum(likelihoods[w] for w in words) (a sum, not a product). Explain why this is not naive Bayes,
what probabilistic quantity (if any) it corresponds to, and what concrete misclassification it could
cause relative to the correct product rule.
21.24 "Proof" that independence is transitive. "If $A$ is independent of $B$, and $B$ is independent of $C$, then $A$ is independent of $C$." Give a concrete three-event counterexample (a die or two coins will do) showing independence is not transitive, and identify the unstated assumption the bogus claim relies on.
Part E — Model it (⭐⭐⭐)
21.25 † (Model it.) A content platform wants a "clickbait detector." Editors label 5% of headlines as clickbait. In the labeled training set, the word "shocking" appears in 60% of clickbait headlines and in 4% of normal headlines. A new headline contains "shocking." Translate this scenario into discrete-math objects: name the events, identify which numbers are the prior, the two likelihoods, and what you are asked to compute; then write (do not yet evaluate) the exact Bayes expression for $P(\text{clickbait} \mid \text{"shocking"})$.
21.26 (Model it.) A load balancer routes each request to one of three servers with probabilities $0.5, 0.3, 0.2$. Those servers fail to respond with probabilities $0.01, 0.02, 0.05$ respectively. Model "which server handled it" as a three-case partition and "the request failed" as an event $B$. Write the law-of-total-probability expression for $P(B)$, and write the Bayes expression for $P(\text{server 3} \mid \text{failed})$ — the "which server is to blame?" question. (You will compute these in 21.34.)
21.27 † (Model it.) You are designing a Monte Carlo primality test that, on a composite input, correctly reports "composite" with probability at least $\tfrac{3}{4}$ per round, and on a prime input never errs. You want the probability of wrongly calling a composite "prime" to be below $2^{-80}$ (a common cryptographic target). Model the repeated-test error and write the inequality in $k$ you must solve; then state, to the nearest integer, how many rounds $k$ suffice. (Connects to the RSA thread: this is exactly how key generation in Part IV finds large primes.)
21.28 (Model it — Monty Hall.) On a game show, a prize is behind one of three doors. You pick door 1. The host, who knows where the prize is, opens a different door (2 or 3) that is empty, then offers you the chance to switch. Model the sample space and the host's behavior, define the events precisely, and use conditional probability to compute $P(\text{prize behind door 1} \mid \text{host opened door 3})$ and the probability you win if you switch. Explain why the host's action is informative even though he always opens an empty door.
Part F — Conjecture and test (⭐⭐⭐)
21.29 † (Conjecture and test, then prove.) Build a three-event example to investigate whether pairwise independence implies mutual independence. Toss two fair coins; let $A$ = "first is heads," $B$ = "second is heads," $C$ = "the two coins match (both heads or both tails)." Compute, in code over the four equally likely outcomes, all three pairwise products and the triple product. Conjecture from the output whether the three events are pairwise independent and whether they are mutually independent, then prove your conjecture by hand. What does this say about the difference between the two notions (§21.2)?
21.30 (Conjecture and test.) Write code that, for the rare-disease test with sensitivity $0.99$ and false-positive rate $0.01$, computes the posterior $P(D \mid \text{positive})$ as the base rate sweeps over $0.001, 0.01, 0.05, 0.10, 0.50$. Conjecture the base rate at which the posterior first exceeds $\tfrac{1}{2}$ ("more likely sick than not"), then solve for the exact crossover base rate by setting the Bayes expression equal to $\tfrac{1}{2}$ and confirm it matches your numerical sweep.
21.31 † (Conjecture and test.) Using a random-pivot quicksort comparison-counter (the
quicksort_comparisons function from §21.6), average the comparison count over 1000 runs on a sorted
input of size $n$ for $n = 10, 20, 40, 80$. Conjecture how the average grows relative to $n^2$ and
relative to $n \log_2 n$. Which growth rate do the empirical numbers support, and how does that match the
chapter's $O(n \log n)$ theorem?
21.32 (Conjecture and test.) The "birthday" flavor of conditioning: write code to estimate, by simulation over many random groups, the probability that at least two of $k$ people share a birthday, and separately compute it exactly via the conditional product $1 - \prod_{i=0}^{k-1}\frac{365-i}{365}$ (each new person's birthday conditioned on avoiding all previous ones). Conjecture the smallest $k$ for which the probability exceeds $\tfrac{1}{2}$, then confirm it from the exact product. Why is the conditional product the natural way to compute this?
Part G — Interleaved & Deep Dive
Mixing techniques from earlier chapters keeps them sharp.
21.33 † (Ch. 20 + 21.) A random variable $X$ is the number shown on a fair die. Let $B$ = "the roll is even." Compute the conditional expectation $E[X \mid B]$ — the expected value of $X$ restricted to the conditioning event — directly from the shrunken sample space, and compare it to the unconditional $E[X]$ from Chapter 20.
21.34 (Ch. 16 + 21.) A bit string of length 8 is chosen uniformly at random. Let $A$ = "the string has exactly four 1s" and $B$ = "the first bit is 1." Using the counting techniques of Chapter 16 ($\binom{n}{k}$), compute $P(A)$, $P(A \cap B)$, and then $P(A \mid B)$. Are $A$ and $B$ independent?
21.35 † (Ch. 20 + 21.) In the randomized-quicksort analysis the total comparison count is
$X = \sum_{i
21.36 (Ch. 16 + 21, Deep Dive.) Generalize the naive Bayes likelihood count. If a message is modeled by the presence/absence of each of $v$ vocabulary words, how many distinct feature vectors are there, and how many free joint probabilities would you need to estimate without the independence assumption? How many do you need with it? Explain in one paragraph why this exponential-vs-linear gap (a counting fact from Chapter 16) is the entire practical justification for the "naive" assumption.
21.37 † (Deep Dive — finish the model from 21.26.) Evaluate the load-balancer model: compute $P(B)$ by the law of total probability and then $P(\text{server 3} \mid \text{failed})$ by Bayes. Comment on why the least-used server can still be the most likely culprit given a failure — relate this to how the base rate (routing probability) and likelihood (failure rate) trade off.
21.38 (Deep Dive.) Prove that for a Monte Carlo algorithm with two-sided error — wrong with probability at most $p < \tfrac{1}{2}$ on every instance — taking the majority vote of $k$ independent runs reduces the error, and explain qualitatively (no Chernoff bound needed) why the one-sided amplification of §21.6 (answer "yes" if any run says yes) is simpler and gives the cleaner $p^{k}$ bound. Which setting does the Miller–Rabin primality test fall into?
Solutions to † and odd-numbered exercises: see appendices/answers-to-selected.md. For each derivation,
the rubric rewards: correctly named events, the right conditioning set (row vs. column total), explicit
use of the law of total probability where the denominator isn't given, and — for the modeling problems —
a clean translation from the English scenario to the probability objects before any arithmetic.