Chapter 21 Key Takeaways — Conditional Probability & Bayes' Theorem
A dense, skimmable reference. Reread this before an exam or before reaching for Bayes in code.
Core formulas
| Name | Formula | Conditions |
|---|---|---|
| Conditional probability | $P(A\mid B) = \dfrac{P(A\cap B)}{P(B)}$ | $P(B)>0$ |
| Multiplication rule | $P(A\cap B) = P(A\mid B)P(B) = P(B\mid A)P(A)$ | — |
| Independence (product form) | $P(A\cap B) = P(A)P(B)$ | definition |
| Independence (conditional form) | $P(A\mid B) = P(A)$ | $P(B)>0$ |
| Law of total probability (2 cases) | $P(B) = P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})$ | $0
|
| Law of total probability ($n$ cases) | $P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$ | $A_i$ partition the space |
| Bayes' theorem | $P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B)}$ | $P(B)>0$ |
| Bayes (expanded denominator) | $P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})}$ | $0
|
Bayes vocabulary: prior $P(A)$ · likelihood $P(B\mid A)$ · evidence $P(B)$ · posterior $P(A\mid B)$. Mnemonic: $\text{posterior} \propto \text{likelihood}\times\text{prior}$. The numerator is always one term of the (expanded) denominator.
The three confusions to never make
| Looks the same | Is actually | Test |
|---|---|---|
| $P(A\mid B)$ vs. $P(B\mid A)$ | different numbers (same numerator $P(A\cap B)$, different denominator) | divide by row total vs. column total |
| independent vs. mutually exclusive | near opposites | indep: $P(A\cap B)=P(A)P(B)$; disjoint: $P(A\cap B)=0$ |
| test accuracy vs. diagnosis | likelihood $P(T\mid D)$ vs. posterior $P(D\mid T)$ | bridge them with the base rate $P(D)$ |
Disjoint + both positive probability $\Rightarrow$ dependent (one occurring rules the other out).
Base-rate fallacy (decision rule)
Symptom: treating $P(D\mid T)$ as if it equals the test's accuracy $P(T\mid D)$ for a rare $D$.
Quantify it two ways:
- Bayes: $P(D\mid T) = \dfrac{\text{sens}\cdot\text{base}}{\text{sens}\cdot\text{base} + \text{fpr}\cdot(1-\text{base})}$.
- Natural frequencies: imagine 100,000 people; count true positives vs. false positives directly.
Reference numbers (sensitivity $=0.99$, false-positive rate $=0.01$):
| Base rate $P(D)$ | $P(D\mid \text{positive})$ |
|---|---|
| 0.001 | ≈ 0.09 |
| 0.01 | ≈ 0.50 |
| 0.10 | ≈ 0.92 |
Rule of thumb: a detector for a rare event needs an extraordinarily low false-positive rate, or its positives are mostly false alarms. (Same math: fraud detection, intrusion detection, the prosecutor's fallacy.)
Naive Bayes (decision procedure)
$$\text{predict } \arg\max_{c}\ \ P(c)\prod_{i} P(f_i\mid c).$$
| Step | What to do |
|---|---|
| Priors | $P(c)$ = fraction of training data in class $c$ |
| Likelihoods | $P(f_i\mid c)$ = per-feature frequency within class $c$ |
| Score (unnormalized) | multiply prior by all per-feature likelihoods, per class |
| Decide | pick the class with the larger score (denominator cancels) |
| Calibrate (optional) | normalize: score / (sum of scores) |
"Naive" = features assumed mutually independent given the class ($P(f_1,\dots,f_n\mid c)=\prod_i P(f_i\mid c)$). Usually false, often still right about the ranking.
| Pitfall | Cause | Fix |
|---|---|---|
| Zero-frequency | an unseen feature has $P(f\mid c)=0$, zeroing the whole product | Laplace (add-one) smoothing |
| Underflow | product of many small probabilities $\to 0$ in floating point | sum log-probabilities: $\log P(c)+\sum_i\log P(f_i\mid c)$ |
Randomized algorithms
| Type | Correctness | Running time | Example |
|---|---|---|---|
| Las Vegas | always correct | random (fast in expectation) | randomized quicksort |
| Monte Carlo | may err (bounded prob.) | bounded | Miller–Rabin primality |
Monte Carlo amplification: with one-sided error $\le p$ per run and $k$ independent runs, $$P(\text{all } k \text{ wrong}) \le p^{k} \quad\Rightarrow\quad p=\tfrac12,\ k=20 \Rightarrow \text{error} < 10^{-6}.$$
Randomized quicksort (expected comparisons):
- Total comparisons $X = \sum_{i
at most once). - $E[X] = \sum_{i
linearity of expectation (works despite dependence). - $P(z_i,z_j \text{ compared}) = \dfrac{2}{\,j-i+1\,}$ (one of them must be the first pivot drawn from the $j-i+1$ block).
- $\Rightarrow E[X] = O(n\log n)$ on every input. Randomness defeats adversarial inputs: the worst case lives in the RNG, not the data.
Toolkit additions (probability.py, Ch. 20–21)
| Function | Signature | Returns |
|---|---|---|
| Posterior | bayes(pa, pb_given_a, pb) |
$P(A\mid B)=\dfrac{P(B\mid A)P(A)}{P(B)}$ (raises if pb==0) |
| Evidence | total_probability(pa, pb_given_a, pb_given_not_a) |
$P(B)$ via law of total probability |
| One-call update | bayes_update(pa, pb_given_a, pb_given_not_a) |
posterior, computing $P(B)$ for you |
(Builds on Ch. 20's simulate(trial_fn, k) and expected_value(rv, space).)
When to use what
| You have… | You want… | Use |
|---|---|---|
| Joint table | a conditional | divide the joint cell by the row/column total |
| Prior + likelihood + reverse likelihood | the posterior | bayes_update (Bayes + total probability) |
| Many features, one label | a class prediction | naive Bayes ($\arg\max$ of prior $\times$ product) |
| A correct-but-slow worst case from adversarial input | robustness | randomize the choice (Las Vegas) |
| A hard exact test | a fast answer you can trust | Monte Carlo + repeat to shrink error |