Chapter 21 Key Takeaways — Conditional Probability & Bayes' Theorem

A dense, skimmable reference. Reread this before an exam or before reaching for Bayes in code.


Core formulas

Name Formula Conditions
Conditional probability $P(A\mid B) = \dfrac{P(A\cap B)}{P(B)}$ $P(B)>0$
Multiplication rule $P(A\cap B) = P(A\mid B)P(B) = P(B\mid A)P(A)$
Independence (product form) $P(A\cap B) = P(A)P(B)$ definition
Independence (conditional form) $P(A\mid B) = P(A)$ $P(B)>0$
Law of total probability (2 cases) $P(B) = P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})$ $0
Law of total probability ($n$ cases) $P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$ $A_i$ partition the space
Bayes' theorem $P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B)}$ $P(B)>0$
Bayes (expanded denominator) $P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})}$ $0

Bayes vocabulary: prior $P(A)$ · likelihood $P(B\mid A)$ · evidence $P(B)$ · posterior $P(A\mid B)$. Mnemonic: $\text{posterior} \propto \text{likelihood}\times\text{prior}$. The numerator is always one term of the (expanded) denominator.


The three confusions to never make

Looks the same Is actually Test
$P(A\mid B)$ vs. $P(B\mid A)$ different numbers (same numerator $P(A\cap B)$, different denominator) divide by row total vs. column total
independent vs. mutually exclusive near opposites indep: $P(A\cap B)=P(A)P(B)$; disjoint: $P(A\cap B)=0$
test accuracy vs. diagnosis likelihood $P(T\mid D)$ vs. posterior $P(D\mid T)$ bridge them with the base rate $P(D)$

Disjoint + both positive probability $\Rightarrow$ dependent (one occurring rules the other out).


Base-rate fallacy (decision rule)

Symptom: treating $P(D\mid T)$ as if it equals the test's accuracy $P(T\mid D)$ for a rare $D$.

Quantify it two ways:

  1. Bayes: $P(D\mid T) = \dfrac{\text{sens}\cdot\text{base}}{\text{sens}\cdot\text{base} + \text{fpr}\cdot(1-\text{base})}$.
  2. Natural frequencies: imagine 100,000 people; count true positives vs. false positives directly.

Reference numbers (sensitivity $=0.99$, false-positive rate $=0.01$):

Base rate $P(D)$ $P(D\mid \text{positive})$
0.001 ≈ 0.09
0.01 ≈ 0.50
0.10 ≈ 0.92

Rule of thumb: a detector for a rare event needs an extraordinarily low false-positive rate, or its positives are mostly false alarms. (Same math: fraud detection, intrusion detection, the prosecutor's fallacy.)


Naive Bayes (decision procedure)

$$\text{predict } \arg\max_{c}\ \ P(c)\prod_{i} P(f_i\mid c).$$

Step What to do
Priors $P(c)$ = fraction of training data in class $c$
Likelihoods $P(f_i\mid c)$ = per-feature frequency within class $c$
Score (unnormalized) multiply prior by all per-feature likelihoods, per class
Decide pick the class with the larger score (denominator cancels)
Calibrate (optional) normalize: score / (sum of scores)

"Naive" = features assumed mutually independent given the class ($P(f_1,\dots,f_n\mid c)=\prod_i P(f_i\mid c)$). Usually false, often still right about the ranking.

Pitfall Cause Fix
Zero-frequency an unseen feature has $P(f\mid c)=0$, zeroing the whole product Laplace (add-one) smoothing
Underflow product of many small probabilities $\to 0$ in floating point sum log-probabilities: $\log P(c)+\sum_i\log P(f_i\mid c)$

Randomized algorithms

Type Correctness Running time Example
Las Vegas always correct random (fast in expectation) randomized quicksort
Monte Carlo may err (bounded prob.) bounded Miller–Rabin primality

Monte Carlo amplification: with one-sided error $\le p$ per run and $k$ independent runs, $$P(\text{all } k \text{ wrong}) \le p^{k} \quad\Rightarrow\quad p=\tfrac12,\ k=20 \Rightarrow \text{error} < 10^{-6}.$$

Randomized quicksort (expected comparisons):

  • Total comparisons $X = \sum_{iat most once).
  • $E[X] = \sum_{ilinearity of expectation (works despite dependence).
  • $P(z_i,z_j \text{ compared}) = \dfrac{2}{\,j-i+1\,}$ (one of them must be the first pivot drawn from the $j-i+1$ block).
  • $\Rightarrow E[X] = O(n\log n)$ on every input. Randomness defeats adversarial inputs: the worst case lives in the RNG, not the data.

Toolkit additions (probability.py, Ch. 20–21)

Function Signature Returns
Posterior bayes(pa, pb_given_a, pb) $P(A\mid B)=\dfrac{P(B\mid A)P(A)}{P(B)}$ (raises if pb==0)
Evidence total_probability(pa, pb_given_a, pb_given_not_a) $P(B)$ via law of total probability
One-call update bayes_update(pa, pb_given_a, pb_given_not_a) posterior, computing $P(B)$ for you

(Builds on Ch. 20's simulate(trial_fn, k) and expected_value(rv, space).)


When to use what

You have… You want… Use
Joint table a conditional divide the joint cell by the row/column total
Prior + likelihood + reverse likelihood the posterior bayes_update (Bayes + total probability)
Many features, one label a class prediction naive Bayes ($\arg\max$ of prior $\times$ product)
A correct-but-slow worst case from adversarial input robustness randomize the choice (Las Vegas)
A hard exact test a fast answer you can trust Monte Carlo + repeat to shrink error