Chapter 21 Key Takeaways — Conditional Probability & Bayes' Theorem

DataField.Dev

Chapter 21 Key Takeaways — Conditional Probability & Bayes' Theorem

A dense, skimmable reference. Reread this before an exam or before reaching for Bayes in code.

Core formulas

Name	Formula	Conditions
Conditional probability	$P(A\mid B) = \dfrac{P(A\cap B)}{P(B)}$	$P(B)>0$
Multiplication rule	$P(A\cap B) = P(A\mid B)P(B) = P(B\mid A)P(A)$	—
Independence (product form)	$P(A\cap B) = P(A)P(B)$	definition
Independence (conditional form)	$P(A\mid B) = P(A)$	$P(B)>0$
Law of total probability (2 cases)	$P(B) = P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})$	$0
Law of total probability ($n$ cases)	$P(B) = \sum_{i=1}^{n} P(B\mid A_i)P(A_i)$	$A_i$ partition the space
Bayes' theorem	$P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B)}$	$P(B)>0$
Bayes (expanded denominator)	$P(A\mid B) = \dfrac{P(B\mid A)P(A)}{P(B\mid A)P(A) + P(B\mid\overline{A})P(\overline{A})}$	$0

Bayes vocabulary: prior $P(A)$ · likelihood $P(B\mid A)$ · evidence $P(B)$ · posterior $P(A\mid B)$. Mnemonic: $\text{posterior} \propto \text{likelihood}\times\text{prior}$. The numerator is always one term of the (expanded) denominator.

The three confusions to never make

Looks the same	Is actually	Test
$P(A\mid B)$ vs. $P(B\mid A)$	different numbers (same numerator $P(A\cap B)$, different denominator)	divide by row total vs. column total
independent vs. mutually exclusive	near opposites	indep: $P(A\cap B)=P(A)P(B)$; disjoint: $P(A\cap B)=0$
test accuracy vs. diagnosis	likelihood $P(T\mid D)$ vs. posterior $P(D\mid T)$	bridge them with the base rate $P(D)$

Disjoint + both positive probability $\Rightarrow$ dependent (one occurring rules the other out).

Base-rate fallacy (decision rule)

Symptom: treating $P(D\mid T)$ as if it equals the test's accuracy $P(T\mid D)$ for a rare $D$.

Quantify it two ways:

Bayes: $P(D\mid T) = \dfrac{\text{sens}\cdot\text{base}}{\text{sens}\cdot\text{base} + \text{fpr}\cdot(1-\text{base})}$.
Natural frequencies: imagine 100,000 people; count true positives vs. false positives directly.

Reference numbers (sensitivity $=0.99$, false-positive rate $=0.01$):

Base rate $P(D)$	$P(D\mid \text{positive})$
0.001	≈ 0.09
0.01	≈ 0.50
0.10	≈ 0.92

Rule of thumb: a detector for a rare event needs an extraordinarily low false-positive rate, or its positives are mostly false alarms. (Same math: fraud detection, intrusion detection, the prosecutor's fallacy.)

Naive Bayes (decision procedure)

$$\text{predict } \arg\max_{c}\ \ P(c)\prod_{i} P(f_i\mid c).$$

Step	What to do
Priors	$P(c)$ = fraction of training data in class $c$
Likelihoods	$P(f_i\mid c)$ = per-feature frequency within class $c$
Score (unnormalized)	multiply prior by all per-feature likelihoods, per class
Decide	pick the class with the larger score (denominator cancels)
Calibrate (optional)	normalize: score / (sum of scores)

"Naive" = features assumed mutually independent given the class ($P(f_1,\dots,f_n\mid c)=\prod_i P(f_i\mid c)$). Usually false, often still right about the ranking.

Pitfall	Cause	Fix
Zero-frequency	an unseen feature has $P(f\mid c)=0$, zeroing the whole product	Laplace (add-one) smoothing
Underflow	product of many small probabilities $\to 0$ in floating point	sum log-probabilities: $\log P(c)+\sum_i\log P(f_i\mid c)$

Randomized algorithms

Type	Correctness	Running time	Example
Las Vegas	always correct	random (fast in expectation)	randomized quicksort
Monte Carlo	may err (bounded prob.)	bounded	Miller–Rabin primality

Monte Carlo amplification: with one-sided error $\le p$ per run and $k$ independent runs, $$P(\text{all } k \text{ wrong}) \le p^{k} \quad\Rightarrow\quad p=\tfrac12,\ k=20 \Rightarrow \text{error} < 10^{-6}.$$

Randomized quicksort (expected comparisons):

Total comparisons $X = \sum_{iat most once).
$E[X] = \sum_{ilinearity of expectation (works despite dependence).
$P(z_i,z_j \text{ compared}) = \dfrac{2}{\,j-i+1\,}$ (one of them must be the first pivot drawn from the $j-i+1$ block).
$\Rightarrow E[X] = O(n\log n)$ on every input. Randomness defeats adversarial inputs: the worst case lives in the RNG, not the data.

Toolkit additions (`probability.py`, Ch. 20–21)

Function	Signature	Returns
Posterior	`bayes(pa, pb_given_a, pb)`	$P(A\mid B)=\dfrac{P(B\mid A)P(A)}{P(B)}$ (raises if `pb==0`)
Evidence	`total_probability(pa, pb_given_a, pb_given_not_a)`	$P(B)$ via law of total probability
One-call update	`bayes_update(pa, pb_given_a, pb_given_not_a)`	posterior, computing $P(B)$ for you

(Builds on Ch. 20's simulate(trial_fn, k) and expected_value(rv, space).)

When to use what

You have…	You want…	Use
Joint table	a conditional	divide the joint cell by the row/column total
Prior + likelihood + reverse likelihood	the posterior	`bayes_update` (Bayes + total probability)
Many features, one label	a class prediction	naive Bayes ($\arg\max$ of prior $\times$ product)
A correct-but-slow worst case from adversarial input	robustness	randomize the choice (Las Vegas)
A hard exact test	a fast answer you can trust	Monte Carlo + repeat to shrink error

Chapter 21 Key Takeaways — Conditional Probability & Bayes' Theorem

Core formulas

The three confusions to never make

Base-rate fallacy (decision rule)

Naive Bayes (decision procedure)

Randomized algorithms

Toolkit additions (probability.py, Ch. 20–21)

When to use what

Toolkit additions (`probability.py`, Ch. 20–21)