Learning Objectives
- Calculate conditional probabilities from tables and real-world scenarios
- Distinguish between P(A|B) and P(B|A) — the 'prosecutor's fallacy'
- Apply Bayes' theorem to update probabilities with new evidence
- Explain why Bayes' theorem matters for medical testing, AI, and decision-making
- Construct tree diagrams for multi-step probability problems
In This Chapter
- Chapter Overview
- 9.1 A Puzzle Before We Start (Productive Struggle)
- 9.2 Conditional Probability: The Basics
- 9.3 The Critical Distinction: P(A|B) vs. P(B|A)
- 9.4 Conditional Probability in Python
- 9.5 Tree Diagrams: Seeing the Branches
- 9.6 Bayes' Theorem: The Formula
- 9.7 The Natural Frequency Approach: Bayes Without the Formula
- 9.8 Medical Testing Vocabulary
- 9.9 Working Through Bayes: Alex's Recommendation Engine
- 9.10 Multiple Updates: Bayes as a Learning Machine
- 9.11 James Washington and Recidivism: Bayes Meets Justice
- 9.12 Bayes' Theorem in Python
- 9.13 Prior and Posterior: The Language of Updating
- 9.14 The Base Rate Fallacy: A Vivid Example
- 9.15 Tree Diagrams: A Complete Worked Example
- 9.16 Independence Revisited Through Conditional Probability
- 9.17 Progressive Project Checkpoint: Conditional Probabilities in Your Dataset
- 9.18 Chapter Summary
- Key Formulas at a Glance
Chapter 9: Conditional Probability and Bayes' Theorem
"When the facts change, I change my mind. What do you do, sir?" — Attributed to John Maynard Keynes
Chapter Overview
Here's a scenario that will make you rethink everything you thought you knew about probability.
You go for a routine health screening. The test is 99% accurate — if you have the disease, it'll catch it 99 times out of 100. The test also has a low false positive rate: only 1% of healthy people incorrectly test positive. Your result comes back positive.
Quick: what's the probability you actually have the disease?
If you said 99% — or even something close to it — you're in good company. Most people answer that way. Most doctors answer that way. But you'd be wrong. Depending on how rare the disease is, your actual probability of having it might be as low as 9%.
Wait, what?
That stunning gap between intuition and reality is the entire reason this chapter exists. The concept that bridges that gap is called Bayes' theorem, and understanding it will fundamentally change how you think about evidence, testing, and decision-making.
In Chapter 8, you learned the basic rules of probability — complement, addition, multiplication. You learned to build contingency tables and calculate joint and marginal probabilities. Those tools are powerful, but they're missing something crucial: they don't account for new information.
In real life, you don't just calculate probabilities in a vacuum. You learn things. A test comes back positive. An email contains certain words. A defendant matches an eyewitness description. And when you learn something new, your probabilities should change. That's what this chapter is about.
In this chapter, you will learn to:
- Calculate conditional probabilities from tables and real-world scenarios
- Distinguish between P(A|B) and P(B|A) — the "prosecutor's fallacy"
- Apply Bayes' theorem to update probabilities with new evidence
- Explain why Bayes' theorem matters for medical testing, AI, and decision-making
- Construct tree diagrams for multi-step probability problems
Fast Track: If you're comfortable with conditional probability notation and can explain why P(A|B) ≠ P(B|A), skim Sections 9.1-9.3 and jump to Section 9.6 (Bayes' Theorem). Complete quiz questions 1, 10, and 17 to verify your understanding.
Deep Dive: After this chapter, read Case Study 1 (medical testing) for a worked example of why screening programs produce so many false alarms, then Case Study 2 (the prosecutor's fallacy) for the chilling story of how confusing P(A|B) and P(B|A) has sent innocent people to prison.
9.1 A Puzzle Before We Start (Productive Struggle)
Before I teach you anything, try this.
The False Positive Puzzle
A disease affects 1 in 1,000 people. A diagnostic test has these characteristics:
- If a person HAS the disease, the test correctly identifies them 99% of the time.
- If a person does NOT have the disease, the test incorrectly flags them 2% of the time.
A randomly selected person tests positive. What is the probability they actually have the disease?
Before you calculate anything, write down your gut estimate. Then try to work through it. Don't worry if you get stuck — that's the whole point.
Most people's gut says somewhere around 95-99%. After all, the test is "99% accurate," right?
We'll solve this problem together in Section 9.6, and the answer will surprise you. For now, let the discomfort sit. That productive struggle — that feeling of this doesn't match my intuition — is the threshold concept of this chapter clicking into place.
9.2 Conditional Probability: The Basics
Let's build up to Bayes' theorem step by step, starting with the most important new idea in this chapter: conditional probability.
What Is Conditional Probability?
In Chapter 8, every probability we calculated was unconditional — we looked at the entire sample space and asked, "What fraction satisfies our criteria?" We calculated P(smoker) by dividing all smokers by the total number of people. We calculated P(respiratory illness) by dividing everyone with a respiratory illness by the total.
But what if you already know something? What if I tell you the person is a smoker and then ask about respiratory illness? That changes things.
Key Concept: Conditional Probability
Conditional probability is the probability of an event occurring given that another event has already occurred or is known to be true.
We write it as $P(A \mid B)$, read as "the probability of A given B."
The vertical bar $\mid$ means "given" or "knowing that." It restricts the universe we're considering.
Here's the crucial intuition: when you condition on B, B becomes your new universe. You're no longer looking at the entire sample space — you're only looking at the portion where B is true, and asking what fraction of that portion also satisfies A.
Calculating Conditional Probability from a Contingency Table
Let's return to Maya's smoking and respiratory illness data from Chapter 8:
| | Has Respiratory Illness | No Respiratory Illness | Row Total |
|---|---|---|---|
| Smoker | 68 | 92 | 160 |
| Non-Smoker | 42 | 298 | 340 |
| Column Total | 110 | 390 | 500 |
In Chapter 8, we calculated unconditional probabilities:
$$P(\text{respiratory illness}) = \frac{110}{500} = 0.22$$
That's the probability that a randomly selected person from this sample has a respiratory illness. No conditions. No restrictions.
But Maya wants to ask a sharper question: among smokers, what fraction has a respiratory illness?
$$P(\text{respiratory illness} \mid \text{smoker}) = \frac{68}{160} = 0.425$$
Look at what happened. When we conditioned on "smoker," the denominator changed from 500 (everyone) to 160 (just smokers). We restricted our universe to the smoker row, then asked what fraction of that row has a respiratory illness.
Now compare this with non-smokers:
$$P(\text{respiratory illness} \mid \text{non-smoker}) = \frac{42}{340} = 0.124$$
Smokers are more than three times as likely to have a respiratory illness compared to non-smokers (42.5% vs. 12.4%). That's the power of conditional probability — it reveals relationships that raw marginal probabilities hide.
The Formal Definition
There's a formula that captures exactly what we just did intuitively:
Mathematical Formulation: Conditional Probability
$$\boxed{P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}}$$
In words: The probability of A given B equals the probability that both A and B occur, divided by the probability that B occurs.
Requirements: $P(B) > 0$ (you can't condition on an impossible event).
Let's verify this with Maya's data:
$$P(\text{illness} \mid \text{smoker}) = \frac{P(\text{illness and smoker})}{P(\text{smoker})} = \frac{68/500}{160/500} = \frac{68}{160} = 0.425 \checkmark$$
The 500s cancel out, leaving us with exactly the calculation we did before: cell count divided by row total.
Reading the Notation
The notation $P(A \mid B)$ can feel intimidating at first. Here's a trick: always read the bar as "among those where..." or "given that..."
- $P(\text{illness} \mid \text{smoker})$ = "Among smokers, what's the probability of illness?"
- $P(\text{smoker} \mid \text{illness})$ = "Among those with illness, what's the probability of being a smoker?"
- $P(\text{pass} \mid \text{studied})$ = "Among students who studied, what's the probability of passing?"
- $P(\text{spam} \mid \text{contains 'free'})$ = "Among emails containing 'free,' what's the probability of being spam?"
Spaced Review 1: Spam Filters (Ch. 1)
Way back in Chapter 1, I told you that spam filters use "a method called Bayes' theorem (Chapter 9) to calculate the likelihood that an email is spam based on which words it contains." We're here now. A spam filter calculates $P(\text{spam} \mid \text{words in email})$ — the conditional probability that an email is spam, given the specific words it contains. In Chapter 8, you saw this referenced again when we discussed how AI systems use probability. Now you have the notation. By the end of this chapter, you'll understand the full machinery. Promise kept.
9.3 The Critical Distinction: P(A|B) vs. P(B|A)
This is the single most important section of this chapter. Read it carefully.
$P(A \mid B)$ and $P(B \mid A)$ are not the same thing. They are almost never equal. Confusing them is one of the most common — and most dangerous — errors in probabilistic reasoning.
Let's see why with Maya's data.
$$P(\text{illness} \mid \text{smoker}) = \frac{68}{160} = 0.425$$
This asks: "If I know someone is a smoker, what's the probability they have a respiratory illness?" Answer: 42.5%.
$$P(\text{smoker} \mid \text{illness}) = \frac{68}{110} = 0.618$$
This asks: "If I know someone has a respiratory illness, what's the probability they're a smoker?" Answer: 61.8%.
These are different questions with different answers. The first restricts to smokers and asks about illness. The second restricts to people with illness and asks about smoking. Different universe, different fraction.
Why This Matters: The Prosecutor's Fallacy
The confusion between $P(A \mid B)$ and $P(B \mid A)$ has a name when it happens in courtrooms: the prosecutor's fallacy. And it has sent innocent people to prison.
Key Term: Prosecutor's Fallacy
The prosecutor's fallacy is the error of confusing $P(\text{evidence} \mid \text{innocent})$ with $P(\text{innocent} \mid \text{evidence})$.
A prosecutor might argue: "The probability of this DNA match occurring by random chance is 1 in 10 million. Therefore, the probability that the defendant is innocent is 1 in 10 million."
But these are completely different probabilities:
- $P(\text{DNA match} \mid \text{innocent})$ = the probability of a coincidental match = 1 in 10 million
- $P(\text{innocent} \mid \text{DNA match})$ = the probability the defendant is innocent, given the match
The second depends on many other factors — like how many people could plausibly be suspects, the strength of other evidence, and the prior probability that this specific person committed the crime. A 1-in-10-million DNA match means something very different when there's one suspect vs. when every person in a country of 300 million was screened.
Here's an everyday example that makes the asymmetry obvious:
- $P(\text{wet streets} \mid \text{rain})$ is very high — nearly 1.0. If it rained, the streets are almost certainly wet.
- $P(\text{rain} \mid \text{wet streets})$ is much lower. The streets could be wet because of a burst water main, a fire hydrant, street cleaning, or morning dew.
Knowing that rain causes wet streets does NOT mean that wet streets prove rain.
More Examples of the Asymmetry
| Statement | Conditional Probability | Its Reverse | Equal? |
|---|---|---|---|
| All dogs are mammals | $P(\text{mammal} \mid \text{dog}) = 1.0$ | $P(\text{dog} \mid \text{mammal})$ is very small | No |
| Most spam emails contain "free" | $P(\text{contains 'free'} \mid \text{spam})$ is high | $P(\text{spam} \mid \text{contains 'free'})$ is moderate | No |
| Most NBA players are tall | $P(\text{tall} \mid \text{NBA player})$ is very high | $P(\text{NBA player} \mid \text{tall})$ is very low | No |
The pattern: when one group is much larger or smaller than the other, $P(A \mid B)$ and $P(B \mid A)$ can be wildly different.
Professor Washington's Data
Let's revisit Professor Washington's predictive policing data from Chapter 8:
| | Re-Offense | No Re-Offense | Total |
|---|---|---|---|
| High Risk | 120 | 180 | 300 |
| Low Risk | 40 | 460 | 500 |
| Total | 160 | 640 | 800 |

Washington calculates two conditional probabilities:
$$P(\text{re-offense} \mid \text{high risk}) = \frac{120}{300} = 0.40$$
$$P(\text{high risk} \mid \text{re-offense}) = \frac{120}{160} = 0.75$$
The first says: among people flagged as high risk, 40% actually re-offended. The second says: among people who re-offended, 75% had been flagged as high risk.
A policymaker citing the second number (75%) might argue the algorithm is great at identifying future offenders. But Washington points out the first number (40%) tells a different story: 60% of people flagged as "high risk" never re-offended. Those are people whose freedoms might be restricted based on a label that was wrong more often than it was right.
Which number you emphasize is not just a mathematical question — it's an ethical one.
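The two conditionals differ only in their denominator, which a few lines of Python make explicit. This is a sketch using the counts from Washington's table; the variable names are mine:

```python
# Counts from Washington's contingency table
high_risk_reoffense = 120   # flagged high risk AND re-offended
high_risk_total = 300       # everyone flagged high risk
reoffense_total = 160       # everyone who re-offended

# Same numerator, different denominators
p_reoffense_given_high = high_risk_reoffense / high_risk_total
p_high_given_reoffense = high_risk_reoffense / reoffense_total

print(f"P(re-offense | high risk) = {p_reoffense_given_high:.2f}")  # 0.40
print(f"P(high risk | re-offense) = {p_high_given_reoffense:.2f}")  # 0.75
```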
9.4 Conditional Probability in Python
Let's compute conditional probabilities from data using pandas. This builds directly on the contingency table skills you practiced in Chapter 8.
```python
import pandas as pd

# Recreate Maya's health data
smoking_status = (['Smoker'] * 160) + (['Non-Smoker'] * 340)
respiratory = (['Yes'] * 68 + ['No'] * 92 +      # Smokers
               ['Yes'] * 42 + ['No'] * 298)      # Non-Smokers

health_df = pd.DataFrame({
    'smoking_status': smoking_status,
    'respiratory_illness': respiratory
})

# --- Method 1: Using pd.crosstab with normalize='index' ---
# This gives row-wise proportions = conditional probabilities
cond_probs = pd.crosstab(
    health_df['smoking_status'],
    health_df['respiratory_illness'],
    normalize='index'   # Normalize within each ROW
)
print("P(illness | smoking status):")
print(cond_probs.round(4))
print()

# --- Method 2: Using .groupby() ---
# Group by smoking status, then calculate the proportion with illness
illness_rate = (health_df
                .groupby('smoking_status')['respiratory_illness']
                .value_counts(normalize=True)
                .unstack()
                .round(4))
print("Conditional probabilities via groupby:")
print(illness_rate)
print()

# --- Method 3: Direct calculation ---
# P(illness | smoker) = count(illness AND smoker) / count(smoker)
smokers = health_df[health_df['smoking_status'] == 'Smoker']
p_illness_given_smoker = (smokers['respiratory_illness'] == 'Yes').mean()
print(f"P(illness | smoker) = {p_illness_given_smoker:.4f}")

nonsmokers = health_df[health_df['smoking_status'] == 'Non-Smoker']
p_illness_given_nonsmoker = (nonsmokers['respiratory_illness'] == 'Yes').mean()
print(f"P(illness | non-smoker) = {p_illness_given_nonsmoker:.4f}")
print()

# --- Demonstrate the asymmetry: P(A|B) ≠ P(B|A) ---
ill = health_df[health_df['respiratory_illness'] == 'Yes']
p_smoker_given_illness = (ill['smoking_status'] == 'Smoker').mean()
print(f"P(smoker | illness) = {p_smoker_given_illness:.4f}")
print(f"P(illness | smoker) = {p_illness_given_smoker:.4f}")
print("These are NOT the same!")
```
Spaced Review 2: Interpreting Percentages (Ch. 6)
In Chapter 6, you learned to be careful with numerical summaries — that a mean can hide important patterns in the data. The same caution applies here. When someone tells you "42.5% of smokers have respiratory illness," ask yourself: 42.5% of how many? The percentage is a conditional probability, and its meaning depends on the size of the group you're conditioning on. In Chapter 6, we saw that identical means could mask very different distributions. Here, identical percentages can mask very different group sizes. Always ask about the denominator.
9.5 Tree Diagrams: Seeing the Branches
Before we tackle Bayes' theorem with a formula, I want to give you a visual tool that makes the logic intuitive. It's called a tree diagram, and it's the single most useful device for solving conditional probability problems.
Key Term: Tree Diagram
A tree diagram is a visual representation of a multi-step probability problem. Each "branch" represents one possible outcome, and the probability is written along the branch. To find the probability of any complete path, you multiply along the branches.
Building a Tree Diagram: Maya's Disease Screening
Let's set up a simplified version of Maya's disease screening problem. Suppose:
- 1% of people in the county have a particular respiratory disease (this is the prevalence or base rate).
- If someone HAS the disease, the test detects it 95% of the time (this is the test's sensitivity).
- If someone does NOT have the disease, the test correctly returns negative 90% of the time (this is the test's specificity).
Here's the tree diagram:
```
                    Population
                   /          \
            Disease            No Disease
            P = 0.01           P = 0.99
           /       \           /        \
       Test+      Test-    Test+       Test-
       P=0.95     P=0.05   P=0.10      P=0.90
         |          |        |           |
     True Pos  False Neg  False Pos  True Neg
     0.01×0.95 0.01×0.05  0.99×0.10  0.99×0.90
     =0.0095   =0.0005    =0.099     =0.891
```
Let's unpack each path:
- True Positive (has disease AND tests positive): $0.01 \times 0.95 = 0.0095$
- False Negative (has disease AND tests negative): $0.01 \times 0.05 = 0.0005$
- False Positive (no disease AND tests positive): $0.99 \times 0.10 = 0.099$
- True Negative (no disease AND tests negative): $0.99 \times 0.90 = 0.891$
Check: $0.0095 + 0.0005 + 0.099 + 0.891 = 1.000$ ✓ (All paths sum to 1.)
Now here's the key question: If someone tests positive, what's the probability they actually have the disease?
Positive results come from two paths: true positives (0.0095) and false positives (0.099).
$$P(\text{disease} \mid \text{positive}) = \frac{P(\text{true positive})}{P(\text{true positive}) + P(\text{false positive})} = \frac{0.0095}{0.0095 + 0.099} = \frac{0.0095}{0.1085} \approx 0.0876$$
About 8.8%. A person who tests positive has less than a 9% chance of actually having the disease.
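To double-check the tree's arithmetic, here's a minimal Python sketch that multiplies along each branch and then conditions on a positive result. The variable names are my own:

```python
# Branch probabilities: 1% prevalence, 95% sensitivity, 90% specificity
prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

true_pos  = prevalence * sensitivity              # has disease AND tests positive
false_neg = prevalence * (1 - sensitivity)        # has disease AND tests negative
false_pos = (1 - prevalence) * (1 - specificity)  # no disease AND tests positive
true_neg  = (1 - prevalence) * specificity        # no disease AND tests negative

# Sanity check: the four complete paths cover every outcome exactly once
assert abs(true_pos + false_neg + false_pos + true_neg - 1.0) < 1e-12

# Condition on "tested positive": restrict to the two positive paths
p_disease_given_pos = true_pos / (true_pos + false_pos)
print(f"P(disease | positive) = {p_disease_given_pos:.4f}")  # 0.0876
```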
Let that sink in. The test has a 95% detection rate and only a 10% false positive rate, and yet a positive result means you probably don't have the disease. How is that possible?
The Base Rate Is the Key
The answer lies in the base rate — the prevalence of the disease. Only 1% of people have it. So out of every 1,000 people tested:
- 10 have the disease. Of those, 9.5 test positive (true positives).
- 990 don't have the disease. Of those, 99 test positive (false positives).
That means 108.5 people test positive total, but only 9.5 of them actually have the disease. The false positives massively outnumber the true positives because there are so many more healthy people than sick ones.
Key Term: Base Rate Fallacy
The base rate fallacy (also called base rate neglect) is the error of ignoring the prior probability (base rate) of an event when evaluating evidence. In our medical example, people focus on the test's accuracy (95% sensitivity) and forget that the disease is very rare (1% prevalence). The base rate matters enormously.
This is the threshold concept of this chapter. Let me state it plainly:
Threshold Concept: Bayesian Updating
Probability is not fixed — it changes with evidence. Your initial estimate of a probability (the prior) gets updated when you receive new information (the evidence), producing a revised estimate (the posterior).
Before the test, the probability of disease was 1% (the prior). After a positive test, it jumped to 8.8% (the posterior). The positive test increased the probability almost ninefold — but from a very low base. Evidence matters, but so does where you started.
This is Bayesian updating, and it's how rational reasoning about uncertainty actually works. It's how spam filters learn. It's how AI systems process evidence. And it's how you should think about test results, news stories, and claims backed by data.
9.6 Bayes' Theorem: The Formula
Now let's formalize what the tree diagram showed us.
The Formula
Mathematical Formulation: Bayes' Theorem
$$\boxed{P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}}$$
where $P(B)$ can be expanded using the law of total probability:
$$P(B) = P(B \mid A) \cdot P(A) + P(B \mid \text{not } A) \cdot P(\text{not } A)$$
So the full formula becomes:
$$\boxed{P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid \text{not } A) \cdot P(\text{not } A)}}$$
I know that looks intimidating. Let me translate each piece into plain English.
The Plain-English Version
Think of Bayes' theorem as a probability update machine. You feed in three things and get one thing out:
| What You Feed In | Symbol | Plain English |
|---|---|---|
| Prior probability | $P(A)$ | What you believed before seeing any evidence |
| Likelihood | $P(B \mid A)$ | How likely you'd see this evidence if A were true |
| False alarm rate | $P(B \mid \text{not } A)$ | How likely you'd see this evidence if A were NOT true |
| What You Get Out | Symbol | Plain English |
|---|---|---|
| Posterior probability | $P(A \mid B)$ | What you should believe after seeing the evidence |
Here's the intuition in one sentence: Bayes' theorem tells you how to update your beliefs when you get new evidence.
If the evidence is very likely when A is true ($P(B \mid A)$ is high) and very unlikely when A is false ($P(B \mid \text{not } A)$ is low), then seeing B should make you much more confident in A. But if B is fairly common even when A is false (high false alarm rate), then seeing B shouldn't change your mind as much.
Solving the Disease Screening Problem
Let's solve the puzzle from Section 9.1 with the formula.
Given:
- $P(\text{disease}) = 0.001$ (1 in 1,000 people have the disease)
- $P(\text{positive} \mid \text{disease}) = 0.99$ (99% sensitivity)
- $P(\text{positive} \mid \text{no disease}) = 0.02$ (2% false positive rate)
Find: $P(\text{disease} \mid \text{positive})$
$$P(\text{disease} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{disease}) \cdot P(\text{disease})}{P(\text{positive} \mid \text{disease}) \cdot P(\text{disease}) + P(\text{positive} \mid \text{no disease}) \cdot P(\text{no disease})}$$
$$= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.02 \times 0.999}$$
$$= \frac{0.00099}{0.00099 + 0.01998}$$
$$= \frac{0.00099}{0.02097}$$
$$\approx 0.0472$$
About 4.7%. Even with a test that's 99% accurate at detecting the disease and only has a 2% false positive rate, a positive result from a random screening means there's only about a 5% chance you actually have the disease.
The answer to our opening puzzle: not 99%, not 95%, but about 4.7%.
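The whole calculation fits in one small function. This is a sketch; the function name and signature are mine, not a library API:

```python
def bayes_posterior(prior, likelihood, false_alarm):
    """P(A|B) = P(B|A)*P(A) / [P(B|A)*P(A) + P(B|not A)*P(not A)]."""
    evidence = likelihood * prior + false_alarm * (1 - prior)
    return likelihood * prior / evidence

# The Section 9.1 puzzle: 1-in-1,000 prevalence, 99% sensitivity, 2% false positive rate
p = bayes_posterior(prior=0.001, likelihood=0.99, false_alarm=0.02)
print(f"P(disease | positive) = {p:.4f}")  # 0.0472
```

Try raising the prior to 0.5 (say, a patient with strong symptoms): the same test now yields a posterior near 98%. The test didn't change; the base rate did.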
Theme Connection: Superpower for Medical Decisions (Theme 1)
Understanding Bayes' theorem gives you a genuine superpower when navigating medical decisions. If a doctor tells you a test is "99% accurate," you now know to ask: "What's the base rate of the condition?" and "What's the false positive rate?" Those two numbers, combined with Bayes' theorem, will tell you far more than "99% accurate" ever could. This isn't abstract math — it's the difference between panicking over a false alarm and understanding what a test result actually means.
9.7 The Natural Frequency Approach: Bayes Without the Formula
If the formula feels clunky, there's a much more intuitive approach: natural frequencies. Instead of working with probabilities and decimals, translate everything into counts of people.
How It Works
Take our disease screening problem. Instead of thinking about probabilities, imagine testing 100,000 people.
Step 1: Start with 100,000 people.
Step 2: Split by disease status using the base rate.
- Disease prevalence: 1 in 1,000 → 100 people have the disease, 99,900 don't.

Step 3: Apply the test to each group.
- Of the 100 with disease: 99% test positive → 99 true positives, 1 false negative.
- Of the 99,900 without disease: 2% test positive → 1,998 false positives, 97,902 true negatives.
Step 4: Answer the question.
Total positive tests: $99 + 1{,}998 = 2{,}097$
Of those, how many actually have the disease? $99$.
$$P(\text{disease} \mid \text{positive}) = \frac{99}{2{,}097} \approx 0.0472$$
Same answer as the formula — about 4.7%. But this time, no algebra. Just counting.
Why Natural Frequencies Work Better
Research by psychologist Gerd Gigerenzer has shown that people reason about probabilities much more accurately when the information is presented as natural frequencies rather than percentages or decimal probabilities. In studies, physicians who were given the information as "1 out of 1,000 people have the disease" performed dramatically better than those who were told "the prevalence is 0.1%." Our brains evolved to track frequencies, not probabilities.
Rule of thumb: When solving Bayes problems, always try the natural frequency approach first. If you need the formula for a formal calculation, use the frequencies to check your work.
Visualizing It
Here's the natural frequency breakdown as a diagram:
```
100,000 people tested
├── 100 have disease (base rate = 0.1%)
│   ├── 99 test POSITIVE (true positives)
│   └── 1 tests negative (false negative)
└── 99,900 disease-free
    ├── 1,998 test POSITIVE (false positives)
    └── 97,902 test negative (true negatives)

All positive results: 99 + 1,998 = 2,097
Actually have disease: 99 out of 2,097 = 4.7%
```
The key insight pops right out: false positives outnumber true positives 20 to 1 because the disease is so rare. Even a small false positive rate, applied to a huge group of healthy people, produces a flood of false alarms.
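The same breakdown takes only a few lines of Python, counting people instead of multiplying probabilities. A sketch with my own variable names:

```python
population = 100_000
diseased = population // 1_000          # 1-in-1,000 base rate -> 100 people
healthy  = population - diseased        # 99,900 people

true_pos  = round(diseased * 0.99)      # 99 true positives
false_pos = round(healthy * 0.02)       # 1,998 false positives

ppv = true_pos / (true_pos + false_pos)
print(f"{true_pos} of {true_pos + false_pos} positives are real: {ppv:.4f}")
```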
9.8 Medical Testing Vocabulary
Before we go further, let's formalize the vocabulary that medical professionals use. You've already seen these concepts in action — now let's name them.
Key Terms: Sensitivity and Specificity
Sensitivity (also called the true positive rate) is the probability that the test correctly identifies someone who HAS the condition:
$$\text{Sensitivity} = P(\text{positive} \mid \text{disease}) = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
Specificity (also called the true negative rate) is the probability that the test correctly identifies someone who does NOT have the condition:
$$\text{Specificity} = P(\text{negative} \mid \text{no disease}) = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$$
A few related terms:
Key Terms: False Positive and False Negative
A false positive (Type I error in testing) occurs when the test says "positive" but the person doesn't actually have the condition. The false positive rate = $1 - \text{specificity}$.
A false negative (Type II error in testing) occurs when the test says "negative" but the person actually has the condition. The false negative rate = $1 - \text{sensitivity}$.
Here's the complete 2×2 table for diagnostic testing:
| | Has Condition | No Condition |
|---|---|---|
| Test Positive | True Positive (TP) | False Positive (FP) |
| Test Negative | False Negative (FN) | True Negative (TN) |
And the key relationships:
| Measure | Formula | Question It Answers |
|---|---|---|
| Sensitivity | TP / (TP + FN) | "If I'm sick, will the test catch it?" |
| Specificity | TN / (TN + FP) | "If I'm healthy, will the test say so?" |
| Positive Predictive Value (PPV) | TP / (TP + FP) | "If I test positive, am I actually sick?" |
| Negative Predictive Value (NPV) | TN / (TN + FN) | "If I test negative, am I really fine?" |
The PPV is the answer Bayes' theorem gives us. It's the probability you actually have the condition given a positive test — and it depends heavily on the base rate.
Why Does This Matter for Maya?
Maya's public health work involves designing screening programs. She now understands a critical trade-off:
- High sensitivity means you catch almost everyone who's sick (few false negatives). Great for dangerous diseases where you can't afford to miss anyone.
- High specificity means you rarely alarm people who are healthy (few false positives). Great for conditions where a false positive triggers expensive or stressful follow-up.
- You can't maximize both without an extremely good test. And even with a great test, the base rate determines how many false alarms you'll generate.
This is why population-wide screening for rare conditions produces so many false positives — and why doctors follow a positive screening result with a more specific confirmatory test.
9.9 Working Through Bayes: Alex's Recommendation Engine
Let's see Bayes' theorem in a different context. Alex Rivera at StreamVibe is building a recommendation system, and he needs to predict whether a user will watch a recommended movie.
Setup:
- Overall, 15% of users who are shown a recommendation click and watch it. This is the prior: $P(\text{watch}) = 0.15$.
- Among users who watched, 60% had previously watched a movie in the same genre. This is the likelihood: $P(\text{same genre} \mid \text{watch}) = 0.60$.
- Among users who didn't watch, only 20% had watched a movie in the same genre previously: $P(\text{same genre} \mid \text{didn't watch}) = 0.20$.
Question: A user has previously watched a movie in the same genre. What's the updated probability that they'll watch the recommendation?
Using Bayes' theorem:
$$P(\text{watch} \mid \text{same genre}) = \frac{P(\text{same genre} \mid \text{watch}) \cdot P(\text{watch})}{P(\text{same genre})}$$
First, calculate $P(\text{same genre})$:
$$P(\text{same genre}) = P(\text{same genre} \mid \text{watch}) \cdot P(\text{watch}) + P(\text{same genre} \mid \text{didn't watch}) \cdot P(\text{didn't watch})$$
$$= 0.60 \times 0.15 + 0.20 \times 0.85 = 0.09 + 0.17 = 0.26$$
Now apply Bayes':
$$P(\text{watch} \mid \text{same genre}) = \frac{0.60 \times 0.15}{0.26} = \frac{0.09}{0.26} \approx 0.346$$
The prior probability of watching was 15%. After learning the user has watched movies in the same genre, it jumped to about 35%. That's a meaningful update — and it's exactly the kind of calculation that powers recommendation engines.
Let's verify with natural frequencies. Imagine 1,000 users:
- 150 watch (15%). Of those, 90 had same-genre history ($150 \times 0.60$).
- 850 don't watch. Of those, 170 had same-genre history ($850 \times 0.20$).
- Total with same-genre history: $90 + 170 = 260$.
- Of those, $90/260 = 0.346$ watched. ✓
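The same update in code, with the law of total probability supplying the denominator. A sketch; the variable names are mine:

```python
prior = 0.15                 # P(watch)
lik_watch    = 0.60          # P(same genre | watch)
lik_no_watch = 0.20          # P(same genre | didn't watch)

# Law of total probability: P(same genre) = 0.26
evidence = lik_watch * prior + lik_no_watch * (1 - prior)
posterior = lik_watch * prior / evidence

print(f"P(watch | same genre) = {posterior:.3f}")  # 0.346
```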
9.10 Multiple Updates: Bayes as a Learning Machine
One of the most powerful features of Bayes' theorem is that you can apply it repeatedly. Each piece of evidence updates your probability, and yesterday's posterior becomes today's prior.
Let's extend Alex's example. After learning the user has same-genre history (prior updated from 15% to 34.6%), Alex discovers that the user also spent more than 30 minutes browsing the platform today.
New evidence:
- Among users who watched a recommendation, 40% browsed for 30+ minutes: $P(\text{browsed} \mid \text{watch}) = 0.40$.
- Among users who didn't watch, 10% browsed that long: $P(\text{browsed} \mid \text{didn't watch}) = 0.10$.
New prior (yesterday's posterior): $P(\text{watch}) = 0.346$.
$$P(\text{browsed}) = 0.40 \times 0.346 + 0.10 \times 0.654 = 0.1384 + 0.0654 = 0.2038$$
$$P(\text{watch} \mid \text{browsed and same genre}) = \frac{0.40 \times 0.346}{0.2038} = \frac{0.1384}{0.2038} \approx 0.679$$
The probability jumped from 15% (no info) to 34.6% (same genre) to 67.9% (same genre + long browse). Each piece of evidence refines the estimate.
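Chaining the updates takes one helper function, with each posterior fed back in as the next prior. A sketch; the function is mine:

```python
def update(prior, lik_if_true, lik_if_false):
    """One Bayesian update: returns P(A | evidence)."""
    return lik_if_true * prior / (
        lik_if_true * prior + lik_if_false * (1 - prior))

p = 0.15                    # baseline: 15% of users watch a recommendation
p = update(p, 0.60, 0.20)   # evidence 1: same-genre history
print(f"after same-genre history: {p:.3f}")   # 0.346
p = update(p, 0.40, 0.10)   # evidence 2: 30+ minutes of browsing
print(f"after long browse too:   {p:.3f}")    # 0.679
```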
This is how AI learns. A spam filter doesn't just check one word — it checks hundreds of features, updating its probability estimate with each one. A medical AI doesn't just look at one symptom — it integrates age, history, lab results, and imaging, each new piece of evidence nudging the probability up or down. The core logic is the same Bayes' theorem you just learned, applied at enormous scale.
Theme Connection: AI Literally Runs on Bayes (Theme 3)
This isn't a metaphor. AI systems from spam filters to self-driving cars are built on Bayesian updating. Here's how:
Spam filters (Naive Bayes classifiers): When an email arrives, the filter starts with a prior probability of spam (maybe 30%, based on historical rates). Then it looks at each word in the email. Words like "free," "winner," "click here," and "congratulations" increase the spam probability. Words like the recipient's name, work-related terms, or a known sender decrease it. Each word triggers a Bayesian update. After processing all the words, the final posterior probability determines whether the email goes to your inbox or spam folder.
The word "naive" in "Naive Bayes" comes from the (naive) assumption that each word is independent of the others. That assumption is technically wrong — "Nigerian" and "prince" are not independent in spam emails — but the classifier works remarkably well anyway.
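Here's a toy version of that word-by-word updating. The per-word likelihoods below are invented for illustration, not estimated from a real corpus:

```python
def bayes_update(prior, p_word_spam, p_word_ham):
    """Update P(spam) after observing one word."""
    p_word = p_word_spam * prior + p_word_ham * (1 - prior)
    return p_word_spam * prior / p_word

# Hypothetical likelihoods: (word, P(word | spam), P(word | legitimate))
words = [("free", 0.65, 0.05), ("winner", 0.40, 0.01), ("meeting", 0.05, 0.30)]

p_spam = 0.30                      # prior from historical spam rates
for word, ps, ph in words:
    p_spam = bayes_update(p_spam, ps, ph)
    print(f"after '{word}': P(spam) = {p_spam:.3f}")
```

Notice that "meeting," a word more common in legitimate mail, pulls the probability back down: evidence can cut in either direction.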
Large Language Models: When GPT or Claude generates text, it's calculating $P(\text{next word} \mid \text{all previous words})$ — a conditional probability. The model has learned these conditional probabilities from billions of text examples. Every word it generates is the result of a conditional probability calculation.
Understanding Bayes' theorem doesn't just help you pass this class — it helps you understand the technology that's reshaping the world.
9.11 James Washington and Recidivism: Bayes Meets Justice
Professor Washington is using Bayes' theorem to evaluate the predictive policing algorithm from a new angle. The algorithm claims to identify individuals at high risk of re-offending — but what does "high risk" actually mean in Bayesian terms?
Setup (from Washington's expanded dataset):
- The base rate of re-offense in the studied population is 20%: $P(\text{re-offense}) = 0.20$.
- The algorithm flags 75% of people who will re-offend as "high risk": $P(\text{high risk} \mid \text{re-offense}) = 0.75$ (the algorithm's sensitivity).
- The algorithm flags 22% of people who will NOT re-offend as "high risk": $P(\text{high risk} \mid \text{no re-offense}) = 0.22$ (the false positive rate, or $1 - \text{specificity}$).
Question: If someone is flagged as "high risk," what's the probability they'll actually re-offend?
Natural frequency approach — imagine 10,000 people:
- 2,000 will re-offend (20%). Of those, 1,500 are flagged high risk ($2{,}000 \times 0.75$).
- 8,000 won't re-offend. Of those, 1,760 are flagged high risk ($8{,}000 \times 0.22$).
- Total flagged high risk: $1{,}500 + 1{,}760 = 3{,}260$.
- Actual re-offenders among those flagged: $1{,}500 / 3{,}260 \approx 0.460$ or 46%.
Using the formula:
$$P(\text{re-offense} \mid \text{high risk}) = \frac{0.75 \times 0.20}{0.75 \times 0.20 + 0.22 \times 0.80} = \frac{0.15}{0.15 + 0.176} = \frac{0.15}{0.326} \approx 0.460$$
"This algorithm," Washington tells his research team, "labels someone 'high risk,' and there's less than a 50% chance they'll actually re-offend. That means the majority of people flagged 'high risk' are receiving a label — and potentially facing restrictions — for something they won't do."
He continues: "And here's what makes it worse. The base rate of re-offense might differ across racial groups, but the algorithm uses the same thresholds for everyone. If the base rate of re-offense is lower in one group, then the positive predictive value of a 'high risk' label is even lower for that group — meaning a larger proportion of 'high risk' labels are wrong."
This is Bayes' theorem applied to justice. The math doesn't tell you what policy to adopt — but it does tell you the true accuracy of the prediction. And that's information a democracy needs.
Spaced Review 3: Multiplication Rule for Independent Events (Ch. 8)
In Chapter 8, you learned the multiplication rule for independent events: $P(A \text{ and } B) = P(A) \times P(B)$. That rule works only when events are independent — when knowing one event doesn't change the probability of the other. Conditional probability generalizes this. When events are NOT independent, the general multiplication rule is:
$$P(A \text{ and } B) = P(A) \times P(B \mid A)$$
In other words, the probability of A and B happening together equals the probability of A times the probability of B given that A already happened. The independent multiplication rule from Chapter 8 is a special case: when $P(B \mid A) = P(B)$ (knowing A doesn't change B), the general rule simplifies to $P(A) \times P(B)$.
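A quick worked instance of the general rule (our own example, using a standard 52-card deck): drawing two aces without replacement is a dependent-events problem, because the second draw's probability changes after the first.

```python
from fractions import Fraction

# General rule: P(ace and ace) = P(ace) * P(ace | first was an ace)
p_dependent = Fraction(4, 52) * Fraction(3, 51)
print(p_dependent)        # 1/221

# Wrongly assuming independence overstates the probability:
p_naive = Fraction(4, 52) * Fraction(4, 52)
print(p_naive)            # 1/169
```

Using `Fraction` keeps the arithmetic exact, so you can see the two answers really are different, not just rounding noise.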
9.12 Bayes' Theorem in Python
Let's build a reusable Python function for Bayesian updating.
import pandas as pd
import numpy as np
def bayes_theorem(prior, likelihood, false_alarm):
"""
Calculate the posterior probability using Bayes' theorem.
Parameters:
-----------
prior : float
P(A) — the prior probability of the hypothesis
likelihood : float
P(B|A) — probability of evidence given hypothesis is true
false_alarm : float
P(B|not A) — probability of evidence given hypothesis is false
Returns:
--------
float : P(A|B) — the posterior probability
"""
p_evidence = likelihood * prior + false_alarm * (1 - prior)
posterior = (likelihood * prior) / p_evidence
return posterior
# --- Example 1: Disease screening (Section 9.6) ---
posterior = bayes_theorem(
prior=0.001, # 1 in 1,000 have disease
likelihood=0.99, # 99% sensitivity
false_alarm=0.02 # 2% false positive rate
)
print(f"Disease screening:")
print(f" P(disease | positive test) = {posterior:.4f}")
print(f" That's {posterior*100:.1f}%, not 99%!\n")
# --- Example 2: Alex's recommendation engine ---
posterior_rec = bayes_theorem(
prior=0.15, # 15% base watch rate
likelihood=0.60, # 60% of watchers had same-genre history
false_alarm=0.20 # 20% of non-watchers had same-genre history
)
print(f"Recommendation engine:")
print(f" P(watch | same genre history) = {posterior_rec:.4f}\n")
# --- Example 3: Washington's algorithm ---
posterior_justice = bayes_theorem(
prior=0.20, # 20% base re-offense rate
likelihood=0.75, # 75% sensitivity
false_alarm=0.22 # 22% false positive rate
)
print(f"Recidivism algorithm:")
print(f" P(re-offense | flagged high risk) = {posterior_justice:.4f}\n")
# --- Natural Frequency Visualization ---
def natural_frequency_table(prior, likelihood, false_alarm,
population=100000):
"""Display a natural frequency breakdown."""
have_condition = int(population * prior)
no_condition = population - have_condition
true_pos = int(have_condition * likelihood)
false_neg = have_condition - true_pos
false_pos = int(no_condition * false_alarm)
true_neg = no_condition - false_pos
total_pos = true_pos + false_pos
ppv = true_pos / total_pos if total_pos > 0 else 0
print(f"Out of {population:,} people:")
print(f" {have_condition:,} have the condition")
print(f" → {true_pos:,} test positive (true positives)")
print(f" → {false_neg:,} test negative (false negatives)")
print(f" {no_condition:,} don't have the condition")
print(f" → {false_pos:,} test positive (false positives)")
print(f" → {true_neg:,} test negative (true negatives)")
print(f"\n Total positive: {total_pos:,}")
print(f" P(condition | positive) = {true_pos:,}/{total_pos:,}"
f" = {ppv:.4f} ({ppv*100:.1f}%)")
print("--- Natural Frequency Breakdown ---")
natural_frequency_table(prior=0.001, likelihood=0.99, false_alarm=0.02)
Computing Conditional Probabilities from a DataFrame
Here's how to use pd.crosstab and .groupby() for Bayesian analysis with real data:
import pandas as pd
import numpy as np
# Simulated data: 10,000 email messages
np.random.seed(42)
n = 10000
# 30% of emails are spam
is_spam = np.random.choice([1, 0], size=n, p=[0.30, 0.70])
# Word "free" appears in 65% of spam, 5% of legitimate email
contains_free = np.where(
is_spam == 1,
np.random.choice([1, 0], size=n, p=[0.65, 0.35]),
np.random.choice([1, 0], size=n, p=[0.05, 0.95])
)
email_df = pd.DataFrame({
'is_spam': is_spam,
'contains_free': contains_free
})
# Contingency table
print("Contingency Table:")
ct = pd.crosstab(email_df['is_spam'], email_df['contains_free'],
margins=True,
rownames=['Spam'], colnames=['Contains "free"'])
print(ct)
print()
# Conditional probability: P(spam | contains "free")
# Method: filter to emails containing "free", then check spam rate
has_free = email_df[email_df['contains_free'] == 1]
p_spam_given_free = has_free['is_spam'].mean()
print(f"P(spam | contains 'free') = {p_spam_given_free:.4f}")
# Compare with P(spam | no "free")
no_free = email_df[email_df['contains_free'] == 0]
p_spam_given_no_free = no_free['is_spam'].mean()
print(f"P(spam | no 'free') = {p_spam_given_no_free:.4f}")
print()
# Bayes calculation for comparison
prior_spam = email_df['is_spam'].mean()
p_free_given_spam = email_df[email_df['is_spam']==1]['contains_free'].mean()
p_free_given_legit = email_df[email_df['is_spam']==0]['contains_free'].mean()
posterior = bayes_theorem(prior_spam, p_free_given_spam, p_free_given_legit)
print(f"Bayes calculation: P(spam | 'free') = {posterior:.4f}")
print(f"Direct calculation matches Bayes!")
9.13 Prior and Posterior: The Language of Updating
Let me formalize two terms you've been seeing throughout this chapter.
Key Terms: Prior and Posterior Probability
The prior probability (or simply the prior) is your probability estimate before seeing new evidence. It represents your starting belief.
The posterior probability (or simply the posterior) is your updated probability after incorporating new evidence. It represents your revised belief.
Bayes' theorem is the mathematical bridge between the two:
$$\text{Prior} + \text{Evidence} \xrightarrow{\text{Bayes}} \text{Posterior}$$
The beauty of this framework is that it's iterative. Today's posterior becomes tomorrow's prior when new evidence arrives. This is exactly how:
- Medical diagnosis works: a doctor starts with a base rate, updates with symptoms, then updates again with test results.
- Weather forecasting works: models start with climatological averages and update with satellite data, radar, and surface observations.
- Machine learning works: models start with initial parameter estimates and update with training data.
- Scientific reasoning works: researchers start with prior knowledge and update with experimental results.
How Strong Is the Evidence?
Not all evidence is created equal. The strength of a Bayesian update depends on the ratio of the likelihood to the false alarm rate:
$$\text{Likelihood ratio} = \frac{P(B \mid A)}{P(B \mid \text{not } A)}$$
- If this ratio is much greater than 1, the evidence strongly supports A.
- If this ratio is close to 1, the evidence is weak — equally compatible with A or not-A.
- If this ratio is less than 1, the evidence actually makes A less likely.
In our disease example: $\frac{0.99}{0.02} = 49.5$. A positive test multiplies the odds of disease by 49.5. That's strong evidence! But the prior was so low (0.001) that even a 49.5× boost only gets you to a posterior of about 4.7%.
In the spam example: $\frac{0.65}{0.05} = 13$. Finding the word "free" makes spam 13 times more likely. Starting from a 30% prior, that's enough to push the posterior above 80%.
The lesson: strong evidence means more when the prior isn't too extreme.
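This ratio also gives the "odds form" of Bayes' theorem: posterior odds equal prior odds times the likelihood ratio. A small sketch that reproduces both numbers above:

```python
def posterior_from_lr(prior, likelihood_ratio):
    """Odds form of Bayes: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

print(f"disease: {posterior_from_lr(0.001, 49.5):.3f}")   # 0.047
print(f"spam:    {posterior_from_lr(0.30, 13):.3f}")      # 0.848
```

The odds form makes the lesson visible in the code: a huge likelihood ratio applied to tiny prior odds still yields small posterior odds.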
9.14 The Base Rate Fallacy: A Vivid Example
The base rate fallacy is so common and so consequential that it deserves its own section.
The Rare Disease Example (Extended)
Imagine a country implements universal screening for a disease that affects 1 in 10,000 people. The test has 99% sensitivity and 99% specificity — an excellent test by any standard.
If you screen 10 million people:
- 1,000 have the disease. Of those, 990 test positive (true positives).
- 9,999,000 don't have the disease. Of those, 99,990 test positive (false positives).
- Total positive: $990 + 99{,}990 = 100{,}980$.
- PPV: $990 / 100{,}980 \approx 0.0098$ or about 1%.
Even with a 99% accurate test, a positive result means only a 1% chance of actually having the disease. The false positives outnumber the true positives 100 to 1.
This is why public health agencies don't recommend universal screening for very rare conditions — the math guarantees a flood of false alarms, each one causing unnecessary anxiety, invasive follow-up testing, and wasted medical resources.
When the Base Rate Changes Everything
Here's a table showing how the positive predictive value (PPV) changes with disease prevalence, holding the test accuracy constant (sensitivity = 99%, specificity = 99%):
| Disease Prevalence | $P(\text{disease} \mid \text{positive test})$ | False Positives per True Positive |
|---|---|---|
| 1 in 10 (10%) | 91.7% | 0.09 |
| 1 in 100 (1%) | 50.0% | 1 |
| 1 in 1,000 (0.1%) | 9.0% | 10 |
| 1 in 10,000 (0.01%) | 1.0% | 100 |
| 1 in 100,000 (0.001%) | 0.1% | 1,000 |
The same test — identical accuracy — produces wildly different results depending on who you test. When the disease is common (10%), a positive result is very reliable (91.7%). When the disease is rare (0.001%), a positive result is almost meaningless (0.1%).
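You can reproduce the table's pattern in a few lines by holding sensitivity and specificity at 99% and sweeping the prevalence:

```python
def ppv(prevalence, sensitivity=0.99, specificity=0.99):
    """Positive predictive value: P(condition | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in [0.10, 0.01, 0.001, 0.0001, 0.00001]:
    print(f"prevalence {prev:<8} -> PPV = {ppv(prev):.3f}")
```

Try changing the specificity to 0.999 and rerunning: improving the false positive rate helps rare-disease screening far more than improving sensitivity does.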
Theme Connection: Updating Beliefs with Evidence (Theme 4)
The base rate fallacy reveals something deep about how human minds process uncertainty. We're drawn to the dramatic, specific information (a 99% accurate test said positive!) and we neglect the boring, general information (this disease affects 1 in 10,000 people). But the boring information is often the most important.
This isn't just a medical issue. It's how people misjudge the probability that a Muslim person is a terrorist, that a Black driver is carrying drugs, or that a young person with tattoos is a criminal. When you start with an extremely low base rate and ignore it in favor of superficially dramatic "evidence," you end up with conclusions that are statistically — and morally — wrong.
9.15 Tree Diagrams: A Complete Worked Example
Let's work through a complete tree diagram from start to finish with a new problem.
Alex's A/B Test With User Segments
Alex is running a new feature on StreamVibe and wants to predict user churn. He has data on two user segments:
- Heavy users (watch 5+ hours/week): make up 35% of users.
- Light users (watch less than 5 hours/week): make up 65% of users.
Among heavy users, 5% churn within 3 months. Among light users, 18% churn within 3 months.
Build the tree:
                All Users
               /         \
           Heavy          Light
         P = 0.35       P = 0.65
          /     \        /     \
      Churn    Stay   Churn    Stay
     P=0.05  P=0.95  P=0.18  P=0.82
       |        |       |       |
    0.0175   0.3325  0.1170  0.5330
Path probabilities (multiply along branches):
| Path | Calculation | Result |
|---|---|---|
| Heavy and Churn | $0.35 \times 0.05$ | 0.0175 |
| Heavy and Stay | $0.35 \times 0.95$ | 0.3325 |
| Light and Churn | $0.65 \times 0.18$ | 0.1170 |
| Light and Stay | $0.65 \times 0.82$ | 0.5330 |
| Total | | 1.0000 ✓ |
Questions we can now answer:
Q1: What's the overall probability of churn?
$$P(\text{churn}) = P(\text{heavy and churn}) + P(\text{light and churn}) = 0.0175 + 0.117 = 0.1345$$
About 13.45% of all users churn. (This is the law of total probability in action.)
Q2: If a user churned, what's the probability they were a light user?
$$P(\text{light} \mid \text{churn}) = \frac{P(\text{light and churn})}{P(\text{churn})} = \frac{0.117}{0.1345} = 0.870$$
87% of churners were light users. That makes sense: light users churn at a higher rate, and there are nearly twice as many of them, so both factors push in the same direction — the vast majority of users who actually churn come from the light segment.
Q3: If a user churned, what's the probability they were a heavy user?
$$P(\text{heavy} \mid \text{churn}) = \frac{0.0175}{0.1345} = 0.130$$
Only 13% of churners were heavy users. Alex concludes: to reduce churn, focus retention efforts on light users. That's where the volume is.
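The whole tree fits in a few lines of Python: multiply along each branch, apply the law of total probability for the overall churn rate, then divide for the posteriors.

```python
segment_share = {"heavy": 0.35, "light": 0.65}
churn_rate   = {"heavy": 0.05, "light": 0.18}

# Law of total probability: sum the churn paths across segments
p_churn = sum(segment_share[s] * churn_rate[s] for s in segment_share)
print(f"P(churn) = {p_churn:.4f}")            # 0.1345

# Bayes: which segment did a churner come from?
for s in segment_share:
    posterior = segment_share[s] * churn_rate[s] / p_churn
    print(f"P({s} | churn) = {posterior:.3f}")
```

Structuring the branches as dictionaries means adding a third segment (say, "new users") only requires adding one entry to each dictionary.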
9.16 Independence Revisited Through Conditional Probability
Now that you understand conditional probability, we can give a more precise definition of independence.
Formal Definition: Independence
Events A and B are independent if and only if:
$$P(A \mid B) = P(A)$$
In words: knowing that B occurred doesn't change the probability of A. Equivalently, $P(B \mid A) = P(B)$.
Let's test this with Maya's data. Are smoking and respiratory illness independent?
$$P(\text{illness}) = \frac{110}{500} = 0.22$$
$$P(\text{illness} \mid \text{smoker}) = \frac{68}{160} = 0.425$$
Since $0.425 \neq 0.22$, knowing someone is a smoker does change the probability of illness. These events are not independent.
If they were independent, the conditional probability would equal the marginal probability: $P(\text{illness} \mid \text{smoker}) = P(\text{illness}) = 0.22$. That would mean being a smoker doesn't affect illness rates — clearly not the case.
Connection to Chapter 8: In Chapter 8, we used the multiplication rule $P(A \text{ and } B) = P(A) \times P(B)$ for independent events. We can now see why that rule works: when $P(B \mid A) = P(B)$, the general multiplication rule $P(A \text{ and } B) = P(A) \times P(B \mid A)$ simplifies to $P(A) \times P(B)$. The independent multiplication rule was a special case all along.
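The definition translates directly into a check in code. A small sketch using Maya's counts (the 0.01 tolerance is an arbitrary choice for this illustration; real analyses would use a formal test):

```python
# Maya's survey counts from the chapter's contingency table
total, ill = 500, 110
smokers, ill_smokers = 160, 68

p_ill = ill / total                          # marginal: 0.22
p_ill_given_smoker = ill_smokers / smokers   # conditional: 0.425

# Independent only if the conditional equals the marginal
independent = abs(p_ill_given_smoker - p_ill) < 0.01
print(f"P(illness) = {p_ill:.3f}, P(illness | smoker) = {p_ill_given_smoker:.3f}")
print(f"Independent? {independent}")         # False
```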
9.17 Progressive Project Checkpoint: Conditional Probabilities in Your Dataset
It's time to bring conditional probability and Bayes' theorem to your own Data Detective Portfolio.
Your Task
- Return to your contingency table from Chapter 8. Using the same two categorical variables, calculate at least two conditional probabilities and interpret them in context.
import pandas as pd
# Load your clean dataset
df = pd.read_csv('your_clean_dataset.csv')
# Create contingency table (review from Ch.8)
ct = pd.crosstab(df['variable_1'], df['variable_2'], margins=True)
print("Contingency table:")
print(ct)
print()
# Conditional probabilities using normalize='index' (row-wise)
cond_probs = pd.crosstab(df['variable_1'], df['variable_2'],
normalize='index')
print("Conditional probabilities (by row):")
print(cond_probs.round(4))
print()
# Direct calculation for specific conditional probability
# P(variable_2 = X | variable_1 = Y)
subset = df[df['variable_1'] == 'Y']
p_conditional = (subset['variable_2'] == 'X').mean()
print(f"P(variable_2 = X | variable_1 = Y) = {p_conditional:.4f}")
- Demonstrate that P(A|B) ≠ P(B|A) using your data. Calculate both directions and explain what each one means in the context of your dataset.
- Apply Bayes' theorem to a question about your data. You might:
  - Treat one variable as the "test" and the other as the "condition"
  - Use natural frequencies: "Out of 1,000 people like those in my dataset..."
  - Compare the prior probability to the posterior probability
- Draw a tree diagram (by hand or in your notebook) for one conditional probability problem in your data.
- Write a paragraph connecting this analysis to your research question. Does conditioning on one variable reveal something the unconditional probabilities hid?
Example: What Good Output Looks Like
"In the BRFSS dataset, I calculated P(current smoker | poor/fair health) = 0.276 and P(poor/fair health | current smoker) = 0.321. These are different! The first says that among people in poor/fair health, about 28% smoke. The second says that among current smokers, about 32% report poor/fair health. The asymmetry shows that poor health and smoking are associated but in different proportions depending on which direction you look. Using Bayes' theorem with a prior of P(smoker) = 0.18, I found that learning someone is in poor/fair health increases the probability of being a smoker from 18% to 27.6% — a meaningful update that supports the hypothesis of a smoking-health association."
9.18 Chapter Summary
Let's take stock of what you've accomplished in this chapter — because it's a lot.
You've learned conditional probability ($P(A \mid B)$), the idea that probabilities change when you have new information. You've seen how to calculate it from contingency tables by restricting your universe to just the "given" condition.
You've learned why P(A|B) ≠ P(B|A) — and why confusing the two is called the prosecutor's fallacy. You've seen how this confusion can have real consequences in medicine, criminal justice, and everyday reasoning.
You've learned Bayes' theorem, the mathematical formula that translates evidence into updated beliefs. You can express it as a formula:
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
Or as the more intuitive natural frequency approach: "Imagine 10,000 people..."
You've seen why base rates matter — that even excellent tests produce false alarms when the underlying condition is rare. And you've seen that the base rate fallacy, the tendency to ignore prior probabilities, is one of the most common reasoning errors humans make.
You've constructed tree diagrams to visualize multi-step probability problems and trace every possible path from start to finish.
And you've seen that Bayes' theorem is the engine of AI: spam filters, recommendation systems, language models, and criminal justice algorithms all use Bayesian updating at their core.
The threshold concept of this chapter — probability is not fixed; it changes with evidence — is the conceptual shift that separates casual probability thinking from the kind of reasoning that actually works in the real world. Every confidence interval (Chapter 12), hypothesis test (Chapter 13), and regression model (Chapters 22-24) you'll encounter builds on this foundation.
What's Next
In Chapter 10, we'll make a different kind of leap. So far, we've been working with discrete events — things that either happen or don't. But many variables in the real world are continuous: heights, weights, test scores, incomes. How do you apply probability to a variable that can take infinitely many values? The answer involves probability distributions — and one distribution in particular, the normal curve, that will change how you think about data forever.
Key Formulas at a Glance
| Concept | Formula | When to Use |
|---|---|---|
| Conditional probability | $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$ | Finding probability given new information |
| General multiplication rule | $P(A \text{ and } B) = P(A) \times P(B \mid A)$ | Finding joint probability when events are NOT independent |
| Bayes' theorem | $P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$ | Updating probability with evidence |
| Law of total probability | $P(B) = P(B \mid A) \cdot P(A) + P(B \mid A') \cdot P(A')$ | Finding overall probability by combining branches |
| Sensitivity | $P(\text{pos} \mid \text{disease})$ | Test's ability to detect true positives |
| Specificity | $P(\text{neg} \mid \text{no disease})$ | Test's ability to detect true negatives |
| PPV (Bayes applied) | $\frac{P(\text{pos} \mid \text{disease}) \cdot P(\text{disease})}{P(\text{pos})}$ | "If I test positive, what's the real probability I'm sick?" |