
Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies

"The theory of probability is at bottom nothing but common sense reduced to calculus." — Pierre-Simon Laplace


Chapter Overview

Here's a question: if you flip a fair coin and get heads five times in a row, what are the chances of getting heads on the next flip?

If something in your gut says "tails is due," you're in good company. Most humans feel that way. And most humans are wrong.

The coin doesn't know it came up heads five times. It doesn't have a memory. The probability of heads on the next flip is exactly what it was on every other flip: 50%. The feeling that tails is "due" is called the gambler's fallacy, and it's just one of the many ways our brains systematically misunderstand randomness.

This chapter is about learning to think about uncertainty clearly — not by memorizing formulas, but by simulating random processes in Python and watching the patterns emerge. We'll flip coins, roll dice, draw cards, and sample data. We'll see why the "law of averages" works in the long run but fails in the short run. We'll learn Bayes' theorem — one of the most powerful ideas in all of data science — by working through a medical testing example that will change how you read health statistics.

And throughout it all, we'll confront the uncomfortable truth that human intuition about probability is, frankly, terrible. The good news: once you learn to think probabilistically, the world makes a lot more sense.

In this chapter, you will learn to:

  1. Define probability and distinguish between theoretical, empirical, and subjective interpretations (all paths)
  2. Simulate random experiments in Python to estimate probabilities (all paths)
  3. Apply the complement rule, addition rule, and multiplication rule (all paths)
  4. Compute and interpret conditional probability and apply Bayes' theorem (standard + deep dive paths)
  5. Explain the law of large numbers and demonstrate it through simulation (all paths)

A note about math anxiety: If the word "probability" makes you tense up, take a breath. We're going to do this differently than your high school math class did. Every concept starts with a simulation — you'll see it work before we name it. Every formula gets a plain-English translation. And I promise: if you can write a for loop in Python, you can do probability. Let's go.


20.1 What Is Probability? Three Ways to Think About It

Before we compute anything, let's agree on what probability actually means. There are three ways to think about it, and all three are useful.

Interpretation 1: Classical (Theoretical) Probability

Count the favorable outcomes and divide by the total outcomes.

A fair die has 6 faces. The probability of rolling a 4 is 1/6. Done. This works when all outcomes are equally likely, and when you can count them.
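
The counting definition translates directly into code. A minimal sketch using exact fractions (the variable names are mine):

```python
from fractions import Fraction

# Classical probability: favorable outcomes / total outcomes
sample_space = [1, 2, 3, 4, 5, 6]   # all faces of a fair die
favorable = [x for x in sample_space if x == 4]

p_four = Fraction(len(favorable), len(sample_space))
print(f"P(rolling a 4) = {p_four} = {float(p_four):.4f}")
```

Using `Fraction` keeps the answer exact (1/6) instead of a rounded decimal, which matches how classical probability is usually stated.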

Interpretation 2: Frequentist (Empirical) Probability

The proportion of times something happens if you repeat the experiment many, many times.

If you flip a coin 10,000 times and get heads 5,023 times, the empirical probability of heads is 5,023/10,000 = 0.5023. This is close to 0.5, and it would get closer the more flips you do.

Interpretation 3: Subjective (Bayesian) Probability

Your personal degree of belief that something will happen, based on available evidence.

"I think there's a 70% chance it will rain tomorrow." This isn't based on counting equally likely outcomes or on repeating the experiment thousands of times — it's a judgment based on what you know (the weather forecast, the clouds you see, the season). This interpretation is central to Bayesian statistics, which we'll encounter later in the book.

All three interpretations produce numbers between 0 and 1 (or 0% and 100%), where:

  • 0 means impossible
  • 1 means certain
  • 0.5 means equally likely to happen or not happen

# Let's start with the basics
print("Probability must be between 0 and 1:")
print("  Impossible event (rolling 7 on a standard die): P = 0")
print("  Certain event (rolling 1-6 on a standard die):  P = 1")
print("  Fair coin heads:                                  P = 0.5")

Key Vocabulary

Before we start simulating, let's nail down the language:

  • Experiment: A process whose outcome is uncertain (flipping a coin, drawing a card, sampling a country from the WHO dataset)
  • Sample space: The set of ALL possible outcomes. For a die: {1, 2, 3, 4, 5, 6}
  • Event: A specific outcome or set of outcomes we're interested in. "Rolling an even number" = {2, 4, 6}
  • Probability of an event: A number between 0 and 1 representing how likely the event is
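
These definitions map neatly onto Python sets. A small sketch (the specific event is my own illustration):

```python
# Sample space and events as Python sets
sample_space = {1, 2, 3, 4, 5, 6}   # all outcomes of one die roll
even = {2, 4, 6}                     # event: rolling an even number

# With equally likely outcomes, P(event) = |event| / |sample space|
p_even = len(even) / len(sample_space)
print(f"P(even) = {p_even}")         # 0.5

# Set operations mirror probability language
complement = sample_space - even     # "not even" = {1, 3, 5}
print(f"Complement of 'even': {sorted(complement)}")
```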

20.2 Simulating Probability in Python

Here's our philosophy: simulate first, formalize later. Instead of starting with formulas, we'll generate thousands of random outcomes in Python and let the patterns teach us.

Python's random module and NumPy's np.random are our tools for creating randomness.

Flipping a Coin

import numpy as np
import matplotlib.pyplot as plt

# Flip one coin
np.random.seed(42)
flip = np.random.choice(['Heads', 'Tails'])
print(f"One flip: {flip}")

# Flip 10 coins
ten_flips = np.random.choice(['Heads', 'Tails'], size=10)
print(f"Ten flips: {ten_flips}")
print(f"Heads count: {sum(ten_flips == 'Heads')}")

# Flip 10,000 coins
many_flips = np.random.choice(['Heads', 'Tails'], size=10000)
heads_count = sum(many_flips == 'Heads')
print(f"\n10,000 flips: {heads_count} heads ({heads_count/10000:.4f})")

With 10 flips, you might get 6 heads and 4 tails, or 3 heads and 7 tails — small samples are noisy. With 10,000 flips, you'll get something very close to 5,000 heads. This is your first glimpse of the law of large numbers, which we'll explore in depth later.

Rolling Dice

# Roll one die
roll = np.random.randint(1, 7)  # 1 to 6 (upper bound is exclusive)
print(f"One roll: {roll}")

# Roll 10,000 dice and estimate P(rolling a 6)
rolls = np.random.randint(1, 7, size=10000)
prob_six = np.mean(rolls == 6)
print(f"P(rolling 6) from simulation: {prob_six:.4f}")
print(f"P(rolling 6) from theory:     {1/6:.4f}")

# Roll two dice and find P(sum = 7)
die1 = np.random.randint(1, 7, size=100000)
die2 = np.random.randint(1, 7, size=100000)
total = die1 + die2
prob_seven = np.mean(total == 7)
print(f"\nP(sum of two dice = 7) from simulation: {prob_seven:.4f}")
print(f"P(sum of two dice = 7) from theory:     {6/36:.4f}")

Notice something beautiful: we didn't need any probability formulas. We just generated random outcomes, counted what we cared about, and divided by the total. This is Monte Carlo simulation — estimating probabilities by running random experiments many times. It's one of the most powerful techniques in data science.

The General Pattern

def estimate_probability(experiment_func, n_simulations=100000):
    """
    Estimate the probability of an event by simulation.

    experiment_func: a function that returns True if the event occurred
    n_simulations: how many times to repeat the experiment
    """
    successes = sum(experiment_func() for _ in range(n_simulations))
    return successes / n_simulations

# Example: P(rolling doubles with two dice)
def roll_doubles():
    die1 = np.random.randint(1, 7)
    die2 = np.random.randint(1, 7)
    return die1 == die2

prob_doubles = estimate_probability(roll_doubles)
print(f"P(doubles) from simulation: {prob_doubles:.4f}")
print(f"P(doubles) from theory:     {6/36:.4f}")

Thinking About Probability with Natural Frequencies

Before diving into rules and formulas, let me share a trick that makes probability much easier to think about. Instead of thinking in percentages or fractions, think in natural frequencies — actual counts of people or events.

Which of these is easier to understand?

  • "The probability of having the disease given a positive test is 4.7%."
  • "Out of every 1,000 people who test positive, about 47 actually have the disease."

Most people find the second one much clearer. Research by psychologist Gerd Gigerenzer has shown that people reason much better about probability when it's presented as natural frequencies rather than abstract percentages.

Throughout this chapter, whenever a probability feels confusing, try translating it into natural frequencies. "P = 0.02" becomes "about 2 out of every 100." "P = 0.001" becomes "about 1 in every 1,000." This simple trick can save you from many probability errors.

# Natural frequency thinking
# Instead of P(heads) = 0.5, think:
# "Out of 1000 flips, about 500 will be heads"

# Instead of P(two heads in a row) = 0.25, think:
# "Out of 1000 pairs of flips, about 250 will be double heads"

# Let's verify
np.random.seed(42)
n_pairs = 10000
pair_results = np.random.choice([0, 1], size=(n_pairs, 2))
both_heads = np.sum(np.all(pair_results == 1, axis=1))
print(f"Out of {n_pairs} pairs of flips, {both_heads} were double heads")
print(f"That's about {both_heads / n_pairs * 1000:.0f} out of every 1000")

20.3 Basic Probability Rules — Through Simulation

Now that we can simulate, let's discover the fundamental rules of probability. We'll find them empirically first, then name them.

Rule 1: The Complement

The complement of an event is "everything that's NOT that event." If the probability of rain is 0.3, the probability of no rain is 0.7. They always add up to 1.

$$P(\text{not } A) = 1 - P(A)$$

# Simulate: P(NOT rolling a 6)
rolls = np.random.randint(1, 7, size=100000)
prob_not_six = np.mean(rolls != 6)
print(f"P(not a 6) from simulation: {prob_not_six:.4f}")
print(f"P(not a 6) from complement: {1 - 1/6:.4f}")

In English: if you know the probability of something happening, the probability of it NOT happening is just 1 minus that. This sounds trivial, but it's surprisingly useful — sometimes it's easier to compute the probability of the complement and then subtract from 1.

Example: What's the probability of getting at least one 6 in four rolls? Instead of computing "exactly one 6" + "exactly two 6s" + "exactly three 6s" + "all four are 6s," use the complement: P(at least one 6) = 1 - P(zero 6s in four rolls).

# Simulate: P(at least one 6 in four rolls)
n_sim = 100000
at_least_one = 0
for _ in range(n_sim):
    four_rolls = np.random.randint(1, 7, size=4)
    if 6 in four_rolls:
        at_least_one += 1

prob_sim = at_least_one / n_sim
prob_complement = 1 - (5/6)**4

print(f"P(at least one 6 in 4 rolls):")
print(f"  Simulation: {prob_sim:.4f}")
print(f"  Complement: {prob_complement:.4f}")

Rule 2: The Addition Rule (OR)

What's the probability of event A or event B happening?

If the events can't happen at the same time (they're mutually exclusive), just add:

$$P(A \text{ or } B) = P(A) + P(B)$$

If they can overlap, subtract the overlap to avoid double-counting:

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$

# Simulate: P(rolling a 1 OR a 6) — mutually exclusive
rolls = np.random.randint(1, 7, size=100000)
prob_1_or_6 = np.mean((rolls == 1) | (rolls == 6))
print(f"P(1 or 6) simulation: {prob_1_or_6:.4f}")
print(f"P(1 or 6) theory:     {1/6 + 1/6:.4f}")

# Simulate: P(even OR greater than 4) — overlapping events
# Even: {2, 4, 6}, Greater than 4: {5, 6}
# Overlap: {6}
prob_even_or_gt4 = np.mean((rolls % 2 == 0) | (rolls > 4))
prob_theory = 3/6 + 2/6 - 1/6  # 4/6
print(f"\nP(even or >4) simulation: {prob_even_or_gt4:.4f}")
print(f"P(even or >4) theory:     {prob_theory:.4f}")

Rule 3: The Multiplication Rule (AND)

What's the probability of A and B both happening?

If the events are independent — one doesn't affect the other — multiply:

$$P(A \text{ and } B) = P(A) \times P(B)$$

# Simulate: P(two heads in a row)
n_sim = 100000
two_heads = 0
for _ in range(n_sim):
    flip1 = np.random.choice(['H', 'T'])
    flip2 = np.random.choice(['H', 'T'])
    if flip1 == 'H' and flip2 == 'H':
        two_heads += 1

prob_sim = two_heads / n_sim
print(f"P(two heads) simulation: {prob_sim:.4f}")
print(f"P(two heads) theory:     {0.5 * 0.5:.4f}")

Independence means that knowing the outcome of one event doesn't change the probability of the other. Coin flips are independent — the coin has no memory. But drawing cards from a deck without replacement is NOT independent — if you drew an ace first, there are fewer aces left.

# NOT independent: P(two aces from a deck without replacement)
n_sim = 100000
two_aces = 0
for _ in range(n_sim):
    deck = list(range(52))  # 0-51, where 0-3 are aces
    card1 = np.random.choice(deck)
    deck.remove(card1)
    card2 = np.random.choice(deck)
    if card1 < 4 and card2 < 4:  # Both are aces
        two_aces += 1

prob_sim = two_aces / n_sim
prob_theory = (4/52) * (3/51)  # Second draw depends on first
print(f"P(two aces without replacement):")
print(f"  Simulation: {prob_sim:.4f}")
print(f"  Theory:     {prob_theory:.4f}")

Understanding Independence: A Deeper Look

The concept of independence deserves more attention, because getting it wrong leads to serious errors — as we'll see in Case Study 1.

Two events are independent if knowing one occurred doesn't change the probability of the other. But how do you determine independence in practice?

Mathematical test: Events A and B are independent if and only if P(A and B) = P(A) * P(B).

Intuitive test: Ask yourself, "Does knowing that A happened give me any information about B?" If the answer is no, they're independent.

Let's work through some subtler examples:

# Example 1: Independent - weather in Paris vs. weather in Tokyo
# These cities are far enough apart that their weather patterns
# are essentially independent (on any given day).

# Example 2: NOT independent - weather today vs. weather tomorrow
# If it's raining today, it's more likely to rain tomorrow than
# if it's sunny today (weather patterns persist).

# Example 3: Tricky - drawing cards WITH replacement
# With replacement, each draw is independent — the deck resets.
# Without replacement, draws are NOT independent.

# Let's verify independence with cards
np.random.seed(42)
n_sims = 100000

# WITH replacement
hearts_with = 0
for _ in range(n_sims):
    card1 = np.random.randint(0, 52)  # 0-12 are hearts
    card2 = np.random.randint(0, 52)  # New full deck
    if card1 < 13 and card2 < 13:
        hearts_with += 1

# WITHOUT replacement
hearts_without = 0
for _ in range(n_sims):
    cards = np.random.choice(52, size=2, replace=False)
    if cards[0] < 13 and cards[1] < 13:
        hearts_without += 1

print("P(two hearts):")
print(f"  With replacement:    {hearts_with/n_sims:.4f} "
      f"(theory: {(13/52)**2:.4f})")
print(f"  Without replacement: {hearts_without/n_sims:.4f} "
      f"(theory: {(13/52)*(12/51):.4f})")
print(f"  The difference is small but real — and it matters!")

The independence assumption is particularly dangerous when applied to human events. People in the same family are NOT independent (they share genes, environment, and social context). Students in the same class are NOT independent (they share the same instructor, curriculum, and peer group). Countries in the same region are NOT independent (they share geography, trade patterns, and political influences).

Elena's lesson: Early in her analysis, Elena assumed that vaccination rates in neighboring counties were independent. They weren't — counties share health departments, media markets, and cultural attitudes. Treating correlated observations as independent led her to underestimate uncertainty in her early reports. She learned to check for spatial correlation.


20.4 Conditional Probability — When Context Changes Everything

This is where probability gets really interesting — and where human intuition really starts to fail.

Conditional probability is the probability of an event given that some other event has already occurred. It's written as P(A | B) and pronounced "the probability of A given B."

Here's an example that matters. Imagine you're looking at Elena's vaccination data:

  • 40% of countries in the dataset are "high income"
  • 90% of high-income countries have vaccination rates above 80%
  • 60% of ALL countries have vaccination rates above 80%

Now here's the question: if I tell you a randomly selected country has a vaccination rate above 80%, what's the probability that it's a high-income country?

Your gut might say 90%, because you remember that 90% of high-income countries have high vaccination rates. But that's backwards — that's P(high vaccination | high income), not P(high income | high vaccination).

This confusion between P(A|B) and P(B|A) is one of the most common and most dangerous errors in probabilistic reasoning. It's called base rate neglect, or the confusion of the inverse, and it shows up everywhere from medical testing to criminal trials.

Let's simulate to build intuition:

# Simulate 10,000 countries
np.random.seed(42)
n = 10000

# 40% are high income
is_high_income = np.random.random(n) < 0.40

# Vaccination rate depends on income
vacc_above_80 = np.zeros(n, dtype=bool)
for i in range(n):
    if is_high_income[i]:
        vacc_above_80[i] = np.random.random() < 0.90  # 90% of high income
    else:
        vacc_above_80[i] = np.random.random() < 0.40  # 40% of others

# P(high income | vacc > 80%)
high_income_given_high_vacc = np.mean(is_high_income[vacc_above_80])
print(f"P(high income | vacc > 80%): {high_income_given_high_vacc:.4f}")

# Compare to P(vacc > 80% | high income)
high_vacc_given_high_income = np.mean(vacc_above_80[is_high_income])
print(f"P(vacc > 80% | high income): {high_vacc_given_high_income:.4f}")

print(f"\nThese are NOT the same! The direction matters.")

The Formula

Conditional probability has a simple formula:

$$P(A | B) = \frac{P(A \text{ and } B)}{P(B)}$$

In English: "To find the probability of A given that B happened, take the probability that both A and B happened, and divide by the probability of B."

Why? Because once we know B happened, B becomes our new "universe." We're zooming in on only the cases where B is true, and asking how many of those are also A.

# Verify with our simulation
p_both = np.mean(is_high_income & vacc_above_80)
p_high_vacc = np.mean(vacc_above_80)

p_conditional = p_both / p_high_vacc
print(f"P(high income AND vacc>80%): {p_both:.4f}")
print(f"P(vacc > 80%):              {p_high_vacc:.4f}")
print(f"P(high income | vacc>80%) = {p_both:.4f} / {p_high_vacc:.4f} = {p_conditional:.4f}")

20.5 Bayes' Theorem — Updating Beliefs with Evidence

Bayes' theorem is arguably the single most important formula in data science. It tells you how to update your beliefs when you get new evidence. And despite its fearsome reputation, it's just conditional probability rearranged.

The Medical Test Example

This is the classic example, and it will change how you think about test results forever.

Imagine a disease that affects 1 in 1,000 people. There's a test for it:

  • If you HAVE the disease, the test is positive 99% of the time (sensitivity)
  • If you DON'T have the disease, the test is positive 2% of the time (false positive rate)

You take the test and it comes back positive. What's the probability you actually have the disease?

Most people say "99%" or "about 98%." They're wildly wrong. Let's simulate to find out:

np.random.seed(42)
n_people = 1000000  # Simulate a million people

# 1 in 1000 have the disease (the BASE RATE)
has_disease = np.random.random(n_people) < 0.001

# Test results
test_positive = np.zeros(n_people, dtype=bool)
for i in range(n_people):
    if has_disease[i]:
        test_positive[i] = np.random.random() < 0.99  # 99% sensitivity
    else:
        test_positive[i] = np.random.random() < 0.02  # 2% false positive

# Among those who tested positive, how many actually have the disease?
actually_sick = np.sum(has_disease & test_positive)
all_positive = np.sum(test_positive)

prob_disease_given_positive = actually_sick / all_positive
print(f"People with disease who tested positive: {actually_sick}")
print(f"Total positive tests:                    {all_positive}")
print(f"P(disease | positive test):              {prob_disease_given_positive:.4f}")
print(f"That's about {prob_disease_given_positive*100:.1f}%")

The answer is roughly 4.7%. Not 99%. Not 98%. About 5%.

How can a "99% accurate" test give such a low probability? Because the disease is rare. With 1 million people, only about 1,000 have the disease, and the test correctly identifies about 990 of them. But 999,000 people DON'T have the disease, and 2% of them — about 19,980 — test positive anyway. So out of roughly 20,970 positive tests, only 990 are true positives.

This is Bayes' theorem in action. The base rate (how common the disease is) matters enormously.

The Formula

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

In the medical test example:

  • P(A) = P(disease) = 0.001 — the prior (base rate)
  • P(B|A) = P(positive test | disease) = 0.99 — the likelihood
  • P(B) = P(positive test) = P(positive|disease) * P(disease) + P(positive|no disease) * P(no disease)
  • P(A|B) = P(disease | positive test) — the posterior (what we want)

# Bayes' theorem calculation
p_disease = 0.001                    # Prior: base rate
p_positive_given_disease = 0.99      # Sensitivity
p_positive_given_no_disease = 0.02   # False positive rate

# P(positive test) — total probability
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_no_disease * (1 - p_disease))

# Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"Prior P(disease):                  {p_disease:.4f}")
print(f"P(positive | disease):             {p_positive_given_disease:.4f}")
print(f"P(positive | no disease):          {p_positive_given_no_disease:.4f}")
print(f"P(positive):                       {p_positive:.4f}")
print(f"Posterior P(disease | positive):   {p_disease_given_positive:.4f}")
print(f"                                   = {p_disease_given_positive*100:.1f}%")

Why This Matters for Data Science

Bayes' theorem isn't just a probability formula. It's a way of thinking:

  1. Start with what you already know (the prior — the base rate).
  2. Observe new evidence (the test result, the data).
  3. Update your belief (the posterior — your revised estimate).

This prior-to-posterior updating is the foundation of Bayesian statistics, which is increasingly important in data science. But even if you never use Bayesian methods explicitly, understanding Bayes' theorem will help you:

  • Interpret medical test results correctly
  • Understand why spam filters work
  • Avoid the base rate fallacy in everyday reasoning
  • Think clearly about evidence and uncertainty
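
As a toy illustration of the spam-filter point, the same prior/likelihood/posterior update applies to a single word in an email. All the numbers below are invented for the sketch:

```python
# Toy one-word Bayesian spam filter (all probabilities invented)
p_spam = 0.30                 # prior: fraction of mail that is spam
p_word_given_spam = 0.60      # "free" appears in 60% of spam
p_word_given_ham = 0.05       # ...and in 5% of legitimate mail

# Total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: posterior probability the message is spam
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | contains 'free') = {p_spam_given_word:.3f}")
```

Real spam filters combine evidence from many words, but each word contributes through exactly this update.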

Base rate neglect is the tendency to ignore the base rate (the prior) and focus only on the test's accuracy. It's one of the most well-documented cognitive biases, identified by psychologists Daniel Kahneman and Amos Tversky. When you hear "99% accurate," your brain latches onto that number and ignores how rare the condition is. Bayes' theorem is the antidote to this bias.

Bayes' Theorem as a Thinking Framework

Bayes' theorem isn't just a formula for computing conditional probabilities. It's a way of thinking about the world that has profound implications for data science and decision-making.

The Bayesian framework says: start with what you already believe (the prior), observe evidence (the data), and update your beliefs rationally (the posterior).

This is, in a very real sense, what all of data science does. You start with some understanding of the world (maybe vaccination rates are related to income). You collect data. You update your understanding based on what the data shows. Bayes' theorem provides the mathematically optimal way to do this updating.

Let's explore how the prior affects the posterior:

# How different priors lead to different conclusions
disease_rates = [0.0001, 0.001, 0.01, 0.05, 0.10, 0.20, 0.50]
sensitivity = 0.95
false_positive_rate = 0.05

print(f"Test: {sensitivity*100:.0f}% sensitivity, {false_positive_rate*100:.0f}% false positive rate")
print(f"{'Prior (base rate)':>20} {'Posterior (P(disease|+))':>30}")
print("=" * 55)

for prior in disease_rates:
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    posterior = (sensitivity * prior) / p_positive
    print(f"{prior:>20.4f} {posterior:>30.4f} ({posterior*100:.1f}%)")

Look at how dramatically the posterior changes with the prior! With a 0.01% base rate, a positive test gives only about 0.2% chance of disease. With a 50% base rate, it gives about 95%. The same test, the same result, but completely different conclusions depending on the context.

This is why doctors consider risk factors before ordering tests. A positive mammogram in a 70-year-old woman with a family history means something very different from the same result in a 25-year-old woman with no risk factors — because the base rates are different.

Key insight for data science: Whenever you're interpreting a result — a test statistic, a classification output, a model prediction — always ask: "What was the base rate?" A model that detects fraud with "99% accuracy" is useless if fraud occurs in only 0.01% of transactions, because the false positives will vastly outnumber the true positives.
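
The fraud claim above can be checked with the same Bayes arithmetic. Assuming a detector that flags 99% of fraudulent transactions and wrongly flags 1% of legitimate ones (my numbers, chosen to match the "99% accuracy" framing):

```python
# Base rates dominate: fraud detection with a "99% accurate" model
p_fraud = 0.0001          # 0.01% of transactions are fraudulent
sensitivity = 0.99        # flags 99% of actual fraud
false_positive = 0.01     # wrongly flags 1% of legitimate transactions

# Total probability a transaction gets flagged
p_flagged = sensitivity * p_fraud + false_positive * (1 - p_fraud)

# Bayes' theorem: P(fraud | flagged)
p_fraud_given_flag = sensitivity * p_fraud / p_flagged
print(f"P(fraud | flagged) = {p_fraud_given_flag:.4f}")
print(f"Roughly {p_fraud_given_flag * 1000:.0f} in 1,000 flagged transactions are actual fraud")
```

Under these assumptions, fewer than 1 in 100 flagged transactions is real fraud: the false positives swamp the true positives, exactly as in the medical test.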


20.6 The Law of Large Numbers — Why Casinos Always Win

The law of large numbers says that as you repeat an experiment more and more times, the average result gets closer and closer to the expected (theoretical) value.

This is why casinos are profitable: individual gamblers win and lose unpredictably, but over millions of bets, the casino's edge is guaranteed to show up. It's also why opinion polls work: any individual response is unpredictable, but the average of thousands of responses converges to something meaningful.

Let's watch this in action:

# Simulate coin flips and track the running proportion of heads
np.random.seed(42)
n_flips = 10000
flips = np.random.choice([0, 1], size=n_flips)  # 0=Tails, 1=Heads
running_proportion = np.cumsum(flips) / np.arange(1, n_flips + 1)

# Plot
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(running_proportion, color='steelblue', linewidth=0.5)
ax.axhline(y=0.5, color='red', linestyle='--', label='True probability (0.5)')
ax.set_xlabel('Number of flips')
ax.set_ylabel('Proportion of heads')
ax.set_title('Law of Large Numbers: Coin Flips')
ax.legend()
ax.set_ylim(0.3, 0.7)

# Zoom in on first 100 flips
ax_inset = fig.add_axes([0.55, 0.55, 0.35, 0.35])
ax_inset.plot(running_proportion[:100], color='steelblue', linewidth=1)
ax_inset.axhline(y=0.5, color='red', linestyle='--')
ax_inset.set_title('First 100 flips (zoomed)', fontsize=9)
ax_inset.set_ylim(0.2, 0.8)

plt.savefig('law_of_large_numbers.png', dpi=150, bbox_inches='tight')
plt.show()

Look at that plot. In the first few flips, the proportion bounces wildly — maybe 100% heads after one flip, 50% after two, 33% after three. But as the number of flips grows, the line gets closer and closer to 0.5 and stays there.

What the Law of Large Numbers Does NOT Say

This is important. The law of large numbers does NOT say:

  • That tails is "due" after a streak of heads (gambler's fallacy)
  • That short-run results must look like long-run results
  • That the universe "balances things out"

It says that proportions converge. The difference between observed heads and expected heads can actually grow in absolute terms — you might be 50 heads ahead of expected after 10,000 flips — but the proportion converges because 50 out of 10,000 is negligible.

# Demonstrate: absolute deviation can grow even as proportion converges
np.random.seed(42)
n_flips = 100000
flips = np.random.choice([0, 1], size=n_flips)

cumulative_heads = np.cumsum(flips)
expected_heads = np.arange(1, n_flips + 1) / 2
absolute_deviation = cumulative_heads - expected_heads
proportion = cumulative_heads / np.arange(1, n_flips + 1)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(absolute_deviation, color='coral', linewidth=0.5)
axes[0].set_title('Absolute Deviation from Expected\n(Can grow!)')
axes[0].set_xlabel('Number of flips')
axes[0].set_ylabel('Actual heads − Expected heads')

axes[1].plot(proportion, color='steelblue', linewidth=0.5)
axes[1].axhline(0.5, color='red', linestyle='--')
axes[1].set_title('Proportion of Heads\n(Converges to 0.5)')
axes[1].set_xlabel('Number of flips')
axes[1].set_ylabel('Proportion')

plt.tight_layout()
plt.savefig('lln_proportion_vs_count.png', dpi=150, bbox_inches='tight')
plt.show()

The Convergence Visualized

The law of large numbers is one of those ideas that feels obvious once you see it but is easy to misunderstand. Let's look at it from another angle.

# Multiple convergence paths
np.random.seed(42)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Multiple random walks toward 0.5
for i in range(10):
    flips = np.random.choice([0, 1], size=5000)
    proportions = np.cumsum(flips) / np.arange(1, 5001)
    axes[0].plot(proportions, alpha=0.4, linewidth=0.8)

axes[0].axhline(0.5, color='red', linestyle='--', linewidth=2)
axes[0].set_title('10 Different Sequences of Coin Flips\nAll converging to 0.5')
axes[0].set_xlabel('Number of Flips')
axes[0].set_ylabel('Running Proportion of Heads')
axes[0].set_ylim(0.3, 0.7)

# Right: How fast does convergence happen?
sample_sizes = [10, 50, 100, 500, 1000, 5000, 10000]
spreads = []
for n in sample_sizes:
    proportions = [np.mean(np.random.choice([0, 1], n)) for _ in range(1000)]
    spreads.append(np.std(proportions))

axes[1].plot(sample_sizes, spreads, 'o-', color='steelblue', linewidth=2)
axes[1].set_title('How Quickly Estimates Stabilize\n(Standard deviation of proportion)')
axes[1].set_xlabel('Number of Flips')
axes[1].set_ylabel('SD of Estimated Proportion')
axes[1].set_xscale('log')

plt.tight_layout()
plt.savefig('convergence.png', dpi=150, bbox_inches='tight')
plt.show()

The left plot shows 10 different sequences of coin flips. Each one starts wobbly and eventually settles near 0.5 — but each path is different. That's the randomness at work. The right plot shows how quickly the estimates stabilize: the spread decreases as a function of 1/sqrt(n). To cut the uncertainty in half, you need four times as many observations.

This has profound practical implications. It explains why election polls survey 1,000 people instead of 100 — and why surveying 10,000 gives only marginally better results than 1,000 (diminishing returns from the square root relationship).
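
The square-root relationship is easy to check with the textbook formula for the standard deviation of a sample proportion, sqrt(p(1-p)/n) (the helper function here is my own):

```python
import math

# Standard deviation of the proportion of heads in n fair-coin flips:
# sqrt(p * (1 - p) / n), with p = 0.5 for a fair coin
def sd_of_proportion(n, p=0.5):
    return math.sqrt(p * (1 - p) / n)

for n in [100, 1000, 4000, 10000]:
    print(f"n = {n:>6}: SD of proportion = {sd_of_proportion(n):.4f}")

# Quadrupling n halves the SD
ratio = sd_of_proportion(1000) / sd_of_proportion(4000)
print(f"SD(n=1000) / SD(n=4000) = {ratio:.1f}")
```

This is the diminishing-returns pattern in the polling example: going from 1,000 to 4,000 respondents only halves the uncertainty.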


20.7 Expected Value — The Long-Run Average

The expected value is the average outcome you'd get if you repeated an experiment infinitely many times. It's the probability-weighted average of all possible outcomes.

# Expected value of a fair die
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6] * 6

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(f"Expected value of a fair die: {expected_value:.2f}")

# Verify by simulation
rolls = np.random.randint(1, 7, size=1000000)
print(f"Average of 1,000,000 rolls:   {np.mean(rolls):.2f}")

The expected value of a die roll is 3.5 — a number you can never actually roll! But it's the number that the average converges to over many rolls.

Expected Value in Decision-Making

Expected value is how rational decisions are made under uncertainty.

# Should Marcus's bakery offer a "rain day" discount?
# Scenario: If it rains, discount brings in 50 extra customers ($500 revenue)
# If it doesn't rain, discount costs $200 in lost margin on regulars
# P(rain) in this city = 0.35

p_rain = 0.35
revenue_if_rain = 500
cost_if_no_rain = -200

expected_value = p_rain * revenue_if_rain + (1 - p_rain) * cost_if_no_rain
print(f"Expected value of rain discount: ${expected_value:.2f}")

if expected_value > 0:
    print("On average, the discount is profitable. Do it!")
else:
    print("On average, the discount loses money. Skip it.")
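A useful follow-up question: how high does the rain probability need to be before the discount breaks even? A quick sketch using the same hypothetical numbers as above:

```python
# The expected value is zero when p*500 + (1-p)*(-200) = 0,
# which solves to p = 200 / 700.
revenue_if_rain = 500
loss_if_no_rain = 200  # magnitude of the dry-day loss

p_breakeven = loss_if_no_rain / (revenue_if_rain + loss_if_no_rain)
print(f"Break-even P(rain): {p_breakeven:.3f}")
# Any rain probability above ~0.286 makes the discount profitable in expectation
```

Since the city's rain probability (0.35) exceeds the break-even point (about 0.286), the positive expected value follows immediately.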

20.8 The Gambler's Fallacy and Other Probability Traps

Our brains are wired to see patterns, even in randomness. This is usually helpful (noticing that rustling in the bushes might be a predator) but it leads us astray with probability. Here are the most common traps:

Trap 1: The Gambler's Fallacy

"I've gotten heads 5 times in a row — tails must be due!"

No. Each flip is independent. The coin doesn't remember.

# Simulate: after 5 heads in a row, what's P(heads) on flip 6?
np.random.seed(42)
n_simulations = 1000000
sequences = np.random.choice([0, 1], size=(n_simulations, 6))

# Find sequences where first 5 are all heads (1)
five_heads = np.all(sequences[:, :5] == 1, axis=1)
sixth_flips = sequences[five_heads, 5]

prob_heads_after_five = np.mean(sixth_flips)
print(f"After 5 heads in a row:")
print(f"  P(heads on flip 6): {prob_heads_after_five:.4f}")
print(f"  Still 0.50 — the coin has no memory!")

Trap 2: Base Rate Neglect

We covered this with Bayes' theorem. "The test is 99% accurate, so I must have the disease!" — ignoring that the disease is rare.

Trap 3: The Hot Hand Fallacy (Maybe)

"The basketball player made the last 5 shots — they're 'hot' and more likely to make the next one!"

This is more nuanced than the gambler's fallacy. In some contexts, performance streaks might be real (skill, confidence, fatigue). In pure games of chance, they're not. The debate about the "hot hand" in basketball is one of the most famous controversies in sports statistics.

Trap 4: Confusion of AND and OR

"What are the chances that two rare events BOTH happen?" People tend to underestimate how quickly "AND" probabilities shrink (you multiply small numbers) and overestimate how slowly "OR" probabilities grow.

# The birthday problem: P(at least 2 people share a birthday) in a group of 23
def birthday_simulation(group_size, n_sims=100000):
    matches = 0
    for _ in range(n_sims):
        birthdays = np.random.randint(1, 366, size=group_size)
        if len(set(birthdays)) < group_size:  # At least one duplicate
            matches += 1
    return matches / n_sims

for size in [10, 23, 30, 50, 70]:
    prob = birthday_simulation(size)
    print(f"  Group of {size:2d}: P(shared birthday) = {prob:.3f}")

With just 23 people, there's about a 50% chance that two share a birthday! Most people guess you'd need far more. The surprise comes from underestimating how many pairs exist — with 23 people, there are 23 * 22 / 2 = 253 pairs, each of which could match.
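The simulation can be cross-checked against the exact answer, which multiplies together the chances that each successive person misses all the earlier birthdays. A sketch under the same assumption of 365 equally likely birthdays:

```python
def birthday_exact(group_size):
    """Exact P(at least one shared birthday), assuming 365 equally
    likely birthdays (the same assumption the simulation makes)."""
    p_all_distinct = 1.0
    for i in range(group_size):
        # Person i+1 must avoid the i birthdays already taken
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

for size in [10, 23, 30, 50, 70]:
    print(f"  Group of {size:2d}: exact P = {birthday_exact(size):.3f}")
```

For a group of 23 the exact probability is about 0.507, which the simulated values should approach as the number of trials grows.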

Connecting to Real Life

The traps above aren't just abstract logical puzzles — they affect real decisions every day.

Medical decisions: A doctor tells you "this test is 95% accurate" and you test positive. Without understanding base rates, you might panic unnecessarily. With Bayesian thinking, you'd ask: "What's the base rate of this condition in people like me?" If it's 1 in 10,000, even a "95% accurate" test means you probably don't have it.
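The arithmetic behind that claim is a one-line application of Bayes' theorem. A sketch using the hypothetical numbers above, and assuming "95% accurate" means both 95% sensitivity and 95% specificity (real tests report these separately):

```python
# Hypothetical reading of the paragraph above: base rate 1 in 10,000,
# with "95% accurate" taken as sensitivity = specificity = 0.95
# (an assumption for illustration only).
prior = 1 / 10_000        # P(disease)
sensitivity = 0.95        # P(positive | disease)
specificity = 0.95        # P(negative | no disease)

p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive test) = {posterior:.4f}")
```

The posterior comes out below 0.2% — a positive result still leaves you overwhelmingly likely to be healthy, because false positives from the huge healthy population swamp the true positives.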

Investing: "This stock has gone up for 8 months in a row — it's due for a correction." That's the gambler's fallacy applied to markets. While mean reversion is a real phenomenon in some financial contexts, each month's return is influenced by current conditions, not by the need to "balance out" past returns.

Sports: "The team is on a 5-game winning streak — they must lose eventually." Gambler's fallacy again. Streaks in sports can reflect genuine skill differences, momentum, schedule difficulty, or plain luck — but the universe doesn't keep a ledger.

Criminal justice: "The DNA match probability is 1 in a billion — the suspect must be guilty." This is the prosecutor's fallacy in its most common modern form. With databases containing millions of profiles, even a 1-in-a-billion match probability could produce false hits. (We'll explore this deeply in Case Study 1.)

# Quick demonstration: DNA database false positives
# If you search a database of 10 million profiles for a match,
# and each has a 1 in 1 billion chance of a random match...

database_size = 10_000_000
p_random_match = 1 / 1_000_000_000

# P(at least one false match) = 1 - P(no false matches)
p_no_false_match = (1 - p_random_match) ** database_size
p_at_least_one = 1 - p_no_false_match

print(f"Database size: {database_size:,}")
print(f"P(random match per profile): {p_random_match}")
print(f"P(at least one false match in database): {p_at_least_one:.4f}")
print(f"That's about {p_at_least_one*100:.2f}% — small but not zero!")

20.9 Simulation as a Problem-Solving Strategy

One of the most powerful lessons from this chapter is that simulation is a general-purpose problem-solving tool. When the math gets complicated, you can often just simulate the answer.

Here's a problem that would be hard to solve with formulas but easy with simulation:

The Monty Hall Problem: You're on a game show. There are 3 doors. Behind one is a car; behind the other two are goats. You pick door 1. The host (who knows where the car is) opens door 3, showing a goat. Should you switch to door 2 or stick with door 1?

def monty_hall_simulation(n_games=100000):
    """Simulate the Monty Hall problem."""
    switch_wins = 0
    stay_wins = 0

    for _ in range(n_games):
        # Car is behind a random door (0, 1, or 2)
        car = np.random.randint(0, 3)
        # Player picks door 0
        pick = 0

        # Host opens a door that's not the player's pick and not the car
        available = [d for d in range(3) if d != pick and d != car]
        host_opens = np.random.choice(available)

        # Switch: pick the remaining door
        switch_to = [d for d in range(3) if d != pick and d != host_opens][0]

        if pick == car:
            stay_wins += 1
        if switch_to == car:
            switch_wins += 1

    print(f"Stay wins:   {stay_wins/n_games:.4f} ({stay_wins/n_games*100:.1f}%)")
    print(f"Switch wins: {switch_wins/n_games:.4f} ({switch_wins/n_games*100:.1f}%)")

monty_hall_simulation()

Switching wins about 2/3 of the time! This result is famously counterintuitive — even professional mathematicians have argued about it — but simulation makes it undeniable.
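Since the player always picks door 0 and the three car positions are equally likely, you can also enumerate the cases exhaustively instead of sampling. A minimal sketch:

```python
# Enumerate all 3 equally likely car positions; the player always picks door 0.
# After the host removes a goat door, switching reaches the car exactly when
# the first pick was wrong -- which happens in 2 of the 3 cases.
switch_wins = 0
for car in range(3):
    pick = 0
    if pick != car:
        switch_wins += 1
print(f"Switching wins in {switch_wins} of 3 cases = {switch_wins / 3:.3f}")
```

This exact enumeration confirms the 2/3 figure that the simulation approximates.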

When to Simulate vs. When to Calculate

You might wonder: if I can always simulate, why bother with formulas at all?

Good question. Here's the tradeoff:

Simulation is best when:
- The problem is complex and hard to solve analytically
- You want to check your analytical answer
- You need to explore "what if" scenarios
- You want to build intuition before learning the formula

Formulas are best when:
- The answer needs to be exact (not approximate)
- You need to compute the answer quickly (simulation takes time)
- You want to understand why the answer is what it is (formulas reveal structure)
- You need to communicate the logic to others (formulas are more transparent than code)

The best approach is to use both. Derive the answer analytically if you can, then simulate to verify. If you can't derive it, simulate — but run enough trials to be confident.

# Simulation accuracy depends on number of trials
for n_trials in [100, 1000, 10000, 100000, 1000000]:
    results = np.random.randint(1, 7, size=(n_trials, 2))
    p_seven = np.mean(results.sum(axis=1) == 7)
    error = abs(p_seven - 6/36)
    print(f"  {n_trials:>8,} trials: P(sum=7) = {p_seven:.4f} "
          f"(error: {error:.4f})")

More trials mean more accuracy, but with diminishing returns. For most practical purposes, 100,000 trials pins the answer down to about two or three decimal places.
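You can even predict the simulation error in advance: the standard error of an estimated probability p from n trials is sqrt(p(1-p)/n). A sketch applied to the dice example above:

```python
import numpy as np

# Predicted accuracy of a Monte Carlo estimate of P(sum = 7) = 6/36:
# the estimate typically lands within about two standard errors of the truth.
p = 6 / 36
ses = {}
for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    ses[n] = np.sqrt(p * (1 - p) / n)
    print(f"  {n:>9,} trials: standard error ≈ {ses[n]:.5f} "
          f"(typical error within about ±{2 * ses[n]:.5f})")
```

At 100,000 trials the standard error is roughly 0.0012, which is why the simulated errors in the table above hover around the third decimal place.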


20.10 The Progressive Project: Sampling Variability

Time to apply probability thinking to our Global Health Data Explorer. Elena wants to understand sampling variability — how much would her statistics change if she had data from a different random subset of countries?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate the "full population" of country vaccination rates
np.random.seed(20)
all_countries = np.clip(np.random.normal(72, 18, 200), 10, 99)

population_mean = np.mean(all_countries)
population_median = np.median(all_countries)
print(f"True population mean: {population_mean:.1f}%")
print(f"True population median: {population_median:.1f}%")

# What happens when we take random samples?
sample_sizes = [10, 30, 50, 100]
n_samples = 1000

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for ax, n in zip(axes.flat, sample_sizes):
    sample_means = [np.mean(np.random.choice(all_countries, n, replace=False))
                    for _ in range(n_samples)]

    ax.hist(sample_means, bins=30, color='steelblue', edgecolor='white', alpha=0.8)
    ax.axvline(population_mean, color='red', linestyle='--',
               label=f'True mean: {population_mean:.1f}')
    ax.axvline(np.mean(sample_means), color='orange', linestyle=':',
               label=f'Avg of samples: {np.mean(sample_means):.1f}')
    ax.set_title(f'Sample size n={n}\n(std of means: {np.std(sample_means):.1f})')
    ax.set_xlabel('Sample Mean Vaccination Rate (%)')
    ax.legend(fontsize=7)

plt.suptitle('Sampling Variability: How Sample Means Vary', fontsize=14)
plt.tight_layout()
plt.savefig('sampling_variability.png', dpi=150, bbox_inches='tight')
plt.show()

# Key insight: spread decreases with sample size
print("\nSpread of sample means (standard error):")
for n in sample_sizes:
    sample_means = [np.mean(np.random.choice(all_countries, n, replace=False))
                    for _ in range(n_samples)]
    print(f"  n={n:3d}: std of sample means = {np.std(sample_means):.2f}")

Elena's key insight: Larger samples give more stable estimates. With samples of 10 countries, the sample mean might be anywhere from 60% to 85%. With samples of 100, it's consistently between 70% and 75%. This is the foundation for everything we'll do in Chapters 22-23 (confidence intervals and hypothesis testing).
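Elena's simulated spreads can also be predicted analytically: the standard error of a sample mean is about sigma/sqrt(n), multiplied by a finite-population correction because she samples without replacement from only 200 countries (a detail beyond this chapter's scope). A sketch that recreates the same simulated population:

```python
import numpy as np

# Recreate the simulated "population" from the project code above
np.random.seed(20)
all_countries = np.clip(np.random.normal(72, 18, 200), 10, 99)

sigma = np.std(all_countries)
N = len(all_countries)
se_pred = {}
for n in [10, 30, 50, 100]:
    # sigma/sqrt(n), shrunk by the finite-population correction that
    # applies when sampling without replacement from N = 200
    fpc = np.sqrt((N - n) / (N - 1))
    se_pred[n] = sigma / np.sqrt(n) * fpc
    print(f"  n={n:3d}: predicted std of sample means ≈ {se_pred[n]:.2f}")
```

These predictions should closely track the simulated "std of sample means" values printed above.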


20.11 Chapter Summary

Let's consolidate what we've learned:

Probability basics:
- Probability is a number between 0 (impossible) and 1 (certain)
- Three interpretations: classical (counting outcomes), frequentist (long-run proportion), subjective (degree of belief)

Simulation is your Swiss Army knife:
- Generate random outcomes, count successes, divide by total
- Works for problems that would be difficult to solve with formulas
- Monte Carlo simulation is used throughout professional data science

The rules:
- Complement: P(not A) = 1 - P(A)
- Addition (OR): P(A or B) = P(A) + P(B) - P(A and B)
- Multiplication (AND, independent): P(A and B) = P(A) * P(B)
- Conditional: P(A|B) = P(A and B) / P(B)

Bayes' theorem tells you how to update beliefs with evidence:
- Start with the prior (base rate)
- Observe evidence (data)
- Compute the posterior (updated belief)
- The base rate matters enormously — ignoring it leads to wildly wrong conclusions

The law of large numbers: Proportions converge to true probabilities as sample size grows. This is why casinos win, why polls work, and why larger samples give better estimates.

Your brain's probability bugs:
- Gambler's fallacy (thinking past outcomes affect future independent events)
- Base rate neglect (ignoring priors)
- Misjudging how fast AND probabilities shrink and OR probabilities grow (the birthday problem)


Looking Ahead

You now understand the logic of uncertainty. In the next chapter, we'll build on this foundation by exploring distributions — the mathematical shapes that describe how probabilities are spread across outcomes. We'll meet the normal curve (the bell curve), learn why it shows up everywhere, and discover the Central Limit Theorem — the reason that your sampling variability simulation at the end of this chapter produced bell-shaped histograms. That's not a coincidence. It's one of the deepest results in all of mathematics.


Connections to Previous and Future Chapters

| Concept from This Chapter | Where It Came From | Where It's Going |
|---|---|---|
| Simulation with np.random | Built on Python loops from Ch. 3-4 | Core technique in Ch. 21 (CLT simulation) |
| Conditional probability | New in this chapter | Foundation for Bayes classifiers (Ch. 27) |
| Bayes' theorem | New in this chapter | Bayesian thinking throughout ML chapters |
| Law of large numbers | New in this chapter | Justification for sampling and polling (Ch. 22) |
| Base rates | New in this chapter | Critical in model evaluation (Ch. 29) |
| Sampling variability | Built on descriptive stats from Ch. 19 | Core concept in confidence intervals (Ch. 22) |
| Expected value | New in this chapter | Used in decision-making and loss functions (Ch. 26) |