Learning Objectives

  • Define probability using the classical, relative frequency, and subjective approaches
  • Apply the addition rule for mutually exclusive and non-mutually exclusive events
  • Apply the multiplication rule for independent events
  • Calculate complementary probabilities
  • Construct and interpret two-way (contingency) tables for probability

Chapter 8: Probability: The Foundation of Inference

"The theory of probabilities is at bottom nothing but common sense reduced to calculus." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)

Chapter Overview

Here's a question that will change how you think about the rest of this course: What does it mean to say there's a 30% chance of rain tomorrow?

It's not like 30% of the sky is going to rain. It doesn't mean it'll rain for 30% of the day. And it definitely doesn't mean the meteorologist is 30% sure and 70% clueless. So what does it mean?

That question — simple on the surface, surprisingly deep underneath — is what this entire chapter is about. We're entering the world of probability, and I want to be upfront with you: this is where statistics transitions from describing data you have to reasoning about data you don't have. Everything you've learned so far — graphs, summaries, data cleaning — was about making sense of the data sitting in front of you. Probability is about making sense of uncertainty itself.

And here's the thing: uncertainty is not the enemy. It's not a sign that something went wrong. Uncertainty is the raw material of every statistical conclusion you'll ever draw. Every confidence interval (Chapter 12), every hypothesis test (Chapter 13), every prediction (Chapter 22) is built on probability. If descriptive statistics is the language of what is, probability is the language of what could be.

So yes, this chapter matters. A lot. But I've also got good news: the basic rules of probability are surprisingly few, surprisingly intuitive, and once they click, they unlock everything that follows.


Fast Track: If you're comfortable with basic probability rules and can explain the difference between mutually exclusive and independent events, skim Sections 8.1-8.3 and jump to Section 8.7 (Contingency Tables). Complete quiz questions 1, 10, and 17 to verify your foundation.

Deep Dive: After this chapter, read Case Study 1 (the Monty Hall Problem) for a brain-bending exercise in probability intuition, then Case Study 2 (probability in sports) to see how these rules drive real-world predictions and billion-dollar industries.


8.1 A Puzzle Before We Start (Productive Struggle)

Before I teach you any rules, I want you to wrestle with a problem. Don't skip this. The struggle is the point.

The Birthday Puzzle

How many people do you need in a room before there's a better-than-even chance (greater than 50%) that at least two people share the same birthday?

Take a guess. Write it down. Don't overthink it — just go with your gut.

Most people guess somewhere around 183 (half of 365). That feels right, doesn't it? If there are 365 possible birthdays, you'd need about half that many people to have a 50-50 shot.

But the real answer is 23. Just 23 people.

If that surprises you — good. It surprises almost everyone. It's called the birthday paradox, and it's not actually a paradox at all. It just feels like one because our intuitions about probability are spectacularly unreliable.

Here's the key insight: with 23 people, you're not checking whether one specific person shares your birthday. You're checking whether any pair among all 23 people shares a birthday. And there are a lot of pairs — 253 of them, to be exact ($\binom{23}{2} = 253$). Each pair is a separate chance for a match.
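If you want to check that pair count yourself, Python's standard library can do it in one call:

```python
from math import comb

# C(23, 2): number of distinct pairs among 23 people
print(comb(23, 2))  # 253
```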

We'll return to this puzzle at the end of the chapter, once you have the tools to prove the answer mathematically. For now, let it sit with you as a reminder: your intuition about probability is often wrong, which is precisely why you need formal rules.


8.2 What Is Probability?

Let's start with the basics. What do we actually mean by "probability"?

Key Concept: Probability

Probability is a number between 0 and 1 (or equivalently, between 0% and 100%) that measures how likely an event is to occur.

  • A probability of 0 means the event is impossible.
  • A probability of 1 means the event is certain.
  • Everything else falls somewhere in between.

That definition is easy enough. The interesting question is: where do the numbers come from? It turns out there are three different approaches, and each one is useful in different situations.

The Classical Approach

The classical approach to probability works when every outcome is equally likely. It's the oldest and most intuitive approach — the kind of probability you might have encountered in a math class.

$$P(A) = \frac{\text{Number of outcomes favorable to event } A}{\text{Total number of equally likely outcomes}}$$

Example: Rolling a Die

What's the probability of rolling a 4 on a fair six-sided die?

  • There are 6 equally likely outcomes: {1, 2, 3, 4, 5, 6}
  • Exactly 1 outcome is favorable (rolling a 4)
  • $P(\text{rolling a 4}) = \frac{1}{6} \approx 0.167$ or about 16.7%

What's the probability of rolling an even number?

  • Favorable outcomes: {2, 4, 6} — that's 3 outcomes
  • $P(\text{even}) = \frac{3}{6} = \frac{1}{2} = 0.50$ or 50%
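These classical calculations are easy to mirror in code. Here's a small sketch (the helper `classical_probability` is just for illustration) that counts favorable outcomes and returns an exact fraction:

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]  # equally likely outcomes of a fair die

def classical_probability(event, space):
    """P(A) = (favorable outcomes) / (total equally likely outcomes)."""
    favorable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favorable), len(space))

print(classical_probability(lambda x: x == 4, sample_space))      # 1/6
print(classical_probability(lambda x: x % 2 == 0, sample_space))  # 1/2
```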

Before we go further, let's nail down some vocabulary.

Key Terms: The Building Blocks

  • Outcome: A single result of a random process. When you roll a die, "rolling a 3" is an outcome.
  • Sample space: The set of all possible outcomes. For a die, the sample space is {1, 2, 3, 4, 5, 6}. We often write it as S.
  • Event: A collection of one or more outcomes you're interested in. "Rolling an even number" is an event that contains three outcomes: {2, 4, 6}. We typically label events with capital letters like A, B, C.

Let me bring in our anchor examples. Sam Okafor is back at the Riverside Raptors, and he's wondering: if a player takes a shot, what's the probability it goes in? The classical approach doesn't help much here — basketball shots aren't like dice rolls. The outcomes (make or miss) aren't equally likely, and they depend on the player, the distance, the defense, fatigue, and a hundred other factors.

That's where the second approach comes in.

The Relative Frequency Approach

The relative frequency approach defines probability as the proportion of times an event occurs over many, many repetitions.

$$P(A) = \frac{\text{Number of times } A \text{ occurred}}{\text{Total number of trials}}$$

Spaced Review 1: Relative Frequency (Ch. 5/6)

Remember relative frequency from Chapter 5? When you built histograms, you calculated what fraction of observations fell into each bin. The relative frequency of a bin was (count in bin) ÷ (total count). That's exactly the same idea. In Chapter 5, you were describing data you had. Now, you're using that same calculation to estimate the probability of future events. The math is identical — the interpretation shifts from "what happened" to "what's likely to happen."

Example: Sam's Shooting Data

Sam has tracked one of his players, Daria Williams, over the entire season. In 500 shots, she made 215. What's the probability she makes her next shot?

$$P(\text{make}) = \frac{215}{500} = 0.430 \text{ or } 43.0\%$$

This is an estimate of the true probability. It's based on past performance, and it assumes that the conditions remain roughly similar. If Daria is exhausted in overtime or playing against a much stronger defense, this estimate might not hold. But 500 shots is a pretty good sample.

The relative frequency approach is powerful because it works even when outcomes aren't equally likely. But it has a requirement: you need data. Lots of it.

The Subjective Approach

What about situations where you can't repeat the event at all? What's the probability that a specific company's stock price doubles this year? What's the probability that it rains on your wedding day? What's the probability that a new drug receives FDA approval?

These are one-time events. You can't roll them like a die or repeat them 500 times. For situations like these, we use the subjective approach: probability as a person's degree of belief, informed by evidence, experience, and judgment.

Example: Dr. Maya Chen's Assessment

Maya is monitoring a new respiratory virus. Based on its genetic profile, transmission patterns in other countries, and vaccination coverage in her county, she estimates there's a 15% probability of a significant outbreak in her region this winter. That 15% isn't derived from a formula with equally likely outcomes. It isn't from repeating the same winter 100 times. It's her expert assessment — her best quantification of uncertainty given what she knows.

Is the subjective approach less "scientific" than the other two? Not necessarily. It's the approach used in Bayesian statistics (coming in Chapter 9), weather forecasting, medical diagnosis, and artificial intelligence. It's also the approach that every investor, doctor, and policymaker uses daily — whether they call it probability or not.

Comparing the Three Approaches

| Approach | How It Works | When to Use It | Example |
|---|---|---|---|
| Classical | Count equally likely outcomes | Games of chance, simple random processes | Probability of drawing a heart from a deck |
| Relative Frequency | Use data from repeated trials | When you have historical data or can run experiments | Batting average, defect rate on an assembly line |
| Subjective | Expert judgment and available evidence | One-time events, complex predictions | Probability of an earthquake, election forecasts |

All three approaches follow the same mathematical rules. That's the beautiful thing about probability theory — the rules don't care where your numbers came from. Once you have a probability, the machinery works the same way whether it was calculated from a formula, estimated from data, or assessed by an expert.

Theme Connection: AI Uses Probability Constantly (Theme 3)

Here's something worth knowing: AI and machine learning systems use all three approaches, but they especially rely on the relative frequency and subjective approaches. When a spam filter estimates the probability that an email is spam based on its word content, it's using relative frequencies from millions of previously labeled emails. When a self-driving car estimates the probability that a pedestrian will step into the road, it combines sensor data (relative frequency of past pedestrian behavior) with contextual judgment (it's near a crosswalk, the light just changed). Probability isn't just a chapter in a statistics textbook — it's the language that AI speaks.


8.3 The Law of Large Numbers: Why More Data Means Better Estimates

If you flip a fair coin 10 times, you might get 7 heads and 3 tails. That's 70% heads — way off from the "true" probability of 50%. Does that mean the coin is unfair?

No. It means 10 flips isn't enough to see the pattern clearly.

But if you flip it 100 times, you'll probably get somewhere between 40 and 60 heads. Flip it 1,000 times, and you'll likely see between 470 and 530 heads. Flip it 10,000 times, and the proportion of heads will be very, very close to 0.50.

This is the law of large numbers, and it's one of the most important ideas in all of statistics.

Key Concept: The Law of Large Numbers

As the number of trials increases, the relative frequency of an event gets closer and closer to its true probability.

In mathematical notation: as $n \to \infty$, the observed proportion $\hat{p} \to P(A)$.

The law of large numbers is why casinos always win in the long run. Any individual gambler might get lucky on a given night. But the casino plays millions of hands, spins millions of wheels, and accepts millions of bets. Over that many trials, the house edge — even if it's just 2% or 3% — grinds out a profit with near-mathematical certainty.

It's also why Sam can trust a shooting percentage based on 500 shots more than one based on 5 shots. Five shots is a tiny sample — Daria could easily make 4 out of 5 (80%) or 1 out of 5 (20%) just by chance. But 500 shots? The law of large numbers says that 43.0% is a reliable estimate.

Seeing It in Action: A Simulation

Let's watch the law of large numbers come alive. We'll simulate coin flips in Python and track how the proportion of heads evolves as we flip more and more coins.

import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Simulate 10,000 coin flips (1 = heads, 0 = tails)
n_flips = 10000
flips = np.random.choice([0, 1], size=n_flips)

# Calculate the running proportion of heads
cumulative_heads = np.cumsum(flips)
flip_numbers = np.arange(1, n_flips + 1)
running_proportion = cumulative_heads / flip_numbers

# Plot
plt.figure(figsize=(10, 5))
plt.plot(flip_numbers, running_proportion, color='steelblue', linewidth=0.8)
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=1.5, label='True probability (0.50)')
plt.xlabel('Number of Flips', fontsize=12)
plt.ylabel('Proportion of Heads', fontsize=12)
plt.title('Law of Large Numbers: Coin Flip Simulation', fontsize=14)
plt.legend(fontsize=11)
plt.ylim(0.3, 0.7)
plt.tight_layout()
plt.show()

# Print some checkpoints
for n in [10, 50, 100, 500, 1000, 5000, 10000]:
    prop = cumulative_heads[n-1] / n
    print(f"After {n:>5} flips: proportion of heads = {prop:.4f}")

What to expect: The plot shows wild swings early on — after 10 flips, the proportion might be 0.30 or 0.70 or anything. But as the number of flips increases, the line gradually settles down and hugs the true probability of 0.50. By 10,000 flips, it's very close.

This is the law of large numbers in a picture. The more data you have, the more reliable your probability estimates become.

The Gambler's Fallacy: What the Law of Large Numbers Does Not Say

Common Misconception: The Gambler's Fallacy

A roulette wheel has landed on red five times in a row. A gambler thinks, "Black is due. It has to come up soon to balance things out."

This is wrong. The gambler's fallacy is the mistaken belief that past random events influence future ones. Each spin of the roulette wheel is independent — the ball has no memory of where it landed before. The probability of black on the next spin is exactly the same as it always is, regardless of what happened on the previous five spins.

The law of large numbers says that over thousands of spins, the proportion will approach the true probability. It does NOT say the universe needs to "correct" itself after a streak. The evening out happens because future results dilute past streaks — not because the wheel compensates for them.

Let's simulate this to see the fallacy in action:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)

# Simulate 1000 coin flips
flips = np.random.choice(['H', 'T'], size=1000)

# Find streaks of 5+ heads and check what comes next
streak_length = 5
results_after_streak = []

for i in range(len(flips) - streak_length):
    # Check whether flips[i] through flips[i+streak_length-1] are all heads
    if all(flips[i:i+streak_length] == 'H'):
        # The index i + streak_length is always in range here, so record
        # the outcome of the flip immediately after the streak
        results_after_streak.append(flips[i + streak_length])

if len(results_after_streak) > 0:
    heads_after = sum(1 for r in results_after_streak if r == 'H')
    total_after = len(results_after_streak)
    print(f"After a streak of {streak_length} heads:")
    print(f"  Next flip was heads: {heads_after}/{total_after} "
          f"({heads_after/total_after:.1%})")
    print(f"  Next flip was tails: {total_after - heads_after}/{total_after} "
          f"({(total_after - heads_after)/total_after:.1%})")
    print(f"\nThe coin doesn't 'remember' its streak.")
    print(f"The probability is still roughly 50/50.")
else:
    print(f"No streaks of {streak_length} heads found in this simulation.")

What this shows: Even after a streak of five heads in a row, the next flip is still approximately 50/50. The coin doesn't "owe" you tails.

Threshold Concept: Probability as Long-Run Frequency

Here's the conceptual shift this chapter asks you to make: stop thinking about individual events and start thinking about patterns over many events.

When we say "the probability of heads is 0.50," we're NOT saying this particular flip will be heads half the time (that doesn't even make sense — a single flip is either heads or tails, period). We're saying that if you flip the coin thousands of times, about half will be heads.

This is a fundamentally different way of thinking. Instead of asking "What WILL happen?" you're asking "What TENDS to happen?" Instead of certainty, you're working with tendencies. Instead of individual outcomes, you're reasoning about long-run behavior.

This shift — from certainty to probabilistic thinking — is one of the most important intellectual moves you'll make in this course. It's what makes statistical inference possible. And it's exactly how Sam needs to think about Daria's shooting: not "Will she make the next shot?" but "Over many shots in similar conditions, what fraction will she make?"

If this feels uncomfortable, good. Sit with it. Let it marinate. By the end of Part 3, this way of thinking will feel like second nature.


8.4 Basic Probability Rules: Building Your Toolkit

Now that you know what probability is, let's learn the rules for calculating it. There are really only a few core rules, and they'll carry you through the rest of this course.

Rule 1: Probabilities Must Be Between 0 and 1

$$0 \leq P(A) \leq 1 \text{ for any event } A$$

If someone tells you the probability of something is -0.3 or 1.4, they've made an error. Full stop.

Rule 2: The Probabilities of All Outcomes Must Sum to 1

$$\sum_{i} P(\text{outcome}_i) = 1$$

If a die has six faces, the probabilities of rolling 1, 2, 3, 4, 5, and 6 must add up to exactly 1. Something has to happen.

Rule 3: The Complement Rule

This is one of the most useful rules in probability, and it's beautifully simple.

Key Concept: Complement

The complement of an event A (written $A'$ or $A^c$ or $\bar{A}$) is the event that A does NOT occur. It includes all outcomes in the sample space that are NOT in A.

$$P(A') = 1 - P(A)$$

Why it's useful: Sometimes it's much easier to calculate the probability of something not happening than the probability of it happening. You just calculate the easy one and subtract from 1.

Mathematical Formulation: The Complement Rule

$$\boxed{P(\text{not } A) = 1 - P(A)}$$

Equivalently: $P(A) + P(A') = 1$

In words: The probability that something happens plus the probability that it doesn't happen always equals 1. Something has to happen, and those are the only two options.

Example: Professor Washington's Risk Assessment

Professor Washington is studying a predictive policing algorithm that assigns risk scores to neighborhoods. A particular neighborhood has a 0.12 probability of being flagged as "high risk" on any given day. What's the probability it's NOT flagged?

$$P(\text{not flagged}) = 1 - P(\text{flagged}) = 1 - 0.12 = 0.88$$

There's an 88% chance the neighborhood is not flagged on any given day. Simple — but powerful.

Example: The Birthday Puzzle Preview

Remember the birthday puzzle? Here's why the complement rule makes it solvable. Calculating the probability that at least two people share a birthday directly is a nightmare — you'd have to consider every possible pair, triple, and group. But calculating the probability that nobody shares a birthday? That's much more manageable. Then you just subtract from 1.

$$P(\text{at least one match}) = 1 - P(\text{no matches at all})$$

We'll work through the full calculation in Section 8.9.
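As a preview, here's a sketch of that complement calculation in Python (it assumes all 365 birthdays are equally likely and independent, the standard simplification we'll also use in Section 8.9):

```python
def p_shared_birthday(n, days=365):
    """P(at least one shared birthday) among n people,
    assuming all birthdays are equally likely and independent."""
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days  # person i avoids all earlier birthdays
    return 1 - p_no_match

print(f"{p_shared_birthday(22):.4f}")  # 0.4757 -- just under 50%
print(f"{p_shared_birthday(23):.4f}")  # 0.5073 -- just over 50%
```

The crossover from under 50% to over 50% happens at exactly 23 people, which is the answer to the puzzle from Section 8.1.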


8.5 The Addition Rule: "Or" Probabilities

What's the probability that event A or event B occurs (or both)?

Before I give you the formula, I need to introduce an important distinction.

Key Terms: Mutually Exclusive Events

Two events are mutually exclusive (also called disjoint) if they cannot both happen at the same time. If one occurs, the other automatically cannot.

  • Rolling a 2 AND rolling a 5 on a single die roll? Mutually exclusive — you can't roll two numbers at once.
  • Drawing a heart AND drawing a spade from a single draw? Mutually exclusive — a card can't be both suits.
  • Being left-handed AND being right-handed? Mutually exclusive (for most people).

NOT mutually exclusive:

  • Drawing a heart AND drawing a queen? NOT mutually exclusive — the Queen of Hearts is both.
  • Being a smoker AND having asthma? NOT mutually exclusive — a person can be both.

The Addition Rule for Mutually Exclusive Events

If events A and B are mutually exclusive, the probability of A or B is simply the sum of their individual probabilities:

$$P(A \text{ or } B) = P(A) + P(B) \quad \text{(if } A \text{ and } B \text{ are mutually exclusive)}$$

Example: Rolling a Die

What's the probability of rolling a 2 or a 5?

These events are mutually exclusive (you can't roll both at once), so:

$$P(2 \text{ or } 5) = P(2) + P(5) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3} \approx 0.333$$

The General Addition Rule

But what if the events AREN'T mutually exclusive? What if they can overlap?

Here's the intuition. Imagine you're asking: "What's the probability of drawing a heart or a queen from a standard deck?"

  • $P(\text{heart}) = \frac{13}{52}$ (13 hearts in a deck)
  • $P(\text{queen}) = \frac{4}{52}$ (4 queens in a deck)

If you just add those: $\frac{13}{52} + \frac{4}{52} = \frac{17}{52}$. But wait — the Queen of Hearts got counted twice. Once as a heart, and once as a queen. You need to subtract the overlap.

Mathematical Formulation: The General Addition Rule

$$\boxed{P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)}$$

In words: To find the probability of A or B (or both), add their individual probabilities, then subtract the probability that both occur (to avoid double-counting the overlap).

Special case: If A and B are mutually exclusive, then $P(A \text{ and } B) = 0$, and the formula simplifies to $P(A \text{ or } B) = P(A) + P(B)$.

Example: Hearts or Queens

$$P(\text{heart or queen}) = P(\text{heart}) + P(\text{queen}) - P(\text{heart and queen})$$ $$= \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13} \approx 0.308$$

There are 16 cards that are either hearts or queens (or both): the 13 hearts plus the 3 non-heart queens. That checks out.
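You can verify that count by brute force: build the deck and filter it. A quick sketch:

```python
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# Cards that are hearts OR queens; the Queen of Hearts is counted once
hearts_or_queens = [c for c in deck if c[1] == 'hearts' or c[0] == 'Q']
print(len(hearts_or_queens), len(deck))  # 16 52
```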

Example: Alex's StreamVibe Data

Alex is analyzing user behavior on StreamVibe. He finds that among 1,000 users:

  • 380 watched a comedy in the past week
  • 290 watched a documentary
  • 85 watched both a comedy AND a documentary

What's the probability that a randomly selected user watched a comedy or a documentary (or both)?

$$P(\text{comedy or documentary}) = \frac{380}{1000} + \frac{290}{1000} - \frac{85}{1000} = \frac{585}{1000} = 0.585$$

There's a 58.5% probability that a random user watched at least one of these two genres. Notice that if Alex had just added 380 + 290 = 670, he'd overcount by 85 users (the ones who watched both).

Visualizing the Addition Rule

Think of a Venn diagram. Two overlapping circles, one for event A and one for event B. The "or" probability covers everything inside either circle. If you add the area of circle A to the area of circle B, you've counted the overlap region twice. So you subtract it once to get the correct total.

     ┌─────────────────────────────┐
     │          Sample Space       │
     │                             │
     │    ┌───────┐  ┌───────┐    │
     │    │       │  │       │    │
     │    │   A   │AB│   B   │    │
     │    │       │  │       │    │
     │    └───────┘  └───────┘    │
     │                             │
     └─────────────────────────────┘

P(A or B) = P(A) + P(B) - P(A and B)
            [all of A] + [all of B] - [overlap counted twice]

8.6 The Multiplication Rule: "And" Probabilities

What's the probability that event A and event B both occur?

This depends on whether the events are independent — one of the most important concepts in probability.

Key Concept: Independent Events

Two events are independent if knowing that one occurred doesn't change the probability of the other occurring. In other words, they don't influence each other.

  • Flipping a coin and rolling a die? Independent — the coin doesn't care what the die shows.
  • The weather today and your exam grade? Independent (unless you skip the exam because of a storm).
  • Drawing two cards from a deck with replacement (putting the first card back)? Independent.
  • Drawing two cards without replacement? NOT independent — removing the first card changes the probabilities for the second draw.

The Multiplication Rule for Independent Events

If events A and B are independent, the probability of both occurring is simply the product of their individual probabilities:

Mathematical Formulation: The Multiplication Rule (Independent Events)

$$\boxed{P(A \text{ and } B) = P(A) \times P(B)} \quad \text{(if } A \text{ and } B \text{ are independent)}$$

In words: For independent events, multiply the individual probabilities.

Example: Two Coin Flips

What's the probability of getting heads on two consecutive fair coin flips?

The flips are independent (the first flip doesn't affect the second), so:

$$P(\text{H and H}) = P(\text{H}) \times P(\text{H}) = 0.5 \times 0.5 = 0.25$$

There's a 25% chance of two heads in a row. You can verify this by listing the sample space: {HH, HT, TH, TT} — four equally likely outcomes, one of which is HH.
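A few lines of Python can enumerate that sample space (a sketch using `itertools.product`):

```python
from itertools import product

# Enumerate the sample space of two fair coin flips
outcomes = list(product('HT', repeat=2))
print(outcomes)  # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]

favorable = [o for o in outcomes if o == ('H', 'H')]
print(len(favorable) / len(outcomes))  # 0.25
```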

Example: Sam's Independent Events

Two Riverside Raptors players take free throws. Player A makes free throws 80% of the time. Player B makes them 70% of the time. Assuming their shots are independent (one player's performance doesn't affect the other's), what's the probability both players make their free throws?

$$P(\text{both make it}) = 0.80 \times 0.70 = 0.56$$

There's a 56% probability that both players sink their free throws.

What about the probability that at least one misses? Here's where the complement rule meets the multiplication rule:

$$P(\text{at least one miss}) = 1 - P(\text{both make it}) = 1 - 0.56 = 0.44$$

There's a 44% chance that at least one of them misses. Notice how the complement made this easy — calculating "at least one miss" directly would require considering three separate cases (A misses and B makes, A makes and B misses, both miss).
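If you'd like to confirm that the two routes agree, here's a quick sketch that computes "at least one miss" both ways:

```python
p_a, p_b = 0.80, 0.70  # free-throw probabilities, assumed independent

# Route 1: complement of "both make it"
p_at_least_one_miss = 1 - p_a * p_b

# Route 2: sum the three mutually exclusive miss cases
p_cases = (p_a * (1 - p_b)            # A makes, B misses
           + (1 - p_a) * p_b          # A misses, B makes
           + (1 - p_a) * (1 - p_b))   # both miss

print(round(p_at_least_one_miss, 2), round(p_cases, 2))  # 0.44 0.44
```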

Example: The Power of Multiplication — Maya's Disease Screening

Maya's county uses a screening test for a respiratory disease. The test has a 2% false positive rate, meaning there's a 0.02 probability that a healthy person tests positive. If 3 healthy people are tested independently, what's the probability that ALL three test positive?

$$P(\text{all 3 false positives}) = 0.02 \times 0.02 \times 0.02 = 0.000008$$

That's 8 in a million — extremely unlikely. The multiplication rule shows how probabilities of independent events shrink rapidly when multiplied together. This is why multiple lines of evidence are so compelling: the probability of getting the same wrong answer repeatedly, by pure chance, becomes vanishingly small.
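As a sanity check on the arithmetic:

```python
p_fp = 0.02  # false positive probability for one healthy person

# Independent tests multiply: each extra test shrinks the joint probability
p_all_three = p_fp ** 3
print(f"{p_all_three:.6f}")  # 0.000008
```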

Spaced Review 2: Randomization (Ch. 4)

In Chapter 4, you learned that randomization is the gold standard for experiments — it protects against bias by ensuring that treatment and control groups are similar. Here's the probability connection: randomization works because it makes group assignment independent of all other variables. When you randomly assign patients to treatment or control, their assignment is independent of their age, health, genetics, diet — everything. That independence is what allows us to attribute differences between groups to the treatment rather than to confounders. Randomization is an application of the multiplication rule in disguise.

A Critical Distinction: Mutually Exclusive ≠ Independent

Students frequently confuse these two concepts. Let me be blunt: they are completely different ideas.

| | Mutually Exclusive | Independent |
|---|---|---|
| Question asked | Can A and B happen at the same time? | Does knowing about A change the probability of B? |
| If YES | $P(A \text{ and } B) = 0$ | $P(A \text{ and } B) = P(A) \times P(B)$ |
| Relationship | About overlap | About influence |
| Example | Rolling a 2 or a 5 on one die | Rolling a 2 on one die and flipping heads on a coin |
| Can they coexist? | No — if events are mutually exclusive with non-zero probabilities, they CANNOT be independent | Only if at least one has probability 0 |

That last row is surprising: if two events are mutually exclusive (they can't both happen), they're actually dependent. Why? Because if I tell you event A occurred, you immediately know event B did NOT occur. Knowing about A changed your knowledge of B — that's dependence.
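A quick numeric check makes the point concrete, using exact fractions and the die events from the table's example column:

```python
from fractions import Fraction

# One fair die: A = "roll a 2", B = "roll a 5" -- mutually exclusive events
p_a = Fraction(1, 6)
p_b = Fraction(1, 6)
p_a_and_b = Fraction(0)  # they can't both happen on one roll

# Independence would require P(A and B) = P(A) * P(B)
print(p_a * p_b)               # 1/36
print(p_a_and_b == p_a * p_b)  # False -> A and B are dependent
```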


8.7 Two-Way Tables: Probability from Data

So far, our examples have been fairly abstract — coins, dice, cards. Let's bring probability back to data.

Spaced Review 3: Categorical vs. Numerical (Ch. 2)

In Chapter 2, you learned the difference between categorical variables (categories or labels) and numerical variables (numbers with meaningful arithmetic). Contingency tables — the topic of this section — are built from two categorical variables. They show how many observations fall into each combination of categories. Remember: categorical variables classify observations into groups. That classification is exactly what makes probability calculations from contingency tables possible.

A contingency table (also called a two-way table or cross-tabulation) displays the frequency of observations for every combination of two categorical variables.

Let's build one with real data.

Example: Maya's Health Study

Maya is examining the relationship between smoking status and respiratory illness in a sample of 500 adults from her county.

| | Has Respiratory Illness | No Respiratory Illness | Row Total |
|---|---|---|---|
| Smoker | 68 | 92 | 160 |
| Non-Smoker | 42 | 298 | 340 |
| Column Total | 110 | 390 | 500 |

This table tells a story. There are 500 people total. 160 are smokers, 340 are non-smokers. 110 have a respiratory illness, 390 don't. And the four interior cells show every combination: 68 people are both smokers and have a respiratory illness, 92 are smokers without respiratory illness, and so on.

Let's calculate some probabilities.

Marginal Probability

A marginal probability is the probability of a single event, calculated from the row or column totals (the "margins" of the table).

$$P(\text{smoker}) = \frac{160}{500} = 0.32$$

$$P(\text{respiratory illness}) = \frac{110}{500} = 0.22$$

Joint Probability

A joint probability is the probability of two events occurring together — the probability of being in a specific cell of the table.

Key Term: Joint Probability

Joint probability is the probability that two events occur simultaneously. In a contingency table, it's the cell count divided by the grand total.

$$P(\text{smoker and respiratory illness}) = \frac{68}{500} = 0.136$$

$$P(\text{non-smoker and no respiratory illness}) = \frac{298}{500} = 0.596$$

Using the Table to Verify the Addition Rule

What's the probability that a randomly selected person is a smoker OR has a respiratory illness (or both)?

Using the addition rule:

$$P(\text{smoker or illness}) = P(\text{smoker}) + P(\text{illness}) - P(\text{smoker and illness})$$ $$= \frac{160}{500} + \frac{110}{500} - \frac{68}{500} = \frac{202}{500} = 0.404$$

You can verify this by counting directly: 92 (smoker, no illness) + 68 (smoker, illness) + 42 (non-smoker, illness) = 202 people are either smokers, have respiratory illness, or both. $\frac{202}{500} = 0.404$. ✓

Example: Alex's User Segments

Alex creates a contingency table from StreamVibe's user data, crossing subscription plan with device type.

|              | Mobile | Desktop | Smart TV | Row Total |
|--------------|--------|---------|----------|-----------|
| Free Plan    | 245    | 180     | 75       | 500       |
| Basic Plan   | 120    | 150     | 130      | 400       |
| Premium Plan | 35     | 70      | 195      | 300       |
| Column Total | 400    | 400     | 400      | 1,200     |

Let's calculate:

Joint probability: What's the probability a random user is on the Premium plan AND uses a Smart TV?

$$P(\text{premium and Smart TV}) = \frac{195}{1200} = 0.1625$$

Marginal probability: What's the probability a random user is on the Free plan?

$$P(\text{free plan}) = \frac{500}{1200} = 0.4167$$

Addition rule: What's the probability a user is on the Free plan OR uses mobile?

$$P(\text{free or mobile}) = \frac{500}{1200} + \frac{400}{1200} - \frac{245}{1200} = \frac{655}{1200} = 0.5458$$
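If you want to check these results with pandas rather than by hand, the counts can be entered directly as a small DataFrame. This is a quick sketch; the numbers are the ones from Alex's table above, and the row/column labels are shortened for convenience:

```python
import pandas as pd

# Alex's counts, entered directly from the table above
counts = pd.DataFrame(
    {'Mobile': [245, 120, 35],
     'Desktop': [180, 150, 70],
     'Smart TV': [75, 130, 195]},
    index=['Free', 'Basic', 'Premium'])

total = counts.to_numpy().sum()  # grand total: 1,200 users

# Joint probability: a specific cell divided by the grand total
p_premium_tv = counts.loc['Premium', 'Smart TV'] / total

# Marginal probabilities: row or column totals divided by the grand total
p_free = counts.loc['Free'].sum() / total
p_mobile = counts['Mobile'].sum() / total

# Addition rule: P(free or mobile) = P(free) + P(mobile) - P(free and mobile)
p_free_and_mobile = counts.loc['Free', 'Mobile'] / total
p_free_or_mobile = p_free + p_mobile - p_free_and_mobile

print(f"P(Premium and Smart TV) = {p_premium_tv:.4f}")  # 0.1625
print(f"P(Free plan) = {p_free:.4f}")                   # 0.4167
print(f"P(Free or Mobile) = {p_free_or_mobile:.4f}")    # 0.5458
```

Entering the table as a DataFrame also makes it easy to experiment: any other joint, marginal, or "or" probability is just a different combination of cells and margins.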

Python: Building and Analyzing Contingency Tables

```python
import pandas as pd

# Rebuild Maya's 500 observations as a DataFrame
# (the counts are fixed, so no random seed is needed)
smoking_status = (['Smoker'] * 160) + (['Non-Smoker'] * 340)
respiratory = (['Yes'] * 68 + ['No'] * 92 +   # Smokers
               ['Yes'] * 42 + ['No'] * 298)   # Non-Smokers

health_df = pd.DataFrame({
    'smoking_status': smoking_status,
    'respiratory_illness': respiratory
})

# Create the contingency table
contingency = pd.crosstab(health_df['smoking_status'],
                          health_df['respiratory_illness'],
                          margins=True)
print("Contingency Table (Counts):")
print(contingency)
print()

# Convert to probabilities (divide every cell by the grand total)
prob_table = pd.crosstab(health_df['smoking_status'],
                         health_df['respiratory_illness'],
                         margins=True,
                         normalize='all')
print("Contingency Table (Joint Probabilities):")
print(prob_table.round(4))
print()

# Calculate specific probabilities
total = len(health_df)
print(f"P(Smoker) = {160/total:.4f}")
print(f"P(Respiratory Illness) = {110/total:.4f}")
print(f"P(Smoker AND Illness) = {68/total:.4f}")
print(f"P(Smoker OR Illness) = {(160 + 110 - 68)/total:.4f}")
print(f"P(Non-Smoker AND No Illness) = {298/total:.4f}")
```

8.8 Putting It All Together: Worked Examples

Let's work through some examples that combine multiple rules. This is where probability starts to feel powerful.

Worked Example 1: Professor Washington's Data

Professor Washington is studying 800 cases reviewed by a predictive policing algorithm. He categorizes each case by the algorithm's recommendation and the actual outcome.

|                      | Re-Offense | No Re-Offense | Total |
|----------------------|------------|---------------|-------|
| Algorithm: High Risk | 120        | 180           | 300   |
| Algorithm: Low Risk  | 40         | 460           | 500   |
| Total                | 160        | 640           | 800   |

a) What's the probability a randomly selected case was flagged as high risk?

$$P(\text{high risk}) = \frac{300}{800} = 0.375$$

b) What's the probability a case involved a re-offense AND was flagged as high risk?

$$P(\text{re-offense and high risk}) = \frac{120}{800} = 0.150$$

c) What's the probability a case involved a re-offense OR was flagged as high risk (or both)?

$$P(\text{re-offense or high risk}) = P(\text{re-offense}) + P(\text{high risk}) - P(\text{both})$$ $$= \frac{160}{800} + \frac{300}{800} - \frac{120}{800} = \frac{340}{800} = 0.425$$

d) What's the probability a case was NOT flagged as high risk?

$$P(\text{not high risk}) = 1 - P(\text{high risk}) = 1 - 0.375 = 0.625$$

e) If Washington selects two cases at random (with replacement), what's the probability both were flagged as high risk?

Since the selections are independent (with replacement):

$$P(\text{both high risk}) = 0.375 \times 0.375 = 0.1406$$
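All five answers can be checked in a few lines of Python, using the counts straight from the table above:

```python
# Counts from Professor Washington's table
total = 800
high_risk = 300
reoffense = 160
both = 120

p_high = high_risk / total                     # (a) 0.375
p_both = both / total                          # (b) 0.150
p_or = (reoffense + high_risk - both) / total  # (c) 0.425
p_not_high = 1 - p_high                        # (d) 0.625
p_two_high = p_high ** 2                       # (e) 0.140625, with replacement

print(f"(a) {p_high}  (b) {p_both}  (c) {p_or}")
print(f"(d) {p_not_high}  (e) {p_two_high:.4f}")
```

Notice how each rule shows up as a one-line arithmetic step: the addition rule in (c), the complement rule in (d), and the multiplication rule for independent events in (e).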

Worked Example 2: Dice Probabilities

You roll two fair dice. What's the probability the sum is 7?

First, let's figure out the sample space. Each die has 6 outcomes, so there are $6 \times 6 = 36$ equally likely outcomes for two dice.

How many ways can the sum be 7?

| Die 1 | Die 2 | Sum |
|-------|-------|-----|
| 1     | 6     | 7   |
| 2     | 5     | 7   |
| 3     | 4     | 7   |
| 4     | 3     | 7   |
| 5     | 2     | 7   |
| 6     | 1     | 7   |

Six ways.

$$P(\text{sum} = 7) = \frac{6}{36} = \frac{1}{6} \approx 0.167$$

Let's verify with a simulation:

```python
import numpy as np

np.random.seed(42)

# Simulate rolling two dice 100,000 times
n_rolls = 100000
die1 = np.random.randint(1, 7, size=n_rolls)
die2 = np.random.randint(1, 7, size=n_rolls)
sums = die1 + die2

# Count how many times the sum was 7
count_7 = np.sum(sums == 7)
proportion_7 = count_7 / n_rolls

print(f"Out of {n_rolls:,} rolls:")
print(f"  Sum of 7 occurred {count_7:,} times")
print(f"  Proportion: {proportion_7:.4f}")
print(f"  Theoretical: {1/6:.4f}")
print(f"\nThe simulation closely matches the theoretical probability!")

# Bonus: distribution of all possible sums
print(f"\nDistribution of sums (2-12):")
for s in range(2, 13):
    count = np.sum(sums == s)
    bar = '█' * int(count / n_rolls * 100)
    print(f"  Sum {s:>2}: {count/n_rolls:.4f} {bar}")
```

Worked Example 3: Combining Rules

A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles (10 total).

a) What's the probability of drawing a red marble?

$$P(\text{red}) = \frac{5}{10} = 0.50$$

b) What's the probability of drawing a red OR blue marble?

These are mutually exclusive (a marble can't be both colors):

$$P(\text{red or blue}) = P(\text{red}) + P(\text{blue}) = \frac{5}{10} + \frac{3}{10} = \frac{8}{10} = 0.80$$

c) What's the probability of NOT drawing a green marble?

$$P(\text{not green}) = 1 - P(\text{green}) = 1 - \frac{2}{10} = 0.80$$

Notice that (b) and (c) give the same answer — and they should! "Red or blue" is the same thing as "not green" when those are the only three colors.

d) You draw two marbles WITH replacement. What's the probability both are red?

The draws are independent (you put the first marble back):

$$P(\text{both red}) = 0.50 \times 0.50 = 0.25$$

e) What's the probability of drawing at least one blue marble in two draws (with replacement)?

Use the complement:

$$P(\text{at least one blue}) = 1 - P(\text{no blue in either draw})$$ $$= 1 - P(\text{not blue}) \times P(\text{not blue})$$ $$= 1 - \frac{7}{10} \times \frac{7}{10} = 1 - 0.49 = 0.51$$
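Part (e) is a good candidate for a quick simulation check. Here's a minimal sketch that encodes the ten marbles as the integers 0-9 (0-4 red, 5-7 blue, 8-9 green); the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Each trial: draw two marbles with replacement from the ten marbles
draws = rng.integers(0, 10, size=(n_trials, 2))

# Blue marbles are coded 5, 6, 7
is_blue = (draws >= 5) & (draws <= 7)

# "At least one blue" means the complement of "no blue in either draw"
at_least_one_blue = is_blue.any(axis=1)

print(f"Simulated P(at least one blue): {at_least_one_blue.mean():.4f}")
print(f"Theoretical: 1 - 0.7 * 0.7 = {1 - 0.7 * 0.7:.4f}")
```

With 100,000 trials the simulated proportion lands very close to the theoretical 0.51, which is the relative frequency definition of probability doing exactly what Section 8.1 promised.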


8.9 The Birthday Problem Revisited

Let's return to the birthday puzzle from Section 8.1 — now you have the tools to solve it.

Setup: How many people do you need in a room for a greater than 50% chance that at least two share a birthday? (Assume 365 equally likely birthdays, ignore leap years.)

Strategy: Use the complement rule! It's much easier to calculate $P(\text{no matches})$ than $P(\text{at least one match})$.

$$P(\text{at least one match}) = 1 - P(\text{no matches})$$

Step 1: Person 1 walks in. Their birthday can be anything — no chance of a match yet.

Step 2: Person 2 walks in. For NO match, Person 2's birthday must avoid Person 1's birthday. That's 364 out of 365 possible days.

$$P(\text{no match with 2 people}) = \frac{364}{365}$$

Step 3: Person 3 walks in. For still no match, Person 3 must avoid both Person 1's and Person 2's birthdays. That's 363 out of 365.

$$P(\text{no match with 3 people}) = \frac{364}{365} \times \frac{363}{365}$$

The pattern continues. For $n$ people, all with different birthdays:

$$P(\text{no match with } n \text{ people}) = \frac{364}{365} \times \frac{363}{365} \times \frac{362}{365} \times \cdots \times \frac{365 - n + 1}{365}$$

$$= \prod_{k=1}^{n-1} \frac{365 - k}{365}$$

Let's compute this in Python:

```python
import numpy as np
import matplotlib.pyplot as plt

def birthday_probability(n):
    """Calculate P(at least one shared birthday) for n people."""
    if n > 365:
        return 1.0
    p_no_match = 1.0
    for k in range(1, n):
        p_no_match *= (365 - k) / 365
    return 1 - p_no_match

# Calculate for different group sizes
group_sizes = range(2, 61)
probabilities = [birthday_probability(n) for n in group_sizes]

# Find where probability crosses 50%
for n, p in zip(group_sizes, probabilities):
    if p >= 0.5:
        print(f"At n = {n} people, P(shared birthday) = {p:.4f}")
        print(f"At n = {n-1} people, P(shared birthday) = "
              f"{birthday_probability(n-1):.4f}")
        break

# Plot
plt.figure(figsize=(10, 5))
plt.plot(list(group_sizes), probabilities, color='steelblue', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', label='50% threshold')
plt.axvline(x=23, color='orange', linestyle='--', label='n = 23')
plt.xlabel('Number of People in Room', fontsize=12)
plt.ylabel('P(at least one shared birthday)', fontsize=12)
plt.title('The Birthday Problem', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print a table of key values
print(f"\n{'People':>8} {'P(match)':>12} {'Surprise?':>12}")
print("-" * 35)
for n in [5, 10, 15, 20, 23, 25, 30, 40, 50, 57]:
    p = birthday_probability(n)
    surprise = "50% CROSSED!" if n == 23 else ("99%!" if p > 0.99 else "")
    print(f"{n:>8} {p:>12.4f} {surprise:>12}")
```

Key results:

  • At 23 people: P(match) ≈ 0.5073 — just over 50%!
  • At 50 people: P(match) ≈ 0.9704 — over 97%!
  • At 57 people: P(match) ≈ 0.9900 — 99%!

The reason 23 feels so low is that we underestimate the number of pairs. With 23 people, there are $\binom{23}{2} = 253$ possible pairs to check. Each pair has only a $\frac{1}{365}$ chance of matching, but 253 chances add up quickly.
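That pairwise intuition can even be turned into a back-of-the-envelope estimate. If we pretend (not quite correctly, but harmlessly here) that the 253 pairs are independent, each failing to match with probability 364/365, we get a number remarkably close to the exact answer:

```python
from math import comb, prod

n = 23
pairs = comb(n, 2)  # number of possible pairs: C(23, 2) = 253

# Rough approximation: treat the 253 pairs as independent
approx = 1 - (364 / 365) ** pairs

# Exact calculation, as derived above
exact = 1 - prod((365 - k) / 365 for k in range(1, n))

print(f"Pairs to check:        {pairs}")
print(f"Approximate P(match):  {approx:.4f}")
print(f"Exact P(match):        {exact:.4f}")
```

The pairs aren't truly independent (if persons 1 and 2 match, that says something about person 3's chances), which is why the approximation is close but not exact. Still, it captures the key insight: the number of opportunities for a match grows much faster than the number of people.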


8.10 Probability and AI: Speaking the Language of Uncertainty

Theme Connection: AI Uses Probability Constantly (Theme 3)

Every time you use a modern AI system, probability is working behind the scenes. Here's how:

  • Spam filters calculate $P(\text{spam} \mid \text{words in email})$ — the probability an email is spam given the words it contains. We'll learn the full machinery (Bayes' theorem) in Chapter 9, but the foundation is right here in Chapter 8.

  • Recommendation engines (like StreamVibe's) estimate $P(\text{user watches movie B} \mid \text{user watched movie A})$. Alex's recommendation algorithm is essentially a giant probability machine.

  • Weather forecasts combine sensor data, historical patterns, and simulation models to produce statements like "30% chance of rain." Now you know that means: in the long run, on days with atmospheric conditions like today's, it rains about 30% of the time.

  • Medical AI systems calculate the probability that a medical image shows a tumor, that a patient's symptoms indicate a specific disease, or that a treatment will be effective for a particular patient profile.

  • Large language models (like ChatGPT) work by calculating the probability of each possible next word, given all the previous words. When an LLM writes a sentence, it's doing billions of probability calculations — choosing the most likely (or sometimes creatively less likely) next word at each step.

Understanding probability doesn't just make you better at statistics. It makes you a more informed citizen of a world increasingly shaped by probabilistic algorithms.

Theme Connection: Uncertainty Is the Point (Theme 4)

Here's a thought that might feel counterintuitive: embracing uncertainty is more honest and more useful than pretending to have certainty.

When a doctor says "there's a 15% chance this treatment will cause side effects," that's more helpful than "this treatment might cause side effects." The number quantifies the risk in a way that supports informed decisions.

When an election forecast says "Candidate A has a 70% chance of winning," that's more informative than "Candidate A is expected to win" — because it also tells you there's a very real 30% chance they won't. (After the 2016 U.S. presidential election, many people who saw "70-80% chance" misread it as "certainty." They hadn't internalized what probability really means.)

Probability gives us a language for uncertainty. Not the vague, hand-wavy "who knows?" kind of uncertainty — but the precise, quantifiable, decision-enabling kind. That's the gift of this chapter.


8.11 Progressive Project Checkpoint: Your Dataset's Contingency Tables

Time to apply probability to your own Data Detective Portfolio.

Your Task

  1. Identify two categorical variables in your dataset. (Remember from Chapter 2: categorical variables classify observations into groups. If your dataset is mostly numerical, consider binning a numerical variable into categories using the techniques from Chapter 7.)

  2. Create a contingency table showing the counts for every combination of the two variables.

```python
import pandas as pd

# Load your clean dataset (from Chapter 7)
df = pd.read_csv('your_clean_dataset.csv')

# Create a contingency table
contingency = pd.crosstab(df['variable_1'], df['variable_2'], margins=True)
print(contingency)

# Convert to proportions (joint probabilities)
prob_table = pd.crosstab(df['variable_1'], df['variable_2'],
                         margins=True, normalize='all')
print(prob_table.round(4))
```

  3. Calculate at least three probabilities from your table:
     • One marginal probability
     • One joint probability
     • One "or" probability using the addition rule

  4. Use the complement rule at least once. Calculate the probability of an event NOT happening.

  5. Write 2-3 sentences interpreting your probabilities in context. Don't just state the numbers — explain what they mean for your specific dataset and research question.

Example: What Good Output Looks Like

"In the BRFSS dataset, I created a contingency table crossing smoking status with general health rating. The probability of being a current smoker is 0.18 (18%), and the probability of reporting 'Poor' or 'Fair' health is 0.21 (21%). The joint probability of being both a current smoker AND reporting poor/fair health is 0.063 (6.3%). Using the addition rule, P(smoker or poor/fair health) = 0.18 + 0.21 - 0.063 = 0.327, meaning about one-third of respondents are either smokers, report poor/fair health, or both. Using the complement rule, P(non-smoker) = 1 - 0.18 = 0.82."


8.12 Chapter Summary

Let's take stock of what you've learned.

You now have a formal language for uncertainty. You know that probability can be defined three ways — classical (counting equally likely outcomes), relative frequency (learning from repeated data), and subjective (quantifying expert judgment) — and that all three follow the same mathematical rules.

You've learned those rules:

  • The complement rule ($P(A') = 1 - P(A)$) lets you flip a problem on its head.
  • The addition rule ($P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$) handles "or" questions.
  • The multiplication rule ($P(A \text{ and } B) = P(A) \times P(B)$ for independent events) handles "and" questions.

You've learned to build and interpret contingency tables, calculating marginal probabilities, joint probabilities, and applying the addition rule to real data.

You've seen the law of large numbers in action — the principle that probability estimates improve with more data — and you've debunked the gambler's fallacy that past random events influence future ones.

And perhaps most importantly, you've begun the shift toward probabilistic thinking: reasoning about patterns and tendencies rather than individual certainties. This shift is the foundation for everything in Parts 4, 5, 6, and 7.

What's Next

In Chapter 9, we'll take probability further by asking: What happens when you get new information? If you know a patient tested positive for a disease, how does that change the probability that they actually have the disease? The answer — Bayes' theorem — is one of the most powerful ideas in all of statistics, and it's the engine behind AI systems from spam filters to self-driving cars.

The foundation you've built in this chapter is exactly what you need to get there.


Key Formulas at a Glance

| Rule | Formula | When to Use |
|------|---------|-------------|
| Complement | $P(A') = 1 - P(A)$ | Finding "not A" is easier than "A" |
| Addition (mutually exclusive) | $P(A \text{ or } B) = P(A) + P(B)$ | A and B can't both happen |
| Addition (general) | $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$ | A and B might overlap |
| Multiplication (independent) | $P(A \text{ and } B) = P(A) \times P(B)$ | A and B don't influence each other |
| Classical probability | $P(A) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ | All outcomes equally likely |
| Relative frequency | $P(A) \approx \frac{\text{times A occurred}}{\text{total trials}}$ | You have data from repeated trials |