Key Takeaways: Distributions and the Normal Curve
This is your reference card for Chapter 21. These concepts are the foundation for everything in Chapters 22-27.
What Is a Probability Distribution?
A mathematical description of the probabilities of all possible outcomes. It's the idealized version of a histogram — the shape you'd approach with infinite data.
| Type | Can Take Values... | Described By | Example |
|---|---|---|---|
| Discrete | Countable: 0, 1, 2, 3, ... | PMF (probability mass function) | Coin flips, die rolls, defect counts |
| Continuous | Any value in a range | PDF (probability density function) | Height, temperature, time |
Key point: For continuous distributions, probabilities are AREAS under the curve, not heights.
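To make this concrete, here is a minimal sketch using a standard normal (mu = 0, sigma = 1, illustrative values): the PDF returns a curve height, not a probability, while areas come from the CDF.

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)  # standard normal (example parameters)

# The PDF gives a density (curve height), not a probability:
height = dist.pdf(0)               # ~0.3989, the peak of the bell

# Probabilities are AREAS, computed from the CDF:
p = dist.cdf(1) - dist.cdf(-1)     # area between -1 and 1, ~0.6827
print(height, p)
```

Note that `dist.pdf(0)` is not P(X = 0); for any continuous distribution, the probability of hitting one exact value is zero.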
The Distribution Zoo
| Distribution | Type | Use When | Parameters | Key Property |
|---|---|---|---|---|
| Uniform | Both | All outcomes equally likely | Low, High | Simplest distribution |
| Binomial | Discrete | Counting successes in n trials | n, p | Mean = np, Var = np(1-p) |
| Poisson | Discrete | Counting events in fixed interval | lambda | Mean = Var = lambda |
| Normal | Continuous | Symmetric, bell-shaped data | mu, sigma | The star of the show |
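The "Key Property" column can be verified directly in code. This sketch uses arbitrary example parameters (n = 20, p = 0.3, lambda = 4):

```python
from scipy import stats

# Binomial: n = 20 trials, success probability p = 0.3
n, p = 20, 0.3
b = stats.binom(n, p)
print(b.mean(), b.var())        # np = 6.0, np(1-p) = 4.2

# Poisson: lambda = 4 events per interval
pois = stats.poisson(4)
print(pois.mean(), pois.var())  # mean = var = 4.0
```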
The Normal Distribution
Defined by just two parameters:

- mu (mean) — the center of the bell
- sigma (standard deviation) — the width of the bell

Properties:

- Symmetric: mean = median = mode
- Bell-shaped with tails that extend to infinity
- Completely specified by mu and sigma
The Empirical Rule (68-95-99.7)
```
  |-------------------- 99.7% --------------------|
  |       |------------- 95% -------------|       |
  |       |       |----- 68% -----|       |       |
──┼───────┼───────┼───────┼───────┼───────┼───────┼──
μ-3σ    μ-2σ     μ-σ      μ      μ+σ    μ+2σ    μ+3σ
```
- 68% within 1 SD
- 95% within 2 SD
- 99.7% within 3 SD
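The rule's three numbers are just CDF areas, so they can be checked with a short computation on the standard normal:

```python
from scipy import stats

z = stats.norm(0, 1)  # standard normal
coverage = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]  # area within k SDs
for k, c in zip((1, 2, 3), coverage):
    print(f"within {k} SD: {c:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```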
Z-Scores
$$z = \frac{x - \mu}{\sigma}$$
"How many standard deviations is x from the mean?"
- z = 0: at the mean
- z = +1: one SD above (84th percentile)
- z = -1: one SD below (16th percentile)
- z = +2: two SDs above (97.7th percentile)
- |z| > 3: very unusual
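A worked example of the conversion, using an illustrative IQ-style scale (mu = 100, sigma = 15 are assumed values, not from the text):

```python
from scipy import stats

mu, sigma = 100, 15        # example scale
x = 130

z = (x - mu) / sigma       # z = 2.0: two SDs above the mean
pct = stats.norm.cdf(z)    # percentile, as a proportion
print(z, pct)              # 2.0, ~0.977 (the 97.7th percentile)
```

Going the other way, `mu + z * sigma` converts a z-score back to a raw score.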
Using scipy.stats
Every distribution follows the same pattern:
```python
from scipy import stats

mean, std = 10, 2  # example parameters
x = 12

# Create a distribution object
dist = stats.norm(loc=mean, scale=std)

# Four essential methods:
dist.pdf(x)        # Density at x (height of curve)
dist.cdf(x)        # P(X <= x) (area to the left)
dist.ppf(0.95)     # Value at percentile q (inverse CDF)
dist.rvs(size=10)  # Generate n random samples

# Common calculations:
a, b = 8, 12
P_below = dist.cdf(x)                  # P(X < x)
P_above = 1 - dist.cdf(x)              # P(X > x)
P_between = dist.cdf(b) - dist.cdf(a)  # P(a < X < b)
percentile_value = dist.ppf(0.95)      # What value is the 95th percentile?
```
The Central Limit Theorem
If you take random samples of size n from ANY population with finite mean and variance, the distribution of sample means approaches a normal distribution as n increases.
Three things to remember:

1. Works for ANY shape — uniform, skewed, bimodal, anything
2. n >= 30 is usually enough (more for heavily skewed data)
3. Standard error = sigma / sqrt(n) — larger samples give more precise means
```
Population (any shape) ──► Take many samples of size n ──► Compute mean of each
                                                                    │
                                                                    ▼
                                                       Distribution of means
                                                       is approximately NORMAL
                                                       with:
                                                         mean = population mean
                                                         SE   = sigma / sqrt(n)
```
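A quick simulation makes the CLT tangible. Here the population is deliberately skewed (an exponential with mean 1, an assumed example), yet the sample means behave as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 30, 10_000

# Draw 10,000 samples of size 30 from a skewed population,
# then compute the mean of each sample.
means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(means.mean())       # close to the population mean (1.0)
print(means.std(ddof=1))  # close to SE = sigma/sqrt(n) = 1/sqrt(30) ~ 0.183
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is heavily right-skewed.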
Checking Normality
| Method | How It Works | Verdict |
|---|---|---|
| Histogram | Visual — does it look bell-shaped? | Quick but subjective |
| Q-Q Plot | Compare quantiles to theoretical normal | Best visual diagnostic |
| Shapiro-Wilk | Formal hypothesis test | Rejects normality too easily for large n |
Reading a Q-Q plot:

- Points on the line = normal
- Curves at the ends = heavy/light tails
- Curves in one direction = skewed
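A sketch of both checks on simulated data (the sample size of 200 and the exponential alternative are illustrative choices). Note the table's caveat: with very large n, Shapiro-Wilk flags even trivial departures from normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=200)
skewed_data = rng.exponential(size=200)

# Shapiro-Wilk: a small p-value means "reject normality"
_, p_norm = stats.shapiro(normal_data)   # usually large for normal data
_, p_skew = stats.shapiro(skewed_data)   # tiny for skewed data
print(p_norm, p_skew)

# Q-Q plot (pass a matplotlib Axes to display it):
# stats.probplot(normal_data, dist="norm", plot=ax)
```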
PDF vs. CDF at a Glance
| | PDF | CDF |
|---|---|---|
| What it shows | Density at each point | Cumulative probability up to each point |
| Y-axis range | 0 to any positive value | 0 to 1 |
| Shape for normal | Bell curve | S-curve (sigmoid) |
| How to get P(a < X < b) | Area under curve from a to b | CDF(b) - CDF(a) |
What You Should Be Able to Do Now
- [ ] Distinguish discrete from continuous distributions
- [ ] Describe when to use uniform, binomial, Poisson, and normal distributions
- [ ] Apply the empirical rule (68-95-99.7) for normal data
- [ ] Convert between raw scores and z-scores
- [ ] Use `scipy.stats` to compute PDF, CDF, inverse CDF, and random samples
- [ ] Explain the Central Limit Theorem in plain English
- [ ] Demonstrate the CLT through simulation
- [ ] Assess normality using Q-Q plots and the Shapiro-Wilk test
- [ ] Know when the normal distribution is and isn't appropriate
If all of these feel solid, you have the foundation for confidence intervals (Chapter 22) and hypothesis testing (Chapter 23). The CLT and normal distribution are the engine that powers statistical inference.