Key Takeaways: Distributions and the Normal Curve

This is your reference card for Chapter 21. These concepts are the foundation for everything in Chapters 22-27.


What Is a Probability Distribution?

A mathematical description of the probabilities of all possible outcomes. It's the idealized version of a histogram — the shape you'd approach with infinite data.

Type         Can Take Values...         Described By                        Example
Discrete     Countable: 0, 1, 2, 3, ... PMF (probability mass function)     Coin flips, die rolls, defect counts
Continuous   Any value in a range       PDF (probability density function)  Height, temperature, time

Key point: For continuous distributions, probabilities are AREAS under the curve, not heights.
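One consequence worth internalizing: the height of a PDF is a density, not a probability, and it can exceed 1. A quick sketch with scipy (the narrow normal here is just an illustrative choice):

```python
from scipy import stats

# A narrow normal: the density at the center exceeds 1,
# so curve heights cannot be probabilities
narrow = stats.norm(loc=0, scale=0.1)
peak = narrow.pdf(0)                       # about 3.99 -- a density, not a probability

# Probabilities are areas: P(-0.1 < X < 0.1) for this distribution
area = narrow.cdf(0.1) - narrow.cdf(-0.1)  # about 0.68 -- a valid probability
```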


The Distribution Zoo

Distribution  Type        Use When                             Parameters  Key Property
Uniform       Both        All outcomes equally likely          low, high   Simplest distribution
Binomial      Discrete    Counting successes in n trials       n, p        Mean = np, Var = np(1-p)
Poisson       Discrete    Counting events in a fixed interval  lambda      Mean = Var = lambda
Normal        Continuous  Symmetric, bell-shaped data          mu, sigma   The star of the show

The Normal Distribution

Defined by just two parameters:

  • mu (mean) — the center of the bell
  • sigma (standard deviation) — the width of the bell

Properties:

  • Symmetric: mean = median = mode
  • Bell-shaped, with tails that extend to infinity
  • Completely specified by mu and sigma

The Empirical Rule (68-95-99.7)

       |------------------ 99.7% ------------------|
       |      |------------ 95% ------------|      |
       |      |      |----- 68% -----|      |      |
  ─────┼──────┼──────┼───────┼───────┼──────┼──────┼─────
     μ-3σ   μ-2σ    μ-σ      μ      μ+σ    μ+2σ   μ+3σ

  • 68% of values within 1 SD of the mean
  • 95% within 2 SD
  • 99.7% within 3 SD
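The three percentages are not arbitrary; they fall straight out of the standard normal CDF, as this quick check shows:

```python
from scipy import stats

z = stats.norm(0, 1)   # standard normal
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)   # P(mean - k*SD < X < mean + k*SD)
    print(k, round(within, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```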

Z-Scores

$$z = \frac{x - \mu}{\sigma}$$

"How many standard deviations is x from the mean?"

  • z = 0: at the mean
  • z = +1: one SD above (84th percentile)
  • z = -1: one SD below (16th percentile)
  • z = +2: two SDs above (97.7th percentile)
  • |z| > 3: very unusual
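Converting a raw score to a z-score and then to a percentile takes two lines (the mean-100, SD-15 scale below is a hypothetical example):

```python
from scipy import stats

# Hypothetical example: scores with mean 100 and SD 15
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma             # 2.0 -- two SDs above the mean
percentile = stats.norm.cdf(z)   # about 0.977 -> roughly the 97.7th percentile
```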

Using scipy.stats

Every distribution follows the same pattern:

from scipy import stats

# Create a distribution object (example values: mean = 100, std = 15)
mean, std = 100, 15
dist = stats.norm(loc=mean, scale=std)

# Four essential methods (x, q, n are example inputs):
x, q, n = 120, 0.95, 5
dist.pdf(x)       # Density at x (height of curve)
dist.cdf(x)       # P(X <= x) (area to the left)
dist.ppf(q)       # Value at percentile q (inverse CDF)
dist.rvs(size=n)  # Generate n random samples

# Common calculations:
a, b = 90, 120
P_below = dist.cdf(x)                  # P(X <= x)
P_above = 1 - dist.cdf(x)              # P(X > x)
P_between = dist.cdf(b) - dist.cdf(a)  # P(a < X < b)
percentile_value = dist.ppf(0.95)      # What value is the 95th percentile?
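The same four methods work for the discrete distributions too; the only change is `pmf()` in place of `pdf()`. A sketch with a binomial (10 fair coin flips):

```python
from scipy import stats

flips = stats.binom(n=10, p=0.5)   # 10 fair coin flips

flips.pmf(5)          # P(exactly 5 heads), about 0.246
flips.cdf(3)          # P(3 or fewer heads), about 0.172
flips.ppf(0.5)        # median number of heads
flips.rvs(size=100)   # simulate 100 runs of 10 flips
```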

The Central Limit Theorem

If you take random samples of size n from ANY population with finite mean and variance, the distribution of sample means approaches a normal distribution as n increases.

Three things to remember:

  1. Works for ANY shape — uniform, skewed, bimodal, anything
  2. n >= 30 is usually enough (more for heavily skewed data)
  3. Standard error = sigma / sqrt(n) — larger samples give more precise means

Population (any shape) ──► Take many samples of size n ──► Compute mean of each
                                                                    |
                                                                    v
                                                           Distribution of means
                                                           is approximately NORMAL
                                                           with:
                                                             mean = population mean
                                                             SE = sigma / sqrt(n)
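The diagram above can be reproduced in a few lines of simulation; the exponential population here is just one deliberately skewed example, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed population: exponential with mean 1 (its SD is also 1)
n, n_samples = 30, 10_000
means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(means.mean())   # close to the population mean, 1.0
print(means.std())    # close to SE = sigma / sqrt(n), about 0.18
```

Plotting a histogram of `means` shows an approximately normal bell, even though the population is strongly right-skewed.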

Checking Normality

Method        How It Works                               Verdict
Histogram     Visual — does it look bell-shaped?         Quick but subjective
Q-Q Plot      Compare quantiles to a theoretical normal  Best visual diagnostic
Shapiro-Wilk  Formal hypothesis test                     Rejects normality too easily for large n

Reading a Q-Q plot:

  • Points on the line = approximately normal
  • Curves at the ends = heavy/light tails
  • A bend in one direction = skewed
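Both diagnostics are available in scipy (the simulated datasets below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=200)
skewed_data = rng.exponential(size=200)

# Shapiro-Wilk: the null hypothesis is "the data are normal",
# so a SMALL p-value is evidence AGAINST normality
print(stats.shapiro(skewed_data).pvalue < 0.05)   # True: normality rejected
print(stats.shapiro(normal_data).pvalue)          # typically large

# Q-Q plot coordinates; pass plot=plt to draw it with matplotlib
(osm, osr), (slope, intercept, r) = stats.probplot(normal_data)
```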


PDF vs. CDF at a Glance

                         PDF                           CDF
What it shows            Density at each point         Cumulative probability up to each point
Y-axis range             0 to any positive value       0 to 1
Shape for normal         Bell curve                    S-curve (sigmoid)
How to get P(a < X < b)  Area under curve from a to b  CDF(b) - CDF(a)
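One relationship the table leaves implicit: the PDF is the derivative of the CDF. A numerical check (standard normal, arbitrary point):

```python
from scipy import stats

dist = stats.norm(0, 1)
x, h = 0.5, 1e-6

# The central-difference slope of the CDF should match the PDF
slope = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)
print(abs(slope - dist.pdf(x)) < 1e-5)   # True

print(dist.cdf(10))   # the CDF levels off near 1 in the far right tail
```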

What You Should Be Able to Do Now

  • [ ] Distinguish discrete from continuous distributions
  • [ ] Describe when to use uniform, binomial, Poisson, and normal distributions
  • [ ] Apply the empirical rule (68-95-99.7) for normal data
  • [ ] Convert between raw scores and z-scores
  • [ ] Use scipy.stats to compute PDF, CDF, inverse CDF, and random samples
  • [ ] Explain the Central Limit Theorem in plain English
  • [ ] Demonstrate the CLT through simulation
  • [ ] Assess normality using Q-Q plots and the Shapiro-Wilk test
  • [ ] Know when the normal distribution is and isn't appropriate

If all of these feel solid, you have the foundation for confidence intervals (Chapter 22) and hypothesis testing (Chapter 23). The CLT and normal distribution are the engine that powers statistical inference.