Key Takeaways: Distributions and the Normal Curve
This is your reference card for Chapter 21. These concepts are the foundation for everything in Chapters 22-27.
What Is a Probability Distribution?
A mathematical description of the probabilities of all possible outcomes. It's the idealized version of a histogram — the shape you'd approach with infinite data.
| Type | Can Take Values... | Described By | Example |
|---|---|---|---|
| Discrete | Countable: 0, 1, 2, 3, ... | PMF (probability mass function) | Coin flips, die rolls, defect counts |
| Continuous | Any value in a range | PDF (probability density function) | Height, temperature, time |
Key point: For continuous distributions, probabilities are AREAS under the curve, not heights.
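To make this concrete, here is a minimal sketch using a standard normal (mu = 0, sigma = 1, illustrative values): the PDF returns a curve height, not a probability, while areas come from the CDF.

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)  # standard normal (example parameters)

# The PDF gives a density (curve height), not a probability:
height = dist.pdf(0)               # ~0.3989, the peak of the bell

# Probabilities are AREAS, computed from the CDF:
p = dist.cdf(1) - dist.cdf(-1)     # area between -1 and 1, ~0.6827
print(height, p)
```

Note that `dist.pdf(0)` is not P(X = 0); for any continuous distribution, the probability of hitting one exact value is zero.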
The Distribution Zoo
| Distribution | Type | Use When | Parameters | Key Property |
|---|---|---|---|---|
| Uniform | Both | All outcomes equally likely | Low, High | Simplest distribution |
| Binomial | Discrete | Counting successes in n trials | n, p | Mean = np, Var = np(1-p) |
| Poisson | Discrete | Counting events in fixed interval | lambda | Mean = Var = lambda |
| Normal | Continuous | Symmetric, bell-shaped data | mu, sigma | The star of the show |
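The "Key Property" column can be verified directly in code. This sketch uses arbitrary example parameters (n = 20, p = 0.3, lambda = 4):

```python
from scipy import stats

# Binomial: n = 20 trials, success probability p = 0.3
n, p = 20, 0.3
b = stats.binom(n, p)
print(b.mean(), b.var())        # np = 6.0, np(1-p) = 4.2

# Poisson: lambda = 4 events per interval
pois = stats.poisson(4)
print(pois.mean(), pois.var())  # mean = var = 4.0
```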
The Normal Distribution
Defined by just two parameters:

- mu (mean) — the center of the bell
- sigma (standard deviation) — the width of the bell

Properties:

- Symmetric: mean = median = mode
- Bell-shaped with tails that extend to infinity
- Completely specified by mu and sigma
The Empirical Rule (68-95-99.7)
```
  |-------------------- 99.7% --------------------|
  |       |------------- 95% -------------|       |
  |       |       |----- 68% -----|       |       |
──┼───────┼───────┼───────┼───────┼───────┼───────┼──
μ-3σ    μ-2σ     μ-σ      μ      μ+σ    μ+2σ    μ+3σ
```
- 68% within 1 SD
- 95% within 2 SD
- 99.7% within 3 SD
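The rule's three numbers are just CDF areas, so they can be checked with a short computation on the standard normal:

```python
from scipy import stats

z = stats.norm(0, 1)  # standard normal
coverage = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]  # area within k SDs
for k, c in zip((1, 2, 3), coverage):
    print(f"within {k} SD: {c:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```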
Z-Scores
$$z = \frac{x - \mu}{\sigma}$$
"How many standard deviations is x from the mean?"
- z = 0: at the mean
- z = +1: one SD above (84th percentile)
- z = -1: one SD below (16th percentile)
- z = +2: two SDs above (97.7th percentile)
- |z| > 3: very unusual
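A worked example of the conversion, using an illustrative IQ-style scale (mu = 100, sigma = 15 are assumed values, not from the text):

```python
from scipy import stats

mu, sigma = 100, 15        # example scale
x = 130

z = (x - mu) / sigma       # z = 2.0: two SDs above the mean
pct = stats.norm.cdf(z)    # percentile, as a proportion
print(z, pct)              # 2.0, ~0.977 (the 97.7th percentile)
```

Going the other way, `mu + z * sigma` converts a z-score back to a raw score.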
Using scipy.stats
Every distribution follows the same pattern:
```python
from scipy import stats

mean, std = 10, 2  # example parameters
x = 12

# Create a distribution object
dist = stats.norm(loc=mean, scale=std)

# Four essential methods:
dist.pdf(x)        # Density at x (height of curve)
dist.cdf(x)        # P(X <= x) (area to the left)
dist.ppf(0.95)     # Value at percentile q (inverse CDF)
dist.rvs(size=10)  # Generate n random samples

# Common calculations:
a, b = 8, 12
P_below = dist.cdf(x)                  # P(X < x)
P_above = 1 - dist.cdf(x)              # P(X > x)
P_between = dist.cdf(b) - dist.cdf(a)  # P(a < X < b)
percentile_value = dist.ppf(0.95)      # What value is the 95th percentile?
```
The Central Limit Theorem
If you take random samples of size n from ANY population with finite mean and variance, the distribution of sample means approaches a normal distribution as n increases.
Three things to remember:

1. Works for ANY shape — uniform, skewed, bimodal, anything
2. n >= 30 is usually enough (more for heavily skewed data)
3. Standard error = sigma / sqrt(n) — larger samples give more precise means
```
Population (any shape) ──► Take many samples of size n ──► Compute mean of each
                                                                    │
                                                                    ▼
                                                       Distribution of means
                                                       is approximately NORMAL
                                                       with:
                                                         mean = population mean
                                                         SE   = sigma / sqrt(n)
```
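A quick simulation makes the CLT tangible. Here the population is deliberately skewed (an exponential with mean 1, an assumed example), yet the sample means behave as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 30, 10_000

# Draw 10,000 samples of size 30 from a skewed population,
# then compute the mean of each sample.
means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(means.mean())       # close to the population mean (1.0)
print(means.std(ddof=1))  # close to SE = sigma/sqrt(n) = 1/sqrt(30) ~ 0.183
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is heavily right-skewed.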
Checking Normality
| Method | How It Works | Verdict |
|---|---|---|
| Histogram | Visual — does it look bell-shaped? | Quick but subjective |
| Q-Q Plot | Compare quantiles to theoretical normal | Best visual diagnostic |
| Shapiro-Wilk | Formal hypothesis test | Rejects normality too easily for large n |
Reading a Q-Q plot:

- Points on the line = normal
- Curves at the ends = heavy/light tails
- Curves in one direction = skewed
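A sketch of both checks on simulated data (the sample size of 200 and the exponential alternative are illustrative choices). Note the table's caveat: with very large n, Shapiro-Wilk flags even trivial departures from normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=200)
skewed_data = rng.exponential(size=200)

# Shapiro-Wilk: a small p-value means "reject normality"
_, p_norm = stats.shapiro(normal_data)   # usually large for normal data
_, p_skew = stats.shapiro(skewed_data)   # tiny for skewed data
print(p_norm, p_skew)

# Q-Q plot (pass a matplotlib Axes to display it):
# stats.probplot(normal_data, dist="norm", plot=ax)
```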
PDF vs. CDF at a Glance
| | PDF | CDF |
|---|---|---|
| What it shows | Density at each point | Cumulative probability up to each point |
| Y-axis range | 0 to any positive value | 0 to 1 |
| Shape for normal | Bell curve | S-curve (sigmoid) |
| How to get P(a < X < b) | Area under curve from a to b | CDF(b) - CDF(a) |
What You Should Be Able to Do Now
- [ ] Distinguish discrete from continuous distributions
- [ ] Describe when to use uniform, binomial, Poisson, and normal distributions
- [ ] Apply the empirical rule (68-95-99.7) for normal data
- [ ] Convert between raw scores and z-scores
- [ ] Use `scipy.stats` to compute PDF, CDF, inverse CDF, and random samples
- [ ] Explain the Central Limit Theorem in plain English
- [ ] Demonstrate the CLT through simulation
- [ ] Assess normality using Q-Q plots and the Shapiro-Wilk test
- [ ] Know when the normal distribution is and isn't appropriate
If all of these feel solid, you have the foundation for confidence intervals (Chapter 22) and hypothesis testing (Chapter 23). The CLT and normal distribution are the engine that powers statistical inference.