Chapter 21 Exercises: Distributions and the Normal Curve
How to use these exercises: Each section builds in complexity. For simulation exercises, always compare your simulated results with analytical (formula-based) results when possible — the agreement between them builds confidence in both approaches.
Difficulty key: Foundational | Intermediate | Advanced | Extension
Part A: Conceptual Understanding (Foundational)
Exercise 21.1 — Discrete vs. continuous
Classify each of the following random variables as discrete or continuous. For each, name a distribution that might be appropriate.
(a) The number of emails you receive in an hour (b) The height of a randomly selected adult (c) The number of heads in 20 coin flips (d) The time until the next bus arrives (e) The number of customers who buy a product today (f) A student's GPA
Guidance
(a) Discrete — Poisson distribution (counting events in a fixed interval). (b) Continuous — Normal distribution (symmetric, bell-shaped). (c) Discrete — Binomial distribution (fixed trials, two outcomes). (d) Continuous — Exponential distribution (time until an event). (e) Discrete — Poisson or Binomial depending on whether there's a fixed population. (f) Technically continuous (it can take values like 3.47), though in practice it's computed from discrete grades; modeling it as continuous is usually reasonable.
Exercise 21.2 — The empirical rule
The scores on a standardized exam are normally distributed with a mean of 500 and a standard deviation of 100.
(a) What range contains the middle 68% of scores? (b) What percentage of students score above 700? (c) What percentage score between 300 and 700? (d) A score of 200 is how many standard deviations below the mean? Is this score unusual? (e) If 10,000 students take the exam, approximately how many score above 800?
Guidance
(a) 400 to 600 (mean plus/minus 1 SD). (b) 700 is 2 SDs above the mean. About 2.5% score above 700 (half of the 5% outside 2 SDs). (c) 300-700 is mean plus/minus 2 SDs = about 95%. (d) z = (200-500)/100 = -3. Three standard deviations below — very unusual (only about 0.15% of scores are this low). (e) 800 is 3 SDs above. About 0.15% score above 800. 10,000 * 0.0015 ≈ 15 students.
Exercise 21.3 — PDF vs. CDF
(a) In your own words, explain what the PDF tells you and what the CDF tells you. (b) If the PDF is tall at a particular x-value, what does that mean? (c) The CDF at x=70 is 0.84. What does this mean in words? (d) How do you use the CDF to find P(60 < X < 80)? (e) For a continuous distribution, why is P(X = exactly 3.7) equal to zero?
Guidance
(a) PDF tells you where probability is concentrated (density). CDF tells you the cumulative probability up to a given point — P(X <= x). (b) It means probability is concentrated there — values near x are more likely than values where the PDF is low. But the height itself is NOT a probability. (c) 84% of the values are less than or equal to 70. (d) P(60 < X < 80) = CDF(80) - CDF(60). (e) Because for a continuous variable, there are infinitely many possible values. The probability of hitting any exact value is zero — probabilities only exist for ranges (intervals).
Exercise 21.4 — Z-score interpretation
Jordan scored 78 on a test where the mean was 72 and the standard deviation was 8. Priya scored 85 on a different test where the mean was 80 and the standard deviation was 10.
(a) Compute z-scores for both. (b) Who performed better relative to their class? (c) Assuming normal distributions, what percentile is each student at? (d) What score would Jordan need to be at the 95th percentile?
Guidance
(a) Jordan: z = (78-72)/8 = 0.75. Priya: z = (85-80)/10 = 0.50. (b) Jordan performed better relative to their class (higher z-score). (c) Jordan: stats.norm.cdf(0.75) ≈ 0.773 → 77th percentile. Priya: stats.norm.cdf(0.50) ≈ 0.691 → 69th percentile. (d) The 95th percentile z-score is 1.645. Score = 72 + 1.645 * 8 = 85.2.
Exercise 21.5 — CLT in plain English
Explain the Central Limit Theorem to a friend who hasn't taken a statistics class. Use an analogy or example. Make sure to address these points: (a) Does the population need to be normal for the CLT to work? (b) What role does sample size play? (c) Why is this theorem important for surveys and polls?
Guidance
No, the population does NOT need to be normal — that's the magic. Even if individual values follow a weird distribution, the average of a sample tends toward a normal shape. Larger samples produce better bell curves. This matters for polls because it means we can estimate population characteristics from samples and quantify our uncertainty — even if the underlying opinions are not normally distributed.
Exercise 21.6 — Matching distributions
Match each scenario with the most appropriate distribution:
| Scenario | Distribution |
|---|---|
| Height of adult women | |
| Number of typos on a page | |
| Number of people (out of 50) who buy a product (yes/no each) | |
| Temperature tomorrow (in degrees) | |
| Random number between 1 and 100 | |
| Number of cars passing a toll booth per hour |
Options: Normal, Binomial, Poisson, Uniform
Guidance
Height → Normal. Typos → Poisson. 50 purchases → Binomial. Temperature → Normal. Random number 1-100 → Uniform. Cars per hour → Poisson.
Exercise 21.7 — Reading Q-Q plots
Describe what the following Q-Q plot patterns indicate about the data: (a) Points fall exactly on the diagonal line (b) Points curve upward at the right end and downward at the left end (c) Points curve upward at both ends (d) Points follow the line in the middle but deviate at both tails
Guidance
(a) Data is approximately normal. (b) Data is right-skewed — the right tail is heavier than normal. (c) Data has heavier tails than normal (leptokurtic) — more extreme values in both directions. (d) Same as (c) — heavy tails. This is common in financial data.
Exercise 21.8 — Binomial in context
A vaccine is 88% effective. In a group of 200 vaccinated people exposed to the disease: (a) What is the expected number who develop immunity? (b) What is the standard deviation? (c) What is the probability that fewer than 170 develop immunity? (d) Is it reasonable to approximate this with a normal distribution? Why?
Guidance
(a) E(X) = n*p = 200 * 0.88 = 176. (b) SD = sqrt(n*p*(1-p)) = sqrt(200 * 0.88 * 0.12) = sqrt(21.12) ≈ 4.6. (c) Use stats.binom.cdf(169, 200, 0.88) ≈ 0.08 (about 8%). (d) Yes — when n*p and n*(1-p) are both > 5, the normal approximation to the binomial is reasonable. Here n*p = 176 and n*(1-p) = 24, both well above 5.
Part B: Applied Coding (Intermediate)
Exercise 21.9 — scipy.stats exploration
Using scipy.stats.norm, compute the following for a normal distribution with mean 100 and standard deviation 15:
(a) P(X < 85)
(b) P(X > 120)
(c) P(90 < X < 110)
(d) The value at the 25th percentile
(e) The value at the 99th percentile
(f) Generate 1000 random samples and verify that the sample mean is close to 100
from scipy import stats
dist = stats.norm(100, 15)
# Use dist.cdf(), dist.ppf(), dist.rvs() to answer each part
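Guidance
One way to put `dist.cdf()`, `dist.ppf()`, and `dist.rvs()` to work (the `random_state` seed is just for reproducibility):

```python
from scipy import stats

dist = stats.norm(100, 15)

p_below_85 = dist.cdf(85)                      # (a) P(X < 85)
p_above_120 = 1 - dist.cdf(120)                # (b) P(X > 120)
p_between = dist.cdf(110) - dist.cdf(90)       # (c) P(90 < X < 110)
q25 = dist.ppf(0.25)                           # (d) 25th percentile
q99 = dist.ppf(0.99)                           # (e) 99th percentile
samples = dist.rvs(size=1000, random_state=0)  # (f) sample mean should be near 100
print(p_below_85, p_above_120, p_between, q25, q99, samples.mean())
```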
Exercise 21.10 — Distribution fitting
Generate 1000 samples from each of these distributions: normal, exponential, and uniform. For each: (a) Plot a histogram (b) Overlay the theoretical PDF (c) Create a Q-Q plot against the normal distribution (d) Run the Shapiro-Wilk test
Present all results in a 3x3 grid of subplots.
Exercise 21.11 — CLT demonstration
Write a function demonstrate_clt(population, sample_sizes, n_samples) that:
(a) Takes any population array and a list of sample sizes
(b) For each sample size, draws n_samples random samples and computes their means
(c) Plots the distribution of sample means for each sample size
(d) Overlays the theoretical normal curve predicted by the CLT
(e) Prints the mean and standard error for each sample size
Test it with three different population shapes: uniform, exponential, and bimodal.
Guidance
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def demonstrate_clt(population, sample_sizes, n_samples=5000):
    pop_mean = np.mean(population)
    pop_std = np.std(population)
    fig, axes = plt.subplots(1, len(sample_sizes), figsize=(5 * len(sample_sizes), 4))
    for ax, n in zip(axes, sample_sizes):
        # Draw n_samples samples of size n and record each sample's mean
        means = [np.mean(np.random.choice(population, n)) for _ in range(n_samples)]
        se = pop_std / np.sqrt(n)  # theoretical standard error from the CLT
        ax.hist(means, bins=40, density=True, alpha=0.7)
        x = np.linspace(min(means), max(means), 100)
        ax.plot(x, stats.norm.pdf(x, pop_mean, se), 'r-', linewidth=2)
        ax.set_title(f'n={n}')
        print(f'n={n}: mean of sample means = {np.mean(means):.3f}, standard error = {se:.3f}')
    plt.tight_layout()
    plt.show()
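For testing with three population shapes, one option is to build the populations like this (the sizes and parameters here are arbitrary choices, not prescribed by the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
uniform_pop = rng.uniform(0, 10, size=100_000)
exponential_pop = rng.exponential(scale=2.0, size=100_000)
# Bimodal: two well-separated normal humps glued together
bimodal_pop = np.concatenate([rng.normal(20, 3, 50_000),
                              rng.normal(50, 3, 50_000)])
# Each would then be passed to the function, e.g.:
# demonstrate_clt(exponential_pop, sample_sizes=[5, 30, 100])
print(uniform_pop.mean(), exponential_pop.mean(), bimodal_pop.mean())
```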
Exercise 21.12 — Binomial simulation vs. formula
A fair coin is flipped 100 times. Using both simulation (flip 100 coins 10,000 times) and scipy.stats.binom:
(a) Estimate P(exactly 50 heads)
(b) Estimate P(45 or fewer heads)
(c) Estimate P(between 45 and 55 heads, inclusive)
(d) Compare the binomial probabilities to the normal approximation N(50, 5)
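Guidance
A sketch of both approaches side by side (the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulation: 10,000 experiments of 100 fair-coin flips each
heads = rng.binomial(100, 0.5, size=10_000)
sim_exact_50 = np.mean(heads == 50)                    # (a)
sim_at_most_45 = np.mean(heads <= 45)                  # (b)
sim_45_to_55 = np.mean((heads >= 45) & (heads <= 55))  # (c)

# Formula: scipy.stats.binom gives the exact answers
binom = stats.binom(100, 0.5)
exact_50 = binom.pmf(50)
at_most_45 = binom.cdf(45)
between = binom.cdf(55) - binom.cdf(44)

# (d) Normal approximation N(50, 5), with a continuity correction
approx_at_most_45 = stats.norm(50, 5).cdf(45.5)
print(exact_50, at_most_45, between, approx_at_most_45)
```

The simulated estimates should agree with the exact binomial values to within sampling noise.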
Exercise 21.13 — Poisson modeling
Marcus's bakery averages 42 customers per hour during the morning rush (7-9 AM). (a) Model this using a Poisson distribution. What's the probability of getting 50+ customers in a given hour? (b) What's the probability of getting fewer than 30? (c) Simulate 1000 hours and compare the simulated distribution to the theoretical Poisson (d) Marcus can handle up to 55 customers per hour before service quality drops. How often will this happen?
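Guidance
A sketch of the theoretical side plus the simulation (the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rate = 42  # average customers per hour during the rush
pois = stats.poisson(rate)

p_50_plus = 1 - pois.cdf(49)  # (a) P(X >= 50)
p_under_30 = pois.cdf(29)     # (b) P(X < 30)

# (c) Simulate 1000 hours; the histogram of `hours` should track pois.pmf
rng = np.random.default_rng(1)
hours = rng.poisson(rate, size=1000)

p_over_55 = 1 - pois.cdf(55)  # (d) how often demand exceeds the 55-customer capacity
print(p_50_plus, p_under_30, p_over_55, hours.mean())
```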
Exercise 21.14 — Z-score probability calculator
Write a function normal_probability(mean, std, lower=None, upper=None) that computes:
- P(X < upper) if only upper is given
- P(X > lower) if only lower is given
- P(lower < X < upper) if both are given
It should also print a plain-English interpretation. Test with real-world examples.
Guidance
from scipy import stats

def normal_probability(mean, std, lower=None, upper=None):
    dist = stats.norm(mean, std)
    if lower is not None and upper is not None:
        prob = dist.cdf(upper) - dist.cdf(lower)
        print(f"P({lower} < X < {upper}) = {prob:.4f} ({prob*100:.1f}%)")
    elif upper is not None:
        prob = dist.cdf(upper)
        print(f"P(X < {upper}) = {prob:.4f} ({prob*100:.1f}%)")
    elif lower is not None:
        prob = 1 - dist.cdf(lower)
        print(f"P(X > {lower}) = {prob:.4f} ({prob*100:.1f}%)")
    else:
        raise ValueError("Provide lower, upper, or both")
    return prob

# Example: normal_probability(100, 15, upper=85) prints P(X < 85) = 0.1587
Exercise 21.15 — Standard error exploration
Using simulation, demonstrate that the standard error of the mean equals sigma/sqrt(n): (a) Generate a population with known sigma (b) For sample sizes n = 5, 10, 25, 50, 100, 500, take 5000 samples of each size (c) Compute the standard deviation of the 5000 sample means for each n (d) Plot the simulated standard errors alongside the theoretical sigma/sqrt(n) (e) How well do they match?
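Guidance
A sketch using an exponential population (chosen arbitrarily as a clearly non-normal shape with known sigma):

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=3.0, size=200_000)  # (a) known-sigma population
sigma = population.std()

sample_sizes = [5, 10, 25, 50, 100, 500]
simulated = []
theoretical = []
for n in sample_sizes:
    # (b)-(c) 5000 samples of size n; SD of their means is the simulated SE
    means = rng.choice(population, size=(5000, n)).mean(axis=1)
    simulated.append(means.std())
    theoretical.append(sigma / np.sqrt(n))  # the CLT prediction

# (d)-(e) print (or plot) the two side by side; they should match closely
for n, s, t in zip(sample_sizes, simulated, theoretical):
    print(f"n={n:4d}  simulated SE = {s:.4f}  theoretical SE = {t:.4f}")
```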
Exercise 21.16 — Comparing distributions visually
Create a figure showing four distributions on the same axes: N(0,1), N(0,2), N(3,1), and N(3,2). Use different colors and a legend. Explain how changing the mean shifts the curve and how changing the standard deviation changes the width.
Exercise 21.17 — Real data normality check
Load a real dataset (suggestions: seaborn's 'tips' dataset, or any CSV with numerical columns). Pick two numerical columns. For each: (a) Plot a histogram with normal overlay (b) Create a Q-Q plot (c) Run the Shapiro-Wilk test (d) Compute skewness and kurtosis (e) Conclude whether the normal distribution is a reasonable model
Exercise 21.18 — Inverse CDF problems
Using scipy.stats.norm.ppf():
(a) What IQ score puts you in the top 1%? (IQ: mean=100, SD=15)
(b) What height is the 10th percentile for adult women? (Mean=64 inches, SD=2.5 inches)
(c) A quality control process rejects items whose weight deviates by more than 2% from the target of 500g (SD=5g). What weight range is acceptable?
(d) Find the z-score that marks the boundary of the top 5%
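Guidance
One reading of the four parts (note that in (c) the acceptable band is fixed by the 2% rule itself, 490g to 510g; the normal model then tells you how often items fall outside it):

```python
from scipy import stats

top_1_iq = stats.norm(100, 15).ppf(0.99)    # (a) IQ cutoff for the top 1%
p10_height = stats.norm(64, 2.5).ppf(0.10)  # (b) 10th-percentile height

# (c) Acceptable range: 500 * 0.98 = 490g to 500 * 1.02 = 510g.
# Fraction rejected under N(500, 5), using the symmetry of the two tails:
reject_rate = 2 * stats.norm(500, 5).cdf(490)

z_top_5 = stats.norm.ppf(0.95)              # (d) z marking the top 5%
print(top_1_iq, p10_height, reject_rate, z_top_5)
```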
Part C: Synthesis and Real-World Application (Advanced)
Exercise 21.19 — Elena's complete distribution analysis
Using the progressive project vaccination data, perform a complete distribution analysis: (a) Fit normal distributions to vaccination rates for each income group (b) Create Q-Q plots for each group (c) For groups where normality holds, use the fitted normal to answer: "What percentage of countries in this income group have vaccination rates below 70%?" (d) For groups where normality doesn't hold, suggest a better model and explain why
Exercise 21.20 — When the CLT breaks down
The CLT requires the population to have a finite mean and variance. Find a distribution where these conditions are violated (hint: the Cauchy distribution, available as stats.cauchy). Take samples of increasing size and show that the distribution of sample means does NOT converge to normal. This is a fascinating edge case.
Guidance
import numpy as np
from scipy import stats

cauchy_pop = stats.cauchy.rvs(size=1_000_000)
# The Cauchy distribution has no defined mean or variance!
# Sample means will NOT converge to normal
for n in [10, 100, 1000]:
    means = [np.mean(np.random.choice(cauchy_pop, n)) for _ in range(5000)]
    # Plot and observe — the distribution stays heavy-tailed
Exercise 21.21 — Normal approximation to the binomial
The normal distribution can approximate the binomial when n is large. Demonstrate this: (a) Plot Binomial(n=50, p=0.3) as a bar chart (b) Overlay Normal(mean = n*p = 50*0.3 = 15, std = sqrt(50*0.3*0.7)) as a curve (c) Compare specific probabilities (e.g., P(X <= 10), P(X > 20)) (d) At what value of n does the approximation become "good enough"? (Try n = 5, 10, 20, 50, 100)
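Guidance
A sketch for part (c); repeating it over the listed n values covers part (d). The continuity correction (comparing at 10.5 rather than 10) is a standard refinement, not something the exercise mandates:

```python
import numpy as np
from scipy import stats

n, p = 50, 0.3
binom = stats.binom(n, p)
approx = stats.norm(n * p, np.sqrt(n * p * (1 - p)))  # N(15, ~3.24)

# Compare tail probabilities, using a continuity correction for the approximation
exact_le_10 = binom.cdf(10)
approx_le_10 = approx.cdf(10.5)
exact_gt_20 = 1 - binom.cdf(20)
approx_gt_20 = 1 - approx.cdf(20.5)
print(exact_le_10, approx_le_10, exact_gt_20, approx_gt_20)
```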
Exercise 21.22 — Distribution mixture
Create a mixture of two normal distributions: 60% from N(30, 5) and 40% from N(60, 8). This simulates data from two distinct populations (like Elena's low-income and high-income vaccination rates). (a) Plot the mixture (b) Show that the mean of the mixture doesn't describe either group well (c) Show that a Q-Q plot reveals the non-normality (d) Compute the mean, median, and standard deviation of the mixture and discuss why these are misleading
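Guidance
A sketch of generating the mixture and computing its summary statistics; the overall mean lands between the two humps and describes neither group. (Shapiro-Wilk is run on the first 5000 values because scipy warns that its p-value is unreliable for larger samples.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 10_000
labels = rng.random(n) < 0.6  # 60% from group A, 40% from group B
mixture = np.where(labels, rng.normal(30, 5, n), rng.normal(60, 8, n))

mean, median, sd = mixture.mean(), np.median(mixture), mixture.std()
print(mean, median, sd)  # mean near 0.6*30 + 0.4*60 = 42, between the humps

# (c) Non-normality also shows up in a formal test, not just the Q-Q plot
stat, pvalue = stats.shapiro(mixture[:5000])
print(pvalue)
```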
Exercise 21.23 — Log-normal distribution
Many real-world quantities (income, city populations, stock prices) follow a log-normal distribution — meaning that the logarithm of the data is normally distributed.
(a) Generate 10,000 samples from a log-normal distribution using np.random.lognormal()
(b) Show that the data itself is right-skewed
(c) Show that log(data) is approximately normal (histogram + Q-Q plot)
(d) Explain why income might follow a log-normal distribution (hint: think about multiplicative growth)
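Guidance
A sketch of parts (a)-(c) using skewness as the numerical check (the histogram and Q-Q plot follow the same pattern as earlier exercises; sigma=0.8 is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# lognormal's parameters are the mean and SD of log(data), not of the data itself
data = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)

raw_skew = stats.skew(data)          # (b) strongly positive: right-skewed
log_skew = stats.skew(np.log(data))  # (c) near zero: log(data) is ~normal
print(raw_skew, log_skew)
```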
Exercise 21.24 — Comparing populations with normal models
Two schools report SAT scores: School A has mean=1050, SD=200 and School B has mean=1120, SD=150. Assuming normality: (a) What percentage of School A students score above 1200? (b) What percentage of School B students score above 1200? (c) At what score does School B's proportion exceed School A's? (d) If you randomly select one student from each school, what's the probability the School A student scores higher? (This requires knowledge that the difference of two normals is also normal.)
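Guidance
A sketch of all four parts. For (c), the two tail proportions cross where the z-scores coincide; below that score School B's proportion is larger, above it School A's larger spread wins. For (d), A - B is normal with mean -70 and SD sqrt(200^2 + 150^2) = 250:

```python
import numpy as np
from scipy import stats

A = stats.norm(1050, 200)  # School A
B = stats.norm(1120, 150)  # School B

p_a_above_1200 = 1 - A.cdf(1200)  # (a)
p_b_above_1200 = 1 - B.cdf(1200)  # (b)

# (c) solve (x - 1050)/200 = (x - 1120)/150 for the crossover score
crossover = (200 * 1120 - 150 * 1050) / (200 - 150)

# (d) P(A student scores higher) = P(A - B > 0)
diff = stats.norm(1050 - 1120, np.sqrt(200**2 + 150**2))
p_a_beats_b = 1 - diff.cdf(0)
print(p_a_above_1200, p_b_above_1200, crossover, p_a_beats_b)
```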
Part D: Extension Problems (Challenge)
Exercise 21.25 — The multivariate normal
Extend your understanding to two dimensions. Generate 1000 points from a bivariate normal distribution with correlation 0.7. Plot the scatter plot. Add contour lines showing the density. Explain how the correlation affects the shape of the cloud.
Guidance
import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[1, 0.7], [0.7, 1]]  # unit variances, correlation 0.7
data = np.random.multivariate_normal(mean, cov, 1000)
plt.scatter(data[:, 0], data[:, 1], alpha=0.3)
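The contour lines can come from evaluating the bivariate density on a grid with scipy.stats.multivariate_normal; a sketch (the grid range of -3 to 3 is an arbitrary choice):

```python
import numpy as np
from scipy import stats

mean = [0, 0]
cov = [[1, 0.7], [0.7, 1]]
rv = stats.multivariate_normal(mean, cov)

# Density on a grid; pass to plt.contour(xg, yg, density) to overlay the ellipses
xg, yg = np.meshgrid(np.linspace(-3, 3, 201), np.linspace(-3, 3, 201))
density = rv.pdf(np.dstack((xg, yg)))
print(density.max())  # peak at the mean: 1 / (2*pi*sqrt(det(cov)))
```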
Exercise 21.26 — Bootstrap distribution
The bootstrap is a simulation technique for estimating the distribution of a statistic. Given a sample, you resample WITH replacement many times and compute the statistic each time. (a) Take a sample of 50 values from an exponential distribution (b) Resample with replacement 5000 times, computing the mean each time (c) Plot the bootstrap distribution of means (d) Compare it to the CLT prediction (e) Compute a 95% "bootstrap confidence interval" using the 2.5th and 97.5th percentiles
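Guidance
A sketch of the full pipeline (scale and seed are arbitrary); the bootstrap SE should land very close to the CLT prediction computed from the sample itself:

```python
import numpy as np

rng = np.random.default_rng(11)
sample = rng.exponential(scale=2.0, size=50)  # (a) one observed sample

# (b) resample WITH replacement, recording the mean each time
boot_means = np.array([rng.choice(sample, size=50, replace=True).mean()
                       for _ in range(5000)])

clt_se = sample.std(ddof=1) / np.sqrt(50)  # (d) CLT prediction of the SE

# (e) percentile bootstrap 95% confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(sample.mean(), boot_means.std(), clt_se, ci_low, ci_high)
```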
Exercise 21.27 — Power of normality tests
How good are normality tests at detecting non-normality? Generate samples from a slightly right-skewed distribution (e.g., chi-squared with 10 df). For sample sizes of 20, 50, 100, 500, and 2000, run the Shapiro-Wilk test 1000 times and record the proportion of times it rejects normality (at alpha=0.05). Plot the "detection rate" vs. sample size. What do you conclude about the power of normality tests?
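Guidance
A sketch of the simulation loop (reduced to 200 trials per size to keep it fast; the exercise asks for 1000). The detection rate should climb steeply with sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample_sizes = [20, 50, 100, 500, 2000]
n_trials = 200  # increase to 1000 for the full exercise
detection_rate = {}
for n in sample_sizes:
    # Proportion of slightly-skewed samples the test flags as non-normal
    pvals = [stats.shapiro(rng.chisquare(df=10, size=n)).pvalue
             for _ in range(n_trials)]
    detection_rate[n] = np.mean(np.array(pvals) < 0.05)
print(detection_rate)
```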
Exercise 21.28 — Build a distribution explorer
Create an interactive (or at least parameterized) function that lets you explore different distributions. The function should accept a distribution name and parameters, and produce: a PDF plot, a CDF plot, random samples histogram, the mean, standard deviation, skewness, and kurtosis. Support at least: normal, uniform, exponential, binomial, and Poisson.
Exercise 21.29 — The German Tank Problem
In World War II, statisticians estimated the total number of German tanks by looking at the serial numbers on captured tanks. If you capture tanks with serial numbers 32, 85, 17, 93, and 41, how would you estimate the total number of tanks produced? The frequentist estimate is max(serial numbers) + max/n - 1. Simulate this problem: generate serial numbers from 1 to N (unknown), randomly "capture" some, and compare the estimate to the true N. How does the accuracy depend on sample size?
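Guidance
A sketch of the estimator and the simulation (true_n and the capture sizes are arbitrary choices). Averaged over many simulated captures, the estimate centers on the true N, and its spread shrinks rapidly as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(9)

def tank_estimate(serials):
    # frequentist estimator from the exercise: max + max/n - 1
    serials = np.asarray(serials)
    m, k = serials.max(), len(serials)
    return m + m / k - 1

print(tank_estimate([32, 85, 17, 93, 41]))  # 93 + 93/5 - 1 = 110.6

true_n = 1000
results = {}
for k in [5, 20, 50]:
    # "Capture" k distinct serial numbers from 1..true_n, many times over
    estimates = [tank_estimate(rng.choice(np.arange(1, true_n + 1),
                                          size=k, replace=False))
                 for _ in range(2000)]
    results[k] = (np.mean(estimates), np.std(estimates))
print(results)
```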
Exercise 21.30 — Your own distribution analysis
Find a real-world dataset (or use one from your progressive project) and perform a complete distribution analysis:
(a) Identify the variable and describe what it measures
(b) Plot the distribution (histogram + KDE)
(c) Check normality (Q-Q plot + Shapiro-Wilk)
(d) If not normal, identify the distribution family that best fits (try several using scipy.stats.fit())
(e) Use the fitted distribution to compute a meaningful probability
(f) Write up your analysis in 300-500 words
Reflection
After completing these exercises, you should be comfortable with:
- [ ] Distinguishing discrete and continuous distributions
- [ ] Using scipy.stats to compute PDF, CDF, inverse CDF, and random samples
- [ ] Applying the empirical rule (68-95-99.7) for normal distributions
- [ ] Using z-scores to compare values across different scales
- [ ] Explaining the Central Limit Theorem and demonstrating it through simulation
- [ ] Checking normality with Q-Q plots and the Shapiro-Wilk test
- [ ] Knowing when the normal distribution is and isn't appropriate
If all of these feel solid, you have a strong foundation for the inferential statistics of Chapters 22-23. The concepts of sampling distributions, standard error, and the CLT are the engine that powers confidence intervals and hypothesis tests.