Chapter 21 Quiz: Distributions and the Normal Curve

Contributors to Introduction to Data Science


Instructions: Answer all questions before checking solutions. For multiple choice, select the best answer. For short answer, aim for 2-4 clear sentences. Total points: 95.

Section 1: Multiple Choice (10 questions, 4 points each)

Question 1. Which of the following BEST describes a probability distribution?

(A) A histogram of observed data
(B) A mathematical description of the probabilities of all possible outcomes of a random variable
(C) The standard deviation of a dataset
(D) A plot of cumulative frequencies

Answer

**Correct: (B)** A probability distribution is a mathematical model that assigns probabilities to all possible outcomes. A histogram (A) is an empirical summary of observed data — it approximates the distribution but isn't the distribution itself. The standard deviation (C) is a single number, not a distribution. A cumulative frequency plot (D) is related to the CDF but describes observed data rather than a theoretical probability model.

Question 2. According to the empirical rule (68-95-99.7), if test scores are normally distributed with mean 500 and standard deviation 100, approximately what percentage of students score between 300 and 700?

(A) 68%
(B) 95%
(C) 99.7%
(D) 50%

Answer

**Correct: (B)** 300 is 2 standard deviations below the mean (500 - 2 × 100 = 300), and 700 is 2 standard deviations above (500 + 2 × 100 = 700). The empirical rule states that approximately 95% of values fall within 2 standard deviations of the mean.
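The 68-95-99.7 figures are rounded values of the normal CDF, so they can be checked directly. The following is a minimal sketch using `scipy.stats`; the variable names are illustrative only.

```python
from scipy import stats

# Probability within k standard deviations of the mean, for any normal distribution.
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")  # 0.6827, 0.9545, 0.9973

# The same calculation on the test-score scale (mean 500, SD 100).
scores = stats.norm(loc=500, scale=100)
p_300_700 = scores.cdf(700) - scores.cdf(300)
print(f"P(300 < X < 700) = {p_300_700:.4f}")
```

The second calculation matches the "within 2 SD" value because standardizing does not change probabilities.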

Question 3. The Central Limit Theorem states that:

(A) All data eventually becomes normally distributed with enough observations
(B) The sample mean equals the population mean for large samples
(C) The distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's shape
(D) Normal distributions are the most common distributions in nature

Answer

**Correct: (C)** The CLT specifically says that the *distribution of sample means* (not the data itself) approaches normality as n increases, no matter what shape the original population has. (A) is wrong — the raw data doesn't become normal. (B) confuses the CLT with the law of large numbers. (D) is a claim about nature, not about the CLT.

Question 4. A Q-Q plot (sample quantiles on the vertical axis) shows points that fall below the diagonal line at the left end and rise above it at the right end, forming an S-shape. This suggests the data:

(A) Is perfectly normal
(B) Has heavier tails than a normal distribution (more extreme values)
(C) Is left-skewed
(D) Has lighter tails than a normal distribution (fewer extreme values)

Answer

**Correct: (B)** When the points fall below the line at the low end and above it at the high end, the smallest observations are smaller and the largest observations are larger than a normal distribution would predict. This is called "heavy tails" (the distribution is "leptokurtic"). By contrast, points that curve upward at both ends indicate right skew, not heavy tails. The heavy-tail pattern is common in financial return data and helps explain why "once-in-a-century" market crashes happen more often than normal models predict.
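What a heavy-tailed Q-Q plot looks like can be seen without random data by comparing quantiles directly. The sketch below uses Student's t with 3 degrees of freedom as an assumed stand-in for heavy-tailed data (this choice is not from the quiz) and rescales it to unit standard deviation so both distributions are on the same scale.

```python
import numpy as np
from scipy import stats

# Probabilities at which to compare quantiles: far left, quartiles, far right.
probs = np.array([0.001, 0.25, 0.75, 0.999])

normal_q = stats.norm.ppf(probs)

# Assumption: t with df=3 plays the role of "heavy-tailed data".
# Dividing by its standard deviation puts it on the N(0, 1) scale.
df = 3
heavy_q = stats.t.ppf(probs, df) / stats.t.std(df)

for p, nq, hq in zip(probs, normal_q, heavy_q):
    print(f"p={p:<6} normal={nq:7.3f}  heavy-tailed={hq:7.3f}")
```

At the extremes the heavy-tailed quantiles are farther from zero than the normal ones (more negative on the left, more positive on the right), while in the middle they are closer to zero.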

Question 5. Which distribution would you use to model the number of defective items in a batch of 100, if each item has a 2% chance of being defective?

(A) Normal distribution
(B) Poisson distribution
(C) Binomial distribution
(D) Uniform distribution

Answer

**Correct: (C)** This is a classic binomial setting: a fixed number of trials (n=100), two outcomes per trial (defective or not), and a constant probability of "success" (p=0.02). The Poisson could also work as an approximation (since n is large and p is small), but the binomial is the exact model.
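To see how close the Poisson approximation is in this setting, the two probability mass functions can be compared side by side. This is a rough sketch using `scipy.stats.binom` and `scipy.stats.poisson`, with the Poisson rate set to n·p = 2.

```python
from scipy import stats

n, p = 100, 0.02
binom = stats.binom(n, p)        # exact model for the defect count
poisson = stats.poisson(n * p)   # approximation with rate lambda = n * p = 2

for k in range(5):
    print(f"k={k}: binomial={binom.pmf(k):.4f}  poisson={poisson.pmf(k):.4f}")

# Largest disagreement between the two models over all possible counts.
max_gap = max(abs(binom.pmf(k) - poisson.pmf(k)) for k in range(n + 1))
print(f"largest difference in probability: {max_gap:.4f}")
```

The two models agree to within about one percentage point for every count, which is why the Poisson is a reasonable shortcut here even though the binomial is exact.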

Question 6. The standard error of the mean is:

(A) The standard deviation of the population
(B) The standard deviation of the sample
(C) The standard deviation of the distribution of sample means
(D) The mean of the standard deviations across multiple samples

Answer

**Correct: (C)** The standard error (SE) is the standard deviation of the sampling distribution of the mean — it measures how much sample means vary from sample to sample. It equals sigma/sqrt(n), where sigma is the population standard deviation and n is the sample size. It is NOT the same as the population SD (A) or sample SD (B).

Question 7. What does stats.norm.ppf(0.975) compute?

(A) The probability that Z < 0.975
(B) The z-value that has 97.5% of the distribution below it
(C) The PDF at x = 0.975
(D) 97.5% of the standard deviation

Answer

**Correct: (B)** `ppf` stands for "percent point function" — it's the inverse of the CDF. `stats.norm.ppf(0.975)` returns the z-value such that 97.5% of the standard normal distribution falls below it. The answer is approximately 1.96 — a number you'll see frequently in confidence intervals ([Chapter 22](../chapter-22-sampling-estimation/index.md)).
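A quick way to confirm that `ppf` and `cdf` are inverses is to feed the output of one into the other. A minimal sketch:

```python
from scipy import stats

z = stats.norm.ppf(0.975)
print(f"ppf(0.975) = {z:.4f}")                 # about 1.9600
print(f"cdf(z)     = {stats.norm.cdf(z):.3f}")  # back to 0.975

# Leaving 2.5% in each tail puts 95% in the middle.
central = stats.norm.cdf(z) - stats.norm.cdf(-z)
print(f"P(-z < Z < z) = {central:.3f}")
```

This is why 1.96 appears in 95% confidence intervals: it cuts off 2.5% in each tail.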

Question 8. The Poisson distribution is most appropriate for modeling:

(A) Test scores in a large class
(B) Whether each of 50 patients recovers (yes/no)
(C) The number of phone calls arriving at a call center per hour
(D) Heights of adult men

Answer

**Correct: (C)** The Poisson distribution models the count of events occurring in a fixed interval (time, space, etc.) at a constant average rate. Phone calls per hour is a classic Poisson scenario. (A) and (D) are better modeled by the normal distribution. (B) is a binomial setting (fixed n trials with yes/no outcomes).

Question 9. If you increase the sample size from 25 to 100 (quadrupled), the standard error of the mean:

(A) Is cut in half
(B) Is cut to one-quarter
(C) Stays the same
(D) Doubles

Answer

**Correct: (A)** The standard error is sigma/sqrt(n). If n goes from 25 to 100, sqrt(n) goes from 5 to 10 — it doubles. Since SE = sigma/sqrt(n), the SE is cut in half. To cut the SE in half, you need to quadruple the sample size. This has practical implications: each doubling of precision requires four times as much data.
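The square-root relationship is easy to verify numerically. In the sketch below, `standard_error` is a helper defined here for illustration, and sigma = 10 is a hypothetical value (the question does not specify one).

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population SD, chosen only for illustration
se_25 = standard_error(sigma, 25)
se_100 = standard_error(sigma, 100)

print(f"n=25:  SE = {se_25:.2f}")
print(f"n=100: SE = {se_100:.2f}")
print(f"ratio: {se_100 / se_25:.2f}")  # quadrupling n halves the SE
```

The ratio is 0.5 regardless of the value chosen for sigma.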

Question 10. A normal distribution with mean 0 and standard deviation 1 is called:

(A) The unit normal
(B) The standard normal
(C) The reference normal
(D) The base normal

Answer

**Correct: (B)** The standard normal distribution, N(0,1), is the reference version of the normal distribution. All other normal distributions can be converted to the standard normal using z-scores: z = (x - mean) / std. This standardization allows us to use a single set of probability tables (or `stats.norm.cdf()` with default parameters) for any normal distribution.
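The equivalence between working on the original scale and working with z-scores can be illustrated as follows. The sketch reuses the test-score distribution from Question 2 (mean 500, SD 100); the score of 650 and the helper `z_score` are made up for this example.

```python
from scipy import stats

def z_score(x: float, mean: float, std: float) -> float:
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std

mean, std = 500, 100
x = 650  # hypothetical score, for illustration only

z = z_score(x, mean, std)
p_direct = stats.norm.cdf(x, loc=mean, scale=std)  # original scale
p_standard = stats.norm.cdf(z)                     # standard normal, default parameters

print(f"z = {z}")
print(f"P(X <= {x}) directly:    {p_direct:.4f}")
print(f"P(Z <= {z}) via z-score: {p_standard:.4f}")
```

Both routes give the same probability, which is what makes a single standard normal table sufficient.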

Section 2: True/False (3 questions, 5 points each)

Question 11. True or False: The height of the PDF at a particular point gives the probability of that exact value.

Answer

**False.** For continuous distributions, the probability of any exact value is zero. The PDF gives the *density*, not the probability. Probabilities are obtained by computing areas under the PDF curve (i.e., integrating). P(a < X < b) = the area under the PDF from a to b = CDF(b) - CDF(a).
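One way to convince yourself that PDF height is not a probability: for a narrow distribution the height can exceed 1, which no probability can do. The sketch below uses a normal distribution with SD 0.1, a value picked only to make the effect visible.

```python
from scipy import stats

narrow = stats.norm(loc=0, scale=0.1)  # hypothetical narrow normal

height = narrow.pdf(0)                     # density at the peak
area = narrow.cdf(0.1) - narrow.cdf(-0.1)  # probability within 1 SD of the mean

print(f"PDF height at 0:     {height:.3f}")  # about 3.989, greater than 1
print(f"P(-0.1 < X < 0.1) = {area:.4f}")     # about 0.6827, a valid probability
```

The density is large because the probability is packed into a small interval; the area over that interval is still below 1.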

Question 12. True or False: The Central Limit Theorem only works if the population is normally distributed.

Answer

**False.** The whole point of the CLT is that it works for ANY population shape — uniform, skewed, bimodal, anything — as long as the population has a finite mean and variance. The CLT says that sample *means* will be approximately normal regardless of the population's shape, provided the sample size is large enough.

Question 13. True or False: For the normal distribution, the mean, median, and mode are all equal.

Answer

**True.** The normal distribution is perfectly symmetric and has a single peak. The peak of the bell curve (mode) is at the center, which is also the mean and the median. This follows from symmetry and unimodality, so it also holds for other symmetric, unimodal distributions, but it fails for skewed distributions, where the three measures pull apart.

Section 3: Short Answer (3 questions, 5 points each)

Question 14. Explain the difference between the PDF and the CDF. When would you use each one?

Answer

The **PDF** (Probability Density Function) shows where probability is concentrated — it tells you which values are more or less likely. You'd use it to visualize the shape of a distribution or to understand the relative likelihood of different values. The **CDF** (Cumulative Distribution Function) shows the cumulative probability up to each point — P(X <= x). You'd use it to answer questions like "what percentage of values are below 70?" or to find percentiles. In practice, the CDF is used more for calculations (computing probabilities of ranges via subtraction), while the PDF is used more for visualization.

Question 15. Why does the standard error decrease as sample size increases? Explain intuitively (not just the formula).

Answer

Intuitively, larger samples include more of the population's diversity. In a small sample, you might by chance get a cluster of high values or low values, leading to a mean that's far from the truth. In a larger sample, the high and low values tend to balance each other out, so the sample mean is more consistently close to the population mean. It's the law of large numbers in action — with more data, the random noise averages out and your estimate becomes more stable.

Question 16. Explain when you should NOT use the normal distribution to model your data. Give two specific examples.

Answer

You should not use the normal distribution when:

1. **The data is bounded or skewed.** For example, income cannot be negative and is typically right-skewed. A normal model would assign positive probability to negative incomes, which is impossible. A log-normal distribution is more appropriate.
2. **The data is discrete counts.** For example, the number of customers per day can only be 0, 1, 2, ..., so a Poisson or binomial distribution is more appropriate. The normal distribution is continuous and extends to negative infinity, which doesn't make sense for counts.

Other valid examples: proportions bounded between 0 and 1, bimodal data, or data with heavy tails.

Section 4: Applied Scenarios (2 questions, 7.5 points each)

Question 17. Elena fits a normal distribution to measles vaccination rates for high-income countries and finds mean = 93% and standard deviation = 4%.

(a) What percentage of high-income countries have vaccination rates below 85%?
(b) What percentage have rates above 97%?
(c) Elena's county has a rate of 88%. What is its z-score, and what does it mean?
(d) Elena knows vaccination rates are bounded above by 100%. Why might the normal distribution be a questionable model here, and what would you look for in a Q-Q plot to check?

Answer

(a) z = (85 - 93)/4 = -2.0. P(X < 85) = `stats.norm.cdf(-2.0)` ≈ 0.0228, or about 2.3%.
(b) z = (97 - 93)/4 = 1.0. P(X > 97) = `1 - stats.norm.cdf(1.0)` ≈ 0.1587, or about 15.9%.
(c) z = (88 - 93)/4 = -1.25. Elena's county is 1.25 standard deviations below the mean for high-income countries: below average but not extreme.
(d) The normal distribution extends infinitely in both directions, but vaccination rates can't exceed 100%. With mean 93% and SD 4%, the normal model assigns noticeable probability to rates above 100% (P(X > 100) ≈ 4%), which is impossible. In a Q-Q plot, you'd see the right tail bending downward (values are "squished" against the 100% ceiling), indicating left skew or a ceiling effect. A beta distribution might be more appropriate.
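The four parts can be checked with a frozen `scipy.stats.norm` object. This is a sketch of one way to do it; `sf` is scipy's survival function, equal to `1 - cdf`.

```python
from scipy import stats

rates = stats.norm(loc=93, scale=4)  # fitted model from the question

p_below_85 = rates.cdf(85)    # part (a)
p_above_97 = rates.sf(97)     # part (b)
z_88 = (88 - 93) / 4          # part (c)
p_above_100 = rates.sf(100)   # part (d): probability the model puts on impossible rates

print(f"(a) P(X < 85)  = {p_below_85:.4f}")
print(f"(b) P(X > 97)  = {p_above_97:.4f}")
print(f"(c) z-score    = {z_88}")
print(f"(d) P(X > 100) = {p_above_100:.4f}")
```

Part (d) makes the modeling problem concrete: the fitted normal places roughly 4% of its probability on rates that cannot occur.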

Question 18. Marcus records daily customer counts at his bakery for a year (365 days). The mean is 85 customers per day with a standard deviation of 15.

(a) If he takes a random week (7 days) and computes the average, what is the standard error of that weekly average?
(b) If he takes a random month (30 days), what is the standard error?
(c) He samples 4 random weeks and gets weekly averages of 79, 91, 83, and 88. Is this variability consistent with what the CLT predicts?
(d) One day, he has 130 customers. What is this day's z-score? Is it unusual?

Answer

(a) SE = 15 / sqrt(7) = 15 / 2.65 ≈ 5.7 customers.
(b) SE = 15 / sqrt(30) = 15 / 5.48 ≈ 2.7 customers.
(c) The CLT predicts weekly averages should cluster within about 2 SEs of the mean: 85 +/- 2(5.7) = 73.6 to 96.4. All four averages (79, 91, 83, 88) fall within this range. The sample standard deviation of these four values is about 5.3, close to the predicted SE of 5.7. Yes, this is consistent.
(d) z = (130 - 85) / 15 = 3.0. This is 3 standard deviations above the mean, which is very unusual (only about 0.13% probability under a normal model). Marcus should investigate: was there a special event, holiday, or error?
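The arithmetic in this answer can be reproduced in a few lines. The sketch below uses the sample standard deviation (`statistics.stdev`, which divides by n - 1) for the four weekly averages and assumes a normal model for the tail probability in part (d).

```python
import math
import statistics
from scipy import stats

mean, sd = 85, 15

se_week = sd / math.sqrt(7)    # part (a)
se_month = sd / math.sqrt(30)  # part (b)

weekly_averages = [79, 91, 83, 88]
sd_weekly = statistics.stdev(weekly_averages)        # part (c): observed spread
low, high = mean - 2 * se_week, mean + 2 * se_week   # rough 2-SE band
all_inside = all(low <= w <= high for w in weekly_averages)

z_130 = (130 - mean) / sd      # part (d)
tail = stats.norm.sf(z_130)    # P(Z > 3) under a normal model

print(f"(a) SE, 7 days:  {se_week:.2f}")
print(f"(b) SE, 30 days: {se_month:.2f}")
print(f"(c) SD of weekly averages: {sd_weekly:.2f}, all within 2 SE: {all_inside}")
print(f"(d) z = {z_130}, upper-tail probability = {tail:.4f}")
```

With only four weekly averages the observed spread is a noisy estimate, so rough agreement with the predicted standard error is all that should be expected.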

Section 5: Code Analysis (2 questions, 5 points each)

Question 19. What does this code produce, and what principle does it demonstrate?

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
population = np.random.exponential(10, 100000)
sample_means = [np.mean(np.random.choice(population, 50)) for _ in range(5000)]

plt.hist(sample_means, bins=50, density=True)
plt.title(f'Mean of means: {np.mean(sample_means):.2f}, SD of means: {np.std(sample_means):.2f}')
plt.show()
```

Answer

This code demonstrates the **Central Limit Theorem**. It builds a population of 100,000 values from an exponential distribution with mean 10 (strongly right-skewed), draws 5000 random samples of size 50 from it (with replacement), and computes the mean of each sample. The resulting histogram of sample means will be approximately **bell-shaped (normal)**, despite the population being far from normal. The mean of the sample means will be approximately 10 (the population mean), and the standard deviation of the sample means will be approximately 10/sqrt(50) ≈ 1.41 (the standard error predicted by the CLT). This illustrates the core message of the Central Limit Theorem: sample means are approximately normal regardless of the population shape.
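The numerical claims in this answer can be checked without plotting. The variant below is a sketch, not the quiz's code: it assumes NumPy's `default_rng` generator and draws each sample directly from the exponential distribution instead of from a stored population, so its exact numbers will differ slightly from the original script.

```python
import numpy as np

rng = np.random.default_rng(42)
sample_size, n_samples = 50, 5000

# Each row is one sample of size 50 from an exponential distribution with mean 10.
samples = rng.exponential(scale=10, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

observed_mean = sample_means.mean()
observed_sd = sample_means.std()
predicted_se = 10 / np.sqrt(sample_size)  # sigma / sqrt(n); sigma = 10 for this exponential

print(f"mean of sample means: {observed_mean:.2f} (population mean 10)")
print(f"SD of sample means:   {observed_sd:.2f} (predicted SE {predicted_se:.2f})")
```

For an exponential distribution the standard deviation equals the mean, which is why sigma = 10 in the predicted standard error.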

Question 20. The following code uses scipy.stats. What values does it print?

```python
from scipy import stats

dist = stats.norm(loc=100, scale=20)

print(f"A: {dist.cdf(100):.2f}")
print(f"B: {dist.cdf(120):.4f}")
print(f"C: {dist.ppf(0.025):.1f}")
print(f"D: {1 - dist.cdf(140):.4f}")
```

Answer

```
A: 0.50
B: 0.8413
C: 60.8
D: 0.0228
```

Explanation:

- **A:** P(X <= 100) = 0.50 because 100 is the mean of the distribution. By symmetry, exactly half the distribution is below the mean.
- **B:** P(X <= 120). 120 is one standard deviation above the mean (z = 1). The CDF at z = 1 is approximately 0.8413.
- **C:** The value at the 2.5th percentile, that is, the value such that 2.5% of the distribution is below it. The corresponding z-value is -1.96, so x = 100 + (-1.96)(20) = 60.8.
- **D:** P(X > 140). 140 is two standard deviations above the mean (z = 2). P(Z > 2) ≈ 0.0228 (2.28%).