Chapter 21 Exercises: Distributions and the Normal Curve

How to use these exercises: Each section builds in complexity. For simulation exercises, always compare your simulated results with analytical (formula-based) results when possible — the agreement between them builds confidence in both approaches.

Difficulty key: Foundational | Intermediate | Advanced | Extension


Part A: Conceptual Understanding (Foundational)


Exercise 21.1: Discrete vs. continuous

Classify each of the following random variables as discrete or continuous. For each, name a distribution that might be appropriate.

(a) The number of emails you receive in an hour (b) The height of a randomly selected adult (c) The number of heads in 20 coin flips (d) The time until the next bus arrives (e) The number of customers who buy a product today (f) A student's GPA

Guidance (a) Discrete — Poisson distribution (counting events in a fixed interval). (b) Continuous — Normal distribution (symmetric, bell-shaped). (c) Discrete — Binomial distribution (fixed trials, two outcomes). (d) Continuous — Exponential distribution (time until an event). (e) Discrete — Poisson or Binomial depending on whether there's a fixed population. (f) Technically continuous (can take values like 3.47), though in practice it's computed from discrete grades. Could be modeled as continuous.

Exercise 21.2: The empirical rule

The scores on a standardized exam are normally distributed with a mean of 500 and a standard deviation of 100.

(a) What range contains the middle 68% of scores? (b) What percentage of students score above 700? (c) What percentage score between 300 and 700? (d) A score of 200 is how many standard deviations below the mean? Is this score unusual? (e) If 10,000 students take the exam, approximately how many score above 800?

Guidance (a) 400 to 600 (mean plus/minus 1 SD). (b) 700 is 2 SDs above the mean. About 2.5% score above 700 (half of the 5% outside 2 SDs). (c) 300-700 is mean plus/minus 2 SDs = about 95%. (d) z = (200-500)/100 = -3. Three standard deviations below — very unusual (only about 0.15% of scores are this low). (e) 800 is 3 SDs above. About 0.15% score above 800. 10,000 * 0.0015 ≈ 15 students.
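The rule-of-thumb answers above can be double-checked against exact normal probabilities. A minimal sketch with scipy.stats (the exact tail areas differ slightly from the 68-95-99.7 approximations, which is why part (e) comes out near 13-15 students):

```python
from scipy import stats

exam = stats.norm(500, 100)  # exam scores: mean 500, SD 100

p_above_700 = 1 - exam.cdf(700)               # exact ≈ 0.0228 (rule of thumb: 2.5%)
p_300_to_700 = exam.cdf(700) - exam.cdf(300)  # exact ≈ 0.9545 (rule of thumb: 95%)
z_200 = (200 - 500) / 100                     # -3.0 → three SDs below the mean
n_above_800 = 10_000 * (1 - exam.cdf(800))    # exact ≈ 13.5 students
```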

Exercise 21.3: PDF vs. CDF

(a) In your own words, explain what the PDF tells you and what the CDF tells you. (b) If the PDF is tall at a particular x-value, what does that mean? (c) The CDF at x=70 is 0.84. What does this mean in words? (d) How do you use the CDF to find P(60 < X < 80)? (e) For a continuous distribution, why is P(X = exactly 3.7) equal to zero?

Guidance (a) PDF tells you where probability is concentrated (density). CDF tells you the cumulative probability up to a given point — P(X <= x). (b) It means probability is concentrated there — values near x are more likely than values where the PDF is low. But the height itself is NOT a probability. (c) 84% of the values are less than or equal to 70. (d) P(60 < X < 80) = CDF(80) - CDF(60). (e) Because for a continuous variable, there are infinitely many possible values. The probability of hitting any exact value is zero — probabilities only exist for ranges (intervals).
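The points in (c)-(e) can be illustrated in a few lines. This sketch uses N(60, 10), chosen so that CDF(70) ≈ 0.84 matches part (c); the parameters are illustrative, not from the exercise:

```python
from scipy import stats

dist = stats.norm(60, 10)  # illustrative: chosen so CDF(70) ≈ 0.84

p_le_70 = dist.cdf(70)                    # ≈ 0.841 → "84% of values are 70 or less"
p_60_to_80 = dist.cdf(80) - dist.cdf(60)  # part (d): P(60 < X < 80) via the CDF

# The PDF height at a point is a density, not a probability —
# P(X == exactly 70) is 0 for any continuous distribution
density_at_70 = dist.pdf(70)
```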

Exercise 21.4: Z-score interpretation

Jordan scored 78 on a test where the mean was 72 and the standard deviation was 8. Priya scored 85 on a different test where the mean was 80 and the standard deviation was 10.

(a) Compute z-scores for both. (b) Who performed better relative to their class? (c) Assuming normal distributions, what percentile is each student at? (d) What score would Jordan need to be at the 95th percentile?

Guidance (a) Jordan: z = (78-72)/8 = 0.75. Priya: z = (85-80)/10 = 0.50. (b) Jordan performed better relative to their class (higher z-score). (c) Jordan: stats.norm.cdf(0.75) ≈ 0.773 → 77th percentile. Priya: stats.norm.cdf(0.50) ≈ 0.691 → 69th percentile. (d) The 95th percentile z-score is 1.645. Score = 72 + 1.645 * 8 = 85.2.
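The guidance computations above translate directly into code — a quick sketch:

```python
from scipy import stats

z_jordan = (78 - 72) / 8    # 0.75
z_priya = (85 - 80) / 10    # 0.50

pct_jordan = stats.norm.cdf(z_jordan)  # ≈ 0.773 → 77th percentile
pct_priya = stats.norm.cdf(z_priya)    # ≈ 0.691 → 69th percentile

# Part (d): invert the 95th-percentile z-score back onto Jordan's test scale
jordan_target = 72 + stats.norm.ppf(0.95) * 8  # ≈ 85.2
```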

Exercise 21.5: CLT in plain English

Explain the Central Limit Theorem to a friend who hasn't taken a statistics class. Use an analogy or example. Make sure to address these points: (a) Does the population need to be normal for the CLT to work? (b) What role does sample size play? (c) Why is this theorem important for surveys and polls?

Guidance No, the population does NOT need to be normal — that's the magic. Even if individual values follow a weird distribution, the average of a sample tends toward a normal shape. Larger samples produce better bell curves. This matters for polls because it means we can estimate population characteristics from samples and quantify our uncertainty — even if the underlying opinions are not normally distributed.

Exercise 21.6: Matching distributions

Match each scenario with the most appropriate distribution:

Scenario Distribution
Height of adult women
Number of typos on a page
Whether each of 50 people buys a product (yes/no)
Temperature tomorrow (in degrees)
Random number between 1 and 100
Number of cars passing a toll booth per hour

Options: Normal, Binomial, Poisson, Uniform

Guidance Height → Normal. Typos → Poisson. 50 purchases → Binomial. Temperature → Normal. Random number 1-100 → Uniform. Cars per hour → Poisson.

Exercise 21.7: Reading Q-Q plots

Describe what the following Q-Q plot patterns indicate about the data: (a) Points fall exactly on the diagonal line (b) Points curve upward at the right end and downward at the left end (c) Points curve upward at both ends (d) Points follow the line in the middle but deviate at both tails

Guidance (a) Data is approximately normal. (b) Data has heavier tails than normal (leptokurtic) — sample quantiles are more extreme than the theoretical ones in both directions, producing an S-shape. (c) Data is right-skewed — the left tail is compressed and the right tail stretched, so the whole curve bends upward (convex). (d) Same as (b) — heavy tails. This is common in financial data.

Exercise 21.8: Binomial in context

A vaccine is 88% effective. In a group of 200 vaccinated people exposed to the disease: (a) What is the expected number who develop immunity? (b) What is the standard deviation? (c) What is the probability that fewer than 170 develop immunity? (d) Is it reasonable to approximate this with a normal distribution? Why?

Guidance (a) E(X) = n*p = 200 * 0.88 = 176. (b) SD = sqrt(n*p*(1-p)) = sqrt(200 * 0.88 * 0.12) = sqrt(21.12) ≈ 4.6. (c) Use stats.binom.cdf(169, 200, 0.88) ≈ 0.067 (about 6.7%). (d) Yes — when n*p and n*(1-p) are both > 5, the normal approximation to the binomial is reasonable. Here n*p = 176 and n*(1-p) = 24, both well above 5.
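The guidance above can be reproduced with scipy.stats — a sketch (the continuity-corrected normal approximation in the last line is one common way to do part (d)):

```python
import numpy as np
from scipy import stats

n, p = 200, 0.88
expected = n * p                          # part (a): 176
sd = np.sqrt(n * p * (1 - p))             # part (b): sqrt(21.12) ≈ 4.6
p_fewer_170 = stats.binom.cdf(169, n, p)  # part (c): exact binomial

# Part (d): normal approximation with a continuity correction —
# reasonable here because n*p = 176 and n*(1-p) = 24 both exceed 5
approx = stats.norm.cdf(169.5, expected, sd)
```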

Part B: Applied Coding (Intermediate)


Exercise 21.9: scipy.stats exploration

Using scipy.stats.norm, compute the following for a normal distribution with mean 100 and standard deviation 15: (a) P(X < 85) (b) P(X > 120) (c) P(90 < X < 110) (d) The value at the 25th percentile (e) The value at the 99th percentile (f) Generate 1000 random samples and verify that the sample mean is close to 100

from scipy import stats
dist = stats.norm(100, 15)
# Use dist.cdf(), dist.ppf(), dist.rvs() to answer each part
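One way those three methods fit together — a sketch, with the seed and variable names as illustrative choices:

```python
import numpy as np
from scipy import stats

dist = stats.norm(100, 15)

p_below_85 = dist.cdf(85)                 # part (a): left-tail probability
p_above_120 = 1 - dist.cdf(120)           # part (b): right tail via the complement
p_between = dist.cdf(110) - dist.cdf(90)  # part (c): difference of two CDF values
q25 = dist.ppf(0.25)                      # part (d): ppf inverts the CDF

samples = dist.rvs(size=1000, random_state=0)
sample_mean = np.mean(samples)            # part (f): should land near 100
```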

Exercise 21.10: Distribution fitting

Generate 1000 samples from each of these distributions: normal, exponential, and uniform. For each: (a) Plot a histogram (b) Overlay the theoretical PDF (c) Create a Q-Q plot against the normal distribution (d) Run the Shapiro-Wilk test

Present all results in a 3x3 grid of subplots.


Exercise 21.11: CLT demonstration

Write a function demonstrate_clt(population, sample_sizes, n_samples) that: (a) Takes any population array and a list of sample sizes (b) For each sample size, draws n_samples random samples and computes their means (c) Plots the distribution of sample means for each sample size (d) Overlays the theoretical normal curve predicted by the CLT (e) Prints the mean and standard error for each sample size

Test it with three different population shapes: uniform, exponential, and bimodal.

Guidance
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def demonstrate_clt(population, sample_sizes, n_samples=5000):
    pop_mean = np.mean(population)
    pop_std = np.std(population)

    fig, axes = plt.subplots(1, len(sample_sizes), figsize=(5*len(sample_sizes), 4))
    for ax, n in zip(axes, sample_sizes):
        means = [np.mean(np.random.choice(population, n)) for _ in range(n_samples)]
        se = pop_std / np.sqrt(n)
        print(f"n={n}: mean of sample means = {np.mean(means):.3f}, SE = {se:.3f}")  # part (e)
        ax.hist(means, bins=40, density=True, alpha=0.7)
        x = np.linspace(min(means), max(means), 100)
        # Overlay the CLT prediction: Normal(pop_mean, pop_std / sqrt(n))
        ax.plot(x, stats.norm.pdf(x, pop_mean, se), 'r-', linewidth=2)
        ax.set_title(f'n={n}')
    plt.tight_layout()
    plt.show()

Exercise 21.12: Binomial simulation vs. formula

A fair coin is flipped 100 times. Using both simulation (flip 100 coins 10,000 times) and scipy.stats.binom: (a) Estimate P(exactly 50 heads) (b) Estimate P(45 or fewer heads) (c) Estimate P(between 45 and 55 heads, inclusive) (d) Compare the binomial probabilities to the normal approximation N(50, 5)
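A possible starting sketch for parts (a) and (b), putting the simulated frequencies next to the exact binomial values (the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heads = rng.binomial(n=100, p=0.5, size=10_000)  # 10,000 runs of 100 flips

sim_exactly_50 = np.mean(heads == 50)            # part (a), simulated
sim_at_most_45 = np.mean(heads <= 45)            # part (b), simulated

exact_50 = stats.binom.pmf(50, 100, 0.5)         # part (a), formula ≈ 0.080
at_most_45 = stats.binom.cdf(45, 100, 0.5)       # part (b), formula ≈ 0.184
```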


Exercise 21.13: Poisson modeling

Marcus's bakery averages 42 customers per hour during the morning rush (7-9 AM). (a) Model this using a Poisson distribution. What's the probability of getting 50+ customers in a given hour? (b) What's the probability of getting fewer than 30? (c) Simulate 1000 hours and compare the simulated distribution to the theoretical Poisson (d) Marcus can handle up to 55 customers per hour before service quality drops. How often will this happen?
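A sketch of the Poisson pieces (note the off-by-one care: P(X >= 50) is 1 minus the CDF at 49, not at 50; the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rate = 42  # average customers per hour

p_50_plus = 1 - stats.poisson.cdf(49, rate)  # part (a): P(X >= 50)
p_under_30 = stats.poisson.cdf(29, rate)     # part (b): P(X < 30)
p_over_55 = 1 - stats.poisson.cdf(55, rate)  # part (d): P(X > 55)

rng = np.random.default_rng(1)
hours = rng.poisson(rate, size=1000)         # part (c): 1000 simulated hours
```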


Exercise 21.14: Z-score probability calculator

Write a function normal_probability(mean, std, lower=None, upper=None) that computes:

  • P(X < upper) if only upper is given
  • P(X > lower) if only lower is given
  • P(lower < X < upper) if both are given

It should also print a plain-English interpretation. Test with real-world examples.

Guidance
from scipy import stats

def normal_probability(mean, std, lower=None, upper=None):
    dist = stats.norm(mean, std)
    if lower is not None and upper is not None:
        prob = dist.cdf(upper) - dist.cdf(lower)
        print(f"P({lower} < X < {upper}) = {prob:.4f} ({prob*100:.1f}%)")
    elif upper is not None:
        prob = dist.cdf(upper)
        print(f"P(X < {upper}) = {prob:.4f} ({prob*100:.1f}%)")
    elif lower is not None:
        prob = 1 - dist.cdf(lower)
        print(f"P(X > {lower}) = {prob:.4f} ({prob*100:.1f}%)")
    else:
        raise ValueError("Provide at least one of lower or upper")
    return prob

Exercise 21.15: Standard error exploration

Using simulation, demonstrate that the standard error of the mean equals sigma/sqrt(n): (a) Generate a population with known sigma (b) For sample sizes n = 5, 10, 25, 50, 100, 500, take 5000 samples of each size (c) Compute the standard deviation of the 5000 sample means for each n (d) Plot the simulated standard errors alongside the theoretical sigma/sqrt(n) (e) How well do they match?
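One way to set up parts (a)-(c) — a sketch using an exponential population (the population choice and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=10, size=100_000)  # part (a): sigma ≈ 10
sigma = population.std()

standard_errors = {}
for n in [5, 10, 25, 50, 100, 500]:
    # 5000 samples of size n, drawn at once as a (5000, n) array
    sample_means = rng.choice(population, size=(5000, n)).mean(axis=1)
    # part (c) simulated SE vs. the theoretical sigma / sqrt(n)
    standard_errors[n] = (sample_means.std(), sigma / np.sqrt(n))
```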


Exercise 21.16: Comparing distributions visually

Create a figure showing four distributions on the same axes: N(0,1), N(0,2), N(3,1), and N(3,2). Use different colors and a legend. Explain how changing the mean shifts the curve and how changing the standard deviation changes the width.


Exercise 21.17: Real data normality check

Load a real dataset (suggestions: seaborn's 'tips' dataset, or any CSV with numerical columns). Pick two numerical columns. For each: (a) Plot a histogram with normal overlay (b) Create a Q-Q plot (c) Run the Shapiro-Wilk test (d) Compute skewness and kurtosis (e) Conclude whether the normal distribution is a reasonable model


Exercise 21.18: Inverse CDF problems

Using scipy.stats.norm.ppf(): (a) What IQ score puts you in the top 1%? (IQ: mean=100, SD=15) (b) What height is the 10th percentile for adult women? (Mean=64 inches, SD=2.5 inches) (c) A quality control process rejects items whose weight deviates by more than 2% from the target of 500g (SD=5g). What weight range is acceptable? (d) Find the z-score that marks the boundary of the top 5%
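A sketch of the ppf calls; note that in part (c) the 2% tolerance is on the weight itself (an acceptable band of 500 ± 10 g), so the normal model is used to find the fraction accepted rather than an inverse-CDF value:

```python
from scipy import stats

iq_top_1pct = stats.norm.ppf(0.99, loc=100, scale=15)  # part (a): ≈ 134.9
height_10th = stats.norm.ppf(0.10, loc=64, scale=2.5)  # part (b): ≈ 60.8 inches
z_top_5pct = stats.norm.ppf(0.95)                      # part (d): ≈ 1.645

# Part (c): acceptable weights are 490 g to 510 g (500 ± 2%);
# the fraction of items falling inside that band:
p_accepted = stats.norm.cdf(510, 500, 5) - stats.norm.cdf(490, 500, 5)  # ≈ 0.954
```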


Part C: Synthesis and Real-World Application (Advanced)


Exercise 21.19: Elena's complete distribution analysis

Using the progressive project vaccination data, perform a complete distribution analysis: (a) Fit normal distributions to vaccination rates for each income group (b) Create Q-Q plots for each group (c) For groups where normality holds, use the fitted normal to answer: "What percentage of countries in this income group have vaccination rates below 70%?" (d) For groups where normality doesn't hold, suggest a better model and explain why


Exercise 21.20: When the CLT breaks down

The CLT requires the population to have a finite mean and variance. Find a distribution where these conditions are violated (hint: the Cauchy distribution, available as stats.cauchy). Take samples of increasing size and show that the distribution of sample means does NOT converge to normal. This is a fascinating edge case.

Guidance
import numpy as np
from scipy import stats

cauchy_pop = stats.cauchy.rvs(size=1_000_000)  # for a histogram of the population itself
# The Cauchy distribution has no defined mean or variance!
# Sample means will NOT converge to normal, no matter how large n gets
for n in [10, 100, 1000]:
    means = [np.mean(stats.cauchy.rvs(size=n)) for _ in range(5000)]
    # Plot and observe — the distribution stays heavy-tailed

Exercise 21.21: Normal approximation to the binomial

The normal distribution can approximate the binomial when n is large. Demonstrate this: (a) Plot Binomial(n=50, p=0.3) as a bar chart (b) Overlay Normal(mean = n*p = 50*0.3 = 15, std = sqrt(n*p*(1-p)) = sqrt(50*0.3*0.7) ≈ 3.24) as a curve (c) Compare specific probabilities (e.g., P(X <= 10), P(X > 20)) (d) At what value of n does the approximation become "good enough"? (Try n = 5, 10, 20, 50, 100)


Exercise 21.22: Distribution mixture

Create a mixture of two normal distributions: 60% from N(30, 5) and 40% from N(60, 8). This simulates data from two distinct populations (like Elena's low-income and high-income vaccination rates). (a) Plot the mixture (b) Show that the mean of the mixture doesn't describe either group well (c) Show that a Q-Q plot reveals the non-normality (d) Compute the mean, median, and standard deviation of the mixture and discuss why these are misleading


Exercise 21.23: Log-normal distribution

Many real-world quantities (income, city populations, stock prices) follow a log-normal distribution — meaning that the logarithm of the data is normally distributed. (a) Generate 10,000 samples from a log-normal distribution using np.random.lognormal() (b) Show that the data itself is right-skewed (c) Show that log(data) is approximately normal (histogram + Q-Q plot) (d) Explain why income might follow a log-normal distribution (hint: think about multiplicative growth)
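Parts (b) and (c) can be checked numerically before any plotting — a sketch using sample skewness (the parameters and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.lognormal(mean=0, sigma=1, size=10_000)  # part (a)

skew_raw = stats.skew(data)             # strongly positive → right-skewed (part b)
skew_logged = stats.skew(np.log(data))  # near 0 → log(data) is ≈ normal (part c)
```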


Exercise 21.24: Comparing populations with normal models

Two schools report SAT scores: School A has mean=1050, SD=200 and School B has mean=1120, SD=150. Assuming normality: (a) What percentage of School A students score above 1200? (b) What percentage of School B students score above 1200? (c) At what score does School B's proportion exceed School A's? (d) If you randomly select one student from each school, what's the probability the School A student scores higher? (This requires knowledge that the difference of two normals is also normal.)
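For part (d), if A and B are independent normals, then A − B is normal with mean mu_A − mu_B and SD sqrt(sd_A² + sd_B²). A sketch (independence of the two scores is an assumption):

```python
import numpy as np
from scipy import stats

school_a = stats.norm(1050, 200)
school_b = stats.norm(1120, 150)

p_a_above = 1 - school_a.cdf(1200)  # part (a): ≈ 0.227
p_b_above = 1 - school_b.cdf(1200)  # part (b): ≈ 0.297

# Part (d): A - B ~ Normal(1050 - 1120, sqrt(200**2 + 150**2)) = Normal(-70, 250),
# assuming the two students' scores are independent
p_a_wins = 1 - stats.norm.cdf(0, loc=-70, scale=np.sqrt(200**2 + 150**2))
```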


Part D: Extension Problems (Challenge)


Exercise 21.25: The multivariate normal

Extend your understanding to two dimensions. Generate 1000 points from a bivariate normal distribution with correlation 0.7. Plot the scatter plot. Add contour lines showing the density. Explain how the correlation affects the shape of the cloud.

Guidance
import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[1, 0.7], [0.7, 1]]  # unit variances, correlation 0.7
data = np.random.multivariate_normal(mean, cov, 1000)
plt.scatter(data[:, 0], data[:, 1], alpha=0.3)

Exercise 21.26: Bootstrap distribution

The bootstrap is a simulation technique for estimating the distribution of a statistic. Given a sample, you resample WITH replacement many times and compute the statistic each time. (a) Take a sample of 50 values from an exponential distribution (b) Resample with replacement 5000 times, computing the mean each time (c) Plot the bootstrap distribution of means (d) Compare it to the CLT prediction (e) Compute a 95% "bootstrap confidence interval" using the 2.5th and 97.5th percentiles
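The resampling loop described above can be sketched as follows (the seed and the exponential scale are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.exponential(scale=2.0, size=50)  # part (a): one observed sample

boot_means = np.array([
    rng.choice(sample, size=50, replace=True).mean()  # part (b): resample WITH replacement
    for _ in range(5000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # part (e): percentile CI
clt_se = sample.std() / np.sqrt(50)  # part (d): the CLT-predicted standard error
```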


Exercise 21.27: Power of normality tests

How good are normality tests at detecting non-normality? Generate samples from a slightly right-skewed distribution (e.g., chi-squared with 10 df). For sample sizes of 20, 50, 100, 500, and 2000, run the Shapiro-Wilk test 1000 times and record the proportion of times it rejects normality (at alpha=0.05). Plot the "detection rate" vs. sample size. What do you conclude about the power of normality tests?


Exercise 21.28: Build a distribution explorer

Create an interactive (or at least parameterized) function that lets you explore different distributions. The function should accept a distribution name and parameters, and produce: a PDF plot, a CDF plot, random samples histogram, the mean, standard deviation, skewness, and kurtosis. Support at least: normal, uniform, exponential, binomial, and Poisson.


Exercise 21.29: The German Tank Problem

In World War II, statisticians estimated the total number of German tanks by looking at the serial numbers on captured tanks. If you capture tanks with serial numbers 32, 85, 17, 93, and 41, how would you estimate the total number of tanks produced? The frequentist estimate is max(serial numbers) + max/n - 1. Simulate this problem: generate serial numbers from 1 to N (unknown), randomly "capture" some, and compare the estimate to the true N. How does the accuracy depend on sample size?
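One possible simulation sketch (true_n, the seed, and the capture sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
true_n = 1000
serials = np.arange(1, true_n + 1)  # tanks numbered 1..N

def tank_estimate(captured):
    m, k = captured.max(), len(captured)
    return m + m / k - 1  # the frequentist estimate from the exercise

mean_abs_error = {}
for k in [5, 10, 20, 50]:
    estimates = [tank_estimate(rng.choice(serials, size=k, replace=False))
                 for _ in range(2000)]
    mean_abs_error[k] = np.mean(np.abs(np.array(estimates) - true_n))
```

Accuracy should improve sharply with sample size — the estimator's error scales roughly like N/k.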


Exercise 21.30: Your own distribution analysis

Find a real-world dataset (or use one from your progressive project) and perform a complete distribution analysis: (a) Identify the variable and describe what it measures (b) Plot the distribution (histogram + KDE) (c) Check normality (Q-Q plot + Shapiro-Wilk) (d) If not normal, identify the distribution family that best fits (try several using scipy.stats.fit()) (e) Use the fitted distribution to compute a meaningful probability (f) Write up your analysis in 300-500 words


Reflection

After completing these exercises, you should be comfortable with:

  • [ ] Distinguishing discrete and continuous distributions
  • [ ] Using scipy.stats to compute PDF, CDF, inverse CDF, and random samples
  • [ ] Applying the empirical rule (68-95-99.7) for normal distributions
  • [ ] Using z-scores to compare values across different scales
  • [ ] Explaining the Central Limit Theorem and demonstrating it through simulation
  • [ ] Checking normality with Q-Q plots and the Shapiro-Wilk test
  • [ ] Knowing when the normal distribution is and isn't appropriate

If all of these feel solid, you have a strong foundation for the inferential statistics of Chapters 22-23. The concepts of sampling distributions, standard error, and the CLT are the engine that powers confidence intervals and hypothesis tests.