Exercises: Sampling Distributions and the Central Limit Theorem

These exercises progress from concept checks through applied CLT analysis. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. In your own words, explain the difference between a sample distribution and a sampling distribution. Why is this distinction important?

A.2. A classmate says: "The Central Limit Theorem says that if you collect a large enough sample, the data in your sample will be normally distributed." What's wrong with this statement? Rewrite it correctly.

A.3. Explain the three guarantees of the Central Limit Theorem: what does the CLT tell us about the (a) shape, (b) center, and (c) spread of the sampling distribution of $\bar{x}$?

A.4. A population has mean $\mu = 50$ and standard deviation $\sigma = 12$. Without doing any calculations, answer: (a) What is the mean of the sampling distribution of $\bar{x}$ for samples of size 36? (b) Will the standard error be larger than, smaller than, or equal to 12? Why? (c) Will the sampling distribution be approximately normal? Under what conditions?

A.5. Explain in your own words why the formula for standard error has $\sqrt{n}$ in the denominator instead of just $n$. What does this imply about the "diminishing returns" of increasing sample size?

A.6. True or false (explain each): (a) The sampling distribution of $\bar{x}$ is always exactly normal. (b) The standard error decreases as the sample size increases. (c) If the population is normal, the sampling distribution of $\bar{x}$ is normal for any sample size. (d) The Central Limit Theorem applies only to means, not proportions. (e) To halve the standard error, you must double the sample size.

A.7. Explain the difference between the law of large numbers and the Central Limit Theorem. Both involve what happens as sample size increases — how do they differ in what they tell us?

Part B: Standard Error Calculations ⭐⭐

B.1. A population has $\mu = 80$ and $\sigma = 20$. Calculate the standard error of the mean for: (a) $n = 16$ (b) $n = 64$ (c) $n = 256$ (d) What pattern do you notice? Express it as a general rule.
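A minimal sketch of the calculation used throughout Part B, for checking hand work. The helper name `standard_error` and the values $\sigma = 15$ and $n = 25, 100, 400$ are illustrative assumptions, chosen so they do not overlap with the exercise's numbers.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)


# Hypothetical population SD (not taken from the exercises).
sigma = 15.0
for n in (25, 100, 400):
    print(f"n = {n:>3}: SE = {standard_error(sigma, n):.2f}")
```

Substituting the values from an exercise is one way to confirm a hand calculation.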

B.2. Sam Okafor is analyzing the Riverside Raptors' scoring data. The team's points per game over the past five seasons have a mean of $\mu = 105$ and standard deviation of $\sigma = 12$.

(a) If Sam takes a random sample of 9 games, what is the standard error of the mean? (b) If he takes a sample of 36 games, what is the standard error? (c) How many games would he need to sample to achieve a standard error of 1 point? (d) Is the answer to (c) practically achievable? The team plays about 82 games per season.

B.3. Alex Rivera's StreamVibe data shows that daily watch time for users has a mean of 52 minutes and a standard deviation of 24 minutes.

(a) For a random sample of 100 users, what is the standard error of the mean watch time? (b) For a random sample of 400 users? (c) How many users would Alex need to sample to have a standard error of 1 minute? (d) StreamVibe has 2 million active users. Does the 10% condition matter here? Explain.

B.4. Dr. Maya Chen is planning a blood pressure study. She knows from prior research that systolic blood pressure has $\sigma \approx 18$ mmHg. She wants the standard error to be no more than 2 mmHg.

(a) What is the minimum sample size she needs? (b) If she decides she wants the SE to be no more than 1 mmHg instead, how does the required sample size change? (c) Explain why the answer to (b) isn't simply double the answer to (a).
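Sample-size questions like this one invert the standard error formula: from $\sigma/\sqrt{n} \le \text{SE}_{\text{target}}$ it follows that $n \ge (\sigma/\text{SE}_{\text{target}})^2$, rounded up to a whole number. The sketch below assumes that rearrangement. The function name `min_sample_size` and the input values are hypothetical and differ from the exercise's.

```python
import math


def min_sample_size(sigma: float, target_se: float) -> int:
    """Smallest integer n with sigma / sqrt(n) <= target_se."""
    if sigma <= 0 or target_se <= 0:
        raise ValueError("sigma and target_se must be positive")
    # Rearranging sigma / sqrt(n) <= target_se gives n >= (sigma / target_se)**2.
    # Rounding up keeps the SE at or below the target.
    return math.ceil((sigma / target_se) ** 2)


# Hypothetical values (not taken from the exercises).
print(min_sample_size(sigma=10.0, target_se=2.5))
print(min_sample_size(sigma=10.0, target_se=3.0))
```

Because the ratio is computed in floating point, a result that should land exactly on a whole number can occasionally round up by one. Checking the returned $n$ against the SE formula guards against that.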

B.5. A political poll wants to estimate the proportion of voters who support a ballot measure. The pollster expects the true proportion to be around 0.60.

(a) Calculate the standard error for $n = 400$. (b) Calculate the standard error for $n = 1{,}600$. (c) By what factor did the standard error decrease when the sample size was quadrupled? (d) What sample size would achieve $\text{SE} = 0.01$ (one percentage point)?
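For proportions, the standard error is $\sqrt{p(1-p)/n}$, and solving for $n$ gives $n \ge p(1-p)/\text{SE}^2$. The following is a hedged sketch of both calculations. The function names and the inputs ($p = 0.5$, $n = 100$, target SE of 0.025) are assumptions for illustration only.

```python
import math


def se_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return math.sqrt(p * (1.0 - p) / n)


def min_n_for_proportion(p: float, target_se: float) -> int:
    """Smallest integer n with sqrt(p * (1 - p) / n) <= target_se."""
    if target_se <= 0:
        raise ValueError("target_se must be positive")
    return math.ceil(p * (1.0 - p) / target_se ** 2)


# Hypothetical values (not taken from the exercises).
print(f"SE for p = 0.5, n = 100: {se_proportion(0.5, 100):.4f}")
print(f"n needed for SE <= 0.025: {min_n_for_proportion(0.5, 0.025)}")
```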

Part C: Applying the CLT ⭐⭐

C.1. The lengths of fish in a lake have $\mu = 32$ cm and $\sigma = 8$ cm. A biologist catches a random sample of 64 fish.

(a) Describe the sampling distribution of $\bar{x}$ (shape, center, spread). (b) What is the probability that the sample mean exceeds 33 cm? (c) What is the probability that the sample mean is between 31 and 33 cm? (d) What sample mean would be so large that only 5% of samples would exceed it?
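Probability questions about $\bar{x}$ can be checked numerically by treating $\bar{x}$ as approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The sketch below uses `statistics.NormalDist` from the Python standard library. The population values ($\mu = 100$, $\sigma = 20$, $n = 25$) are hypothetical, and the normal model is only as good as the CLT conditions behind it.

```python
import math
from statistics import NormalDist

# Hypothetical population and sample size (not taken from the exercises).
mu, sigma, n = 100.0, 20.0, 25

se = sigma / math.sqrt(n)                # standard error of the sample mean
xbar_dist = NormalDist(mu=mu, sigma=se)  # normal model for x-bar

p_above = 1.0 - xbar_dist.cdf(104.0)                    # P(x-bar > 104)
p_between = xbar_dist.cdf(104.0) - xbar_dist.cdf(96.0)  # P(96 < x-bar < 104)
cutoff_95 = xbar_dist.inv_cdf(0.95)                     # exceeded by 5% of sample means

print(f"SE = {se:.2f}")
print(f"P(x-bar > 104) = {p_above:.4f}")
print(f"P(96 < x-bar < 104) = {p_between:.4f}")
print(f"95th percentile of x-bar = {cutoff_95:.2f}")
```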

C.2. The average commute time in a large city is $\mu = 38$ minutes with $\sigma = 15$ minutes. The distribution is right-skewed (some people have very long commutes).

(a) Is the population distribution normal? Does that matter for answering the following questions? (b) For a random sample of 50 commuters, find $P(\bar{x} > 42)$. (c) For a random sample of 50 commuters, find $P(35 < \bar{x} < 41)$. (d) Would your answers change if the sample size were 10 instead of 50? Why?

C.3. Sam learns that the league average three-point percentage is 35.8% with a standard deviation of 4.2 percentage points across all players. If Sam randomly selects 40 players:

(a) What is the standard error of the sample mean three-point percentage? (b) What is the probability that the sample mean exceeds 37%? (c) Between what two values does the middle 95% of possible sample means fall? (Hint: use $z = \pm 1.96$.)

C.4. A quality control engineer at a battery factory knows that battery lifetimes are normally distributed with $\mu = 500$ hours and $\sigma = 40$ hours. She tests a random sample of 25 batteries.

(a) What is the probability that the sample mean lifetime exceeds 510 hours? (b) What is the probability that the sample mean is within 5 hours of the true mean (between 495 and 505)? (c) How many batteries should she test to be 95% sure the sample mean is within 5 hours of the true mean? (Hint: Set $1.96 \times \text{SE} = 5$ and solve for $n$.)

C.5. In a recent election, 48% of voters supported Candidate A. A polling organization took a random sample of 900 voters.

(a) What is the standard error of the sample proportion? (b) Verify that the CLT conditions for proportions are met. (c) What is the probability that the sample proportion exceeds 0.50 (incorrectly suggesting Candidate A has majority support)? (d) What sample size would reduce the standard error to 0.01?
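The sketch below pairs a conditions check with a tail probability for $\hat{p}$. It assumes the common success/failure rule ($np \ge 10$ and $n(1-p) \ge 10$). The cutoff of 10 is an assumption here, so substitute whichever threshold your chapter states. The function name and the values $p = 0.30$, $n = 200$ are hypothetical.

```python
import math
from statistics import NormalDist


def clt_conditions_met(p: float, n: int, threshold: float = 10.0) -> bool:
    """Success/failure check: n*p and n*(1-p) must both reach the threshold.

    The default threshold of 10 is an assumption, not taken from the exercises.
    """
    return n * p >= threshold and n * (1.0 - p) >= threshold


# Hypothetical values (not taken from the exercises).
p, n = 0.30, 200

se = math.sqrt(p * (1.0 - p) / n)       # standard error of p-hat
p_hat_dist = NormalDist(mu=p, sigma=se)  # normal model for p-hat
tail = 1.0 - p_hat_dist.cdf(0.35)        # P(p-hat > 0.35)

print(f"Conditions met: {clt_conditions_met(p, n)}")
print(f"SE = {se:.4f}")
print(f"P(p-hat > 0.35) = {tail:.4f}")
```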

Part D: Simulation and Python ⭐⭐⭐

D.1. Write Python code to demonstrate the CLT using a uniform distribution on $[0, 100]$. Take 10,000 samples each of size $n = 2$, $n = 10$, $n = 30$, and $n = 100$. For each sample size: (a) Plot the sampling distribution of $\bar{x}$. (b) Report the mean, standard deviation, and skewness of the sampling distribution. (c) Overlay a normal curve with the theoretical mean and SE. (d) At what sample size does the sampling distribution first look approximately normal?
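One possible starting point for the simulation exercises is sketched below. It is a hedged outline, not a full solution: it uses a uniform population on $[0, 1]$ rather than $[0, 100]$, only two sample sizes, and no plotting. The helper names `sample_means`, `skewness`, and `draw_uniform` are hypothetical, and the skewness is the simple moment-based version without bias correction.

```python
import numpy as np


def sample_means(rng, draw, n, reps=10_000):
    """Draw `reps` samples of size `n` and return the mean of each sample."""
    samples = draw(rng, (reps, n))  # one row per sample
    return samples.mean(axis=1)


def skewness(x):
    """Moment-based skewness (no bias correction)."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


def draw_uniform(rng, size):
    """Hypothetical population: uniform on [0, 1]."""
    return rng.uniform(0.0, 1.0, size=size)


rng = np.random.default_rng(seed=42)  # fixed seed so runs are repeatable
mu, sigma = 0.5, 1.0 / np.sqrt(12.0)  # mean and SD of a uniform on [0, 1]

results = {}
for n in (2, 30):
    means = sample_means(rng, draw_uniform, n)
    results[n] = means
    print(f"n = {n:>2}: mean = {means.mean():.4f}, "
          f"SD = {means.std(ddof=1):.4f} "
          f"(theory {sigma / np.sqrt(n):.4f}), "
          f"skewness = {skewness(means):+.3f}")
```

Passing a different `draw` function, for example one that wraps an exponential generator, reuses the same loop for a skewed population.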

D.2. Repeat D.1 using a strongly right-skewed population. Create the population using np.random.exponential(scale=10, size=100_000). (a) How does the required sample size for approximate normality compare to the uniform case? (b) Explain why more skewed populations require larger sample sizes.

D.3. Write Python code to verify the standard error formula. Generate a population with known $\mu$ and $\sigma$. For sample sizes $n = 10, 25, 50, 100, 200, 500$: (a) Take 10,000 samples and compute the standard deviation of the sample means. (b) Compare this observed SE to the theoretical SE $\sigma / \sqrt{n}$. (c) Create a plot showing both the theoretical and observed SE curves. How closely do they match?

D.4. Daria's shooting simulation. Write Python code to simulate Sam's analysis of Daria's improvement.

(a) Assume Daria's true three-point percentage is $p = 0.31$ (no improvement). Simulate 10,000 sets of 65 attempts and record the proportion of makes in each. (b) Plot the sampling distribution of $\hat{p}$. Does it look approximately normal? (c) What proportion of simulations produced $\hat{p} \geq 0.38$? (d) Compare your simulated answer to the theoretical answer from the CLT (using the normal approximation). (e) Now repeat with $p = 0.35$ (slight improvement). How does the result change?
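A proportion simulation of this kind can be sketched with a binomial generator: each simulated set of attempts is one binomial draw, and dividing by the number of attempts gives $\hat{p}$. The values below ($p = 0.40$, 50 attempts, cutoff 0.50) are hypothetical and deliberately differ from the exercise. The normal comparison uses no continuity correction, which is an assumption of this sketch.

```python
import math
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so runs are repeatable

# Hypothetical values (not taken from the exercise).
p_true, n_attempts, reps = 0.40, 50, 10_000
threshold = 0.50

# Each binomial draw is the number of makes in one set of attempts.
makes = rng.binomial(n_attempts, p_true, size=reps)
p_hats = makes / n_attempts

# Share of simulated sets with p-hat at or above the threshold.
simulated_tail = float(np.mean(p_hats >= threshold))

# Normal approximation for the same tail (no continuity correction).
se = math.sqrt(p_true * (1.0 - p_true) / n_attempts)
normal_tail = 1.0 - NormalDist(mu=p_true, sigma=se).cdf(threshold)

print(f"Simulated P(p-hat >= {threshold}) = {simulated_tail:.4f}")
print(f"Normal approximation             = {normal_tail:.4f}")
```

The two numbers will not agree exactly, because $\hat{p}$ takes only discrete values while the normal curve is continuous. A continuity correction is one way to narrow the gap.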

D.5. The CLT doesn't fix bad sampling. This exercise demonstrates that the CLT requires random sampling.

(a) Create a population where the first half has values drawn from $N(30, 5)$ and the second half from $N(70, 5)$. (b) Take 5,000 "biased" samples of size 50 that preferentially sample from the first half (e.g., 90% of each sample comes from the first half). Compute the mean of each sample. (c) Take 5,000 truly random samples of size 50. Compute the mean of each. (d) Plot both sampling distributions. Does the CLT "fix" the biased sampling? Explain why or why not.

Part E: Critical Thinking and Applications ⭐⭐⭐

E.1. (Maya's Study Design) Dr. Maya Chen needs to estimate the average systolic blood pressure in her county. Prior research suggests $\sigma \approx 18$ mmHg.

(a) She wants the standard error to be at most 3 mmHg. What's the minimum sample size? (b) Her budget allows her to test 100 people. What standard error will she achieve? (c) A colleague suggests she could reduce variability by only studying adults aged 40-60. Would this affect $\sigma$, $n$, or both? How would it change the standard error? (d) Explain why reducing $\sigma$ (by studying a more homogeneous population) is an alternative to increasing $n$ for achieving a smaller standard error.

E.2. (Alex's A/B Test) Alex's A/B test has 2,000 users in each group. The control group's mean watch time is 48 minutes (SD = 22 minutes) and the treatment group's mean is 52 minutes (SD = 24 minutes).

(a) Calculate the standard error of the mean for each group. (b) The standard error of the difference in means is approximately $\sqrt{\text{SE}_1^2 + \text{SE}_2^2}$. Calculate it. (c) How many standard errors is the observed difference (4 minutes) from zero? (d) Based on your answer to (c), does the difference seem likely to be due to chance? (Use the rough guideline that a difference of more than 2 standard errors from zero is unlikely to arise by chance alone.)
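The two-group calculation can be sketched as follows, using the formula $\sqrt{\text{SE}_1^2 + \text{SE}_2^2}$ given in part (b). The function names and the inputs (SDs of 6 and 8, 100 observations per group, an observed difference of 2.5) are hypothetical and are not Alex's data.

```python
import math


def se_mean(sd: float, n: int) -> float:
    """Standard error of one group's mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def se_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Approximate SE of the difference between two independent group means."""
    return math.sqrt(se_mean(sd1, n1) ** 2 + se_mean(sd2, n2) ** 2)


# Hypothetical groups (not taken from the exercise).
se_diff = se_difference(sd1=6.0, n1=100, sd2=8.0, n2=100)
observed_diff = 2.5
n_standard_errors = observed_diff / se_diff

print(f"SE of the difference = {se_diff:.3f}")
print(f"Observed difference is {n_standard_errors:.2f} SEs from zero")
```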

E.3. (Professor Washington's Analysis) Professor Washington is analyzing recidivism rates. The national recidivism rate is 44%. In a sample of 200 defendants who went through a new rehabilitation program, 76 (38%) reoffended.

(a) What is the standard error for the proportion, assuming $p = 0.44$? (b) How many standard errors below 0.44 is the observed proportion of 0.38? (c) Is this result strong evidence that the program reduces recidivism? Why or why not? (d) If the observed proportion stayed at 0.38, what sample size would Washington need for that result to lie at least 2.5 standard errors below 0.44?

E.4. (Sample Size Planning) A researcher plans to survey college students about their weekly study hours. She expects $\sigma \approx 8$ hours.

(a) Create a table showing the standard error for $n = 25, 50, 100, 200, 400, 800$. (b) Plot SE vs. $n$. At what point does increasing $n$ yield "diminishing returns" in your judgment? (c) If each survey costs \$5 per student, what's the cost to achieve SE = 1 hour? SE = 0.5 hours? SE = 0.25 hours? (d) Write a brief paragraph advising the researcher on the tradeoff between precision and cost.

E.5. (Conceptual Integration) Consider two populations:
- Population A: Normal with $\mu = 50$, $\sigma = 10$
- Population B: Strongly right-skewed with $\mu = 50$, $\sigma = 10$

(a) For samples of size $n = 5$, will the sampling distributions of $\bar{x}$ be the same for both populations? Explain. (b) For samples of size $n = 100$, will they be approximately the same? Explain. (c) What does this tell you about when the population shape matters and when it doesn't?

Part M: Mixed Practice (Cumulative Review) ⭐⭐

M.1. A consumer group surveys 64 randomly selected packages of a cereal brand. The labeled weight is 16 oz. Their sample has $\bar{x} = 15.8$ oz with $s = 0.6$ oz.

(a) Calculate the estimated standard error. (b) How many standard errors below 16 oz is the sample mean? (c) If the company's filling process truly averages 16 oz, how unusual is this result? (d) What type of study is this — observational or experimental? (Connects to Ch.4)

M.2. The distribution of household incomes in a city is right-skewed with $\mu = \$65{,}000$ and $\sigma = \$30{,}000$.

(a) Can we use the normal distribution to find $P(X > \$100{,}000)$ for a randomly selected household? Why or why not? (Connects to Ch.10) (b) Can we use the normal distribution to find $P(\bar{x} > \$70{,}000)$ for a random sample of 100 households? Explain. (c) Calculate $P(\bar{x} > \$70{,}000)$ for $n = 100$.

M.3. Sam takes a random sample of 40 Riverside Raptors games and records total points scored. He gets $\bar{x} = 108$ and $s = 14$.

(a) What is the estimated standard error? (b) The league average is 103 points per game. How many standard errors above the league average is the Raptors' sample mean? (c) Is this result consistent with the Raptors being an average team? (Connects to Ch.6: z-scores) (d) What additional information would help Sam make a stronger conclusion? (Preview of Ch.13)

M.4. A website reports that its users spend an average of 7.2 minutes per visit. You suspect this is inflated because the estimate is based on only 12 visits.

(a) Explain why a small sample size alone doesn't make the estimate wrong — it makes it imprecise. (b) If the true standard deviation is 4 minutes, what is the SE for $n = 12$? For $n = 120$? (c) The website also doesn't describe how visits were selected. Why does this matter more than sample size? (Connects to Ch.4: sampling bias)

M.5. In Chapter 8, you learned that the probability of rolling doubles with two dice is $1/6 \approx 0.167$. You roll two dice 180 times and observe doubles 36 times ($\hat{p} = 0.200$).

(a) Calculate the standard error assuming $p = 1/6$. (b) How many standard errors from $p = 1/6$ is your observed proportion? (c) Is your result consistent with fair dice? (Connects to Ch.8: probability) (d) Verify the CLT conditions for proportions are met.