Quiz: Numerical Summaries — Center, Spread, and Shape
Test your understanding before moving on. Target: 70% or higher to proceed confidently.
Section 1: Multiple Choice (1 point each)
1. Which measure of center is most affected by an extreme outlier?
- A) Median
- B) Mode
- C) Mean
- D) All three are equally affected
Answer
**C)** Mean.

- *Why C:* The mean uses every value in its calculation (sum / count), so one extreme value can pull it dramatically. A single billionaire in a room of teachers would skew the mean income upward.
- *Why not A:* The median is the middle value when sorted. It is resistant to outliers because it depends on position, not magnitude.
- *Why not B:* The mode is simply the most frequent value. An extreme outlier is typically unique and won't affect which value appears most.
- *Why not D:* The median and mode are both resistant; only the mean is sensitive to outliers.

*Reference:* Section 6.1

2. A dataset has a mean of \$85,000 and a median of \$62,000. What is the most likely shape of this distribution?
- A) Symmetric
- B) Skewed left
- C) Skewed right
- D) Bimodal
Answer
**C)** Skewed right.

- *Why C:* When the mean is substantially larger than the median, a long tail on the right (high values) is pulling the mean above the median. This is the hallmark of a right-skewed distribution. Income data is the classic example.
- *Why not A:* In a symmetric distribution, the mean and median are approximately equal.
- *Why not B:* In a left-skewed distribution, the mean would be *less* than the median (pulled down by the left tail).
- *Why not D:* Bimodal distributions have two peaks, but this mean-median pattern specifically indicates right skew.

*Reference:* Section 6.2

3. What does the standard deviation measure?
- A) The distance between the maximum and minimum values
- B) The middle value of the dataset
- C) The typical distance of values from the mean
- D) The most common value in the dataset
Answer
**C)** The typical distance of values from the mean.

- *Why C:* Standard deviation is the square root of the average of squared deviations from the mean, so it captures how far a typical observation falls from the center.
- *Why not A:* That's the range (Max - Min).
- *Why not B:* That's the median.
- *Why not D:* That's the mode.

*Reference:* Section 6.5 (Threshold Concept)

4. The interquartile range (IQR) represents the spread of:
- A) All the data
- B) The upper 25% of the data
- C) The middle 50% of the data
- D) The lower 50% of the data
Answer
**C)** The middle 50% of the data.

- *Why C:* IQR = Q3 - Q1. Q1 marks the 25th percentile and Q3 marks the 75th percentile, so the distance between them captures the range of the middle 50% of observations.
- *Why not A:* The range (Max - Min) captures the spread of all the data.
- *Why not B:* The upper 25% goes from Q3 to the Max.
- *Why not D:* The lower 50% goes from the Min to the Median.

*Reference:* Section 6.4

5. In the formula for sample variance, why do we divide by $n - 1$ instead of $n$?
- A) Because one data point is always lost during collection
- B) To correct for the tendency of the sample variance to underestimate the population variance
- C) Because the mean counts as one of the data points
- D) To make the calculation easier
Answer
**B)** To correct for the tendency of the sample variance to underestimate the population variance.

- *Why B:* When we use the sample mean $\bar{x}$ instead of the true population mean $\mu$, the sum of squared deviations is systematically smaller than it would be using $\mu$. Dividing by $n - 1$ compensates for this underestimate, making $s^2$ an unbiased estimator of $\sigma^2$.
- *Why not A:* No data point is "lost"; we use all $n$ values.
- *Why not C:* The mean is calculated from the data, but this isn't about counting it as a data point.
- *Why not D:* Dividing by $n$ would actually be simpler arithmetic.

*Reference:* Section 6.5

6. According to the Empirical Rule, approximately what percentage of data falls within 2 standard deviations of the mean in a bell-shaped distribution?
- A) 68%
- B) 75%
- C) 95%
- D) 99.7%
Answer
**C)** 95%.

- *Why C:* The Empirical Rule states: 68% within 1 SD, 95% within 2 SDs, and 99.7% within 3 SDs.
- *Why not A:* 68% is within 1 standard deviation.
- *Why not B:* 75% is the minimum guaranteed by Chebyshev's inequality (for any distribution within 2 SDs), but the Empirical Rule gives a more precise answer for bell-shaped distributions.
- *Why not D:* 99.7% is within 3 standard deviations.

*Reference:* Section 6.7

7. A z-score of -1.5 means:
- A) The value is 1.5 times the mean
- B) The value is 1.5 standard deviations below the mean
- C) The value is 1.5 standard deviations above the mean
- D) The value is 1.5 less than the standard deviation
Answer
**B)** The value is 1.5 standard deviations below the mean.

- *Why B:* The z-score formula is $z = (x - \bar{x})/s$. A negative z-score means the value is below the mean. The magnitude (1.5) tells you how many standard deviations away.
- *Why not A:* Z-scores measure distance in standard deviation units, not multiples of the mean.
- *Why not C:* A positive z-score would be above the mean; the negative sign means below.
- *Why not D:* Z-scores are not about subtracting from the standard deviation.

*Reference:* Section 6.8

8. Which of the following is a resistant measure of spread?
- A) Range
- B) Standard deviation
- C) Variance
- D) IQR
Answer
**D)** IQR.

- *Why D:* The IQR is based on Q1 and Q3, which are positions in the sorted data, so it's not affected by extreme values at either end. It's resistant.
- *Why not A:* The range depends entirely on the two most extreme values, so one outlier can dramatically change it.
- *Why not B:* Standard deviation uses the mean in its calculation and squares deviations, giving extra weight to extreme values. Not resistant.
- *Why not C:* Variance is the square of standard deviation, so it is equally non-resistant.

*Reference:* Section 6.4

9. Using the IQR method, a value is flagged as a potential outlier if it falls:
- A) More than 1 standard deviation from the mean
- B) Beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR
- C) Beyond the minimum or maximum
- D) More than 2 × IQR from the median
Answer
**B)** Beyond Q1 - 1.5 × IQR or Q3 + 1.5 × IQR.

- *Why B:* This is Tukey's fence method, used by box plots to flag potential outliers. The "fences" are set at 1.5 IQR below Q1 and 1.5 IQR above Q3.
- *Why not A:* That's too lenient: roughly 32% of data falls beyond 1 SD in a bell-shaped distribution.
- *Why not C:* By definition, no values can be beyond the minimum or maximum.
- *Why not D:* The IQR method uses Q1 and Q3, not the median, and multiplies by 1.5, not 2.

*Reference:* Section 6.9

10. A box plot shows a box from 30 to 50 with the median line at 45. The left whisker extends to 15, and the right whisker extends to 55. There are two dots beyond the right whisker at 70 and 85. Which statement is true?
- A) The distribution is perfectly symmetric
- B) The distribution appears skewed right with two outliers
- C) The mean is definitely equal to 45
- D) The IQR is 35
Answer
**B)** The distribution appears skewed right with two outliers.

- *Why B:* The median (45) is closer to Q3 (50) than to Q1 (30), so the box itself is not symmetric. The two dots beyond the right whisker are drawn as individual points, which is how a box plot marks outliers. Because both sit far out on the right with nothing comparable on the left, the overall picture is a long right tail.
- *Why not A:* The median is not centered in the box, and there are outliers on the right only.
- *Why not C:* Box plots show the median, not the mean. The outliers on the right would likely pull the mean above 45.
- *Why not D:* The IQR is Q3 - Q1 = 50 - 30 = 20, not 35.

*Reference:* Section 6.6

Section 2: Short Answer (2 points each)
11. The ages of 7 people in a study group are: 19, 20, 20, 21, 22, 23, 45. Calculate the mean and median. Which is a better measure of center for this group? Explain.
Answer
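As a quick check of the arithmetic for this question, here is a minimal Python sketch using the standard-library `statistics` module. The variable names are illustrative, and the "without the outlier" comparison is an added illustration, not part of the question.

```python
from statistics import mean, median

ages = [19, 20, 20, 21, 22, 23, 45]

age_mean = mean(ages)      # sum of the values divided by the count (170 / 7)
age_median = median(ages)  # middle (4th) value of the sorted list

# Dropping the 45-year-old shows how much one value moves each measure.
without_outlier = [a for a in ages if a != 45]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(f"all seven:  mean = {age_mean:.1f}, median = {age_median}")
print(f"without 45: mean = {mean_without:.1f}, median = {median_without}")
```

Removing the one extreme value shifts the mean by several years but the median by only half a year, which is what "resistant" means in practice.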
**Mean:** $(19 + 20 + 20 + 21 + 22 + 23 + 45) / 7 = 170 / 7 \approx 24.3$

**Median:** The 4th value in the sorted list = **21**

The **median (21)** is a better measure of center. The 45-year-old is an outlier that pulls the mean up to 24.3, but six of the seven people are between 19 and 23. The median of 21 is more representative of the "typical" person in the group, whereas the mean is inflated by the single older member.

*Reference:* Sections 6.1, 6.2

12. A dataset has a mean of 100 and a standard deviation of 20. Using the Empirical Rule (assuming a bell-shaped distribution), find the interval that contains approximately 95% of the data.
Answer
95% of data falls within 2 standard deviations of the mean:

$\bar{x} \pm 2s = 100 \pm 2(20) = 100 \pm 40$

**The interval is 60 to 140.** Approximately 95% of values in this distribution fall between 60 and 140.

*Reference:* Section 6.7

13. Calculate the z-score for a value of 72 in a distribution with mean 80 and standard deviation 5. Interpret the result.
Answer
$z = \frac{x - \bar{x}}{s} = \frac{72 - 80}{5} = \frac{-8}{5} = -1.6$

**Interpretation:** The value of 72 is **1.6 standard deviations below the mean**. In a bell-shaped distribution, this is within 2 standard deviations, so it's somewhat below average but not extremely unusual. Roughly speaking, about 5-6% of values would be this far below the mean or further.

*Reference:* Section 6.8

14. The five-number summary for a dataset is: Min = 5, Q1 = 20, Median = 35, Q3 = 50, Max = 95. Use the IQR method to determine whether the maximum value (95) is an outlier.
Answer
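The fence calculation for this question can be sketched in a few lines of Python. The helper name `is_outlier` is hypothetical, and the strict inequalities encode the assumption that a value must fall beyond a fence, not on it, to be flagged.

```python
q1, q3 = 20, 50
maximum = 95

iqr = q3 - q1                  # spread of the middle 50%
lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

def is_outlier(value):
    """Flag a value only if it lies strictly beyond a fence."""
    return value < lower_fence or value > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"Is {maximum} an outlier? {is_outlier(maximum)}")
```

Swapping `>` for `>=` would flag boundary values instead, so it is worth checking which convention a given textbook or software package uses.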
**Step 1:** IQR = Q3 - Q1 = 50 - 20 = 30

**Step 2:** Upper fence = Q3 + 1.5 × IQR = 50 + 1.5(30) = 50 + 45 = **95**

**Step 3:** The maximum value is 95, which falls exactly *on* the upper fence, not beyond it.

**Conclusion:** The maximum of 95 is **not an outlier** by the strict IQR method (an outlier must fall *beyond* the fence). However, it's right at the boundary: any value above 95 would be flagged.

*Reference:* Section 6.9

15. Explain why you should report the median and IQR (instead of mean and standard deviation) for a dataset of home prices in a major city.
Answer
Home prices are typically **skewed right**: most homes are in a moderate price range, but a few luxury properties have extremely high prices. These extreme values pull the mean upward, making it larger than the median and unrepresentative of a "typical" home price. The standard deviation is also inflated by extreme values.

The **median** resists the influence of luxury homes and represents the middle price, which is what a typical buyer would encounter. The **IQR** measures the spread of the middle 50% of prices, ignoring the extremes. Together, they provide a more accurate picture of the housing market for most buyers.

This is the same logic from the income example in Section 6.2: for skewed data, resistant measures (median, IQR) beat non-resistant measures (mean, standard deviation).

*Reference:* Sections 6.2, 6.11

Section 3: Applied Problems (3 points each)
16. Sam Okafor is comparing three players' scoring consistency. Here are their last 8 games:
| Game | Player A | Player B | Player C |
|---|---|---|---|
| 1 | 20 | 15 | 30 |
| 2 | 22 | 18 | 10 |
| 3 | 19 | 20 | 25 |
| 4 | 21 | 22 | 5 |
| 5 | 20 | 25 | 35 |
| 6 | 23 | 20 | 15 |
| 7 | 18 | 28 | 20 |
| 8 | 21 | 12 | 20 |
- (a) Calculate the mean for each player.
- (b) Calculate the standard deviation for each player (you may use a calculator or Python).
- (c) Which player is most consistent? Which is least consistent?
- (d) If the team needs a reliable 20 points, which player should Sam recommend? Why?
Answer
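Since the question allows Python for part (b), here is one possible sketch using the standard-library `statistics` module. It assumes the sample standard deviation (dividing by $n - 1$), which is what `statistics.stdev` computes; the dictionary layout and variable names are illustrative.

```python
from statistics import mean, stdev

scores = {
    "Player A": [20, 22, 19, 21, 20, 23, 18, 21],
    "Player B": [15, 18, 20, 22, 25, 20, 28, 12],
    "Player C": [30, 10, 25, 5, 35, 15, 20, 20],
}

# (mean, sample standard deviation) for each player
summary = {name: (mean(pts), stdev(pts)) for name, pts in scores.items()}

for name, (m, s) in summary.items():
    print(f"{name}: mean = {m:.1f}, s = {s:.2f}")

# Smallest SD = most consistent; largest SD = least consistent.
most_consistent = min(summary, key=lambda name: summary[name][1])
least_consistent = max(summary, key=lambda name: summary[name][1])
print(f"most consistent: {most_consistent}, least consistent: {least_consistent}")
```

Using `statistics.pstdev` instead would divide by $n$ and give slightly smaller values, but the ranking of the three players would not change.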
**(a) Means:**

- Player A: $(20+22+19+21+20+23+18+21)/8 = 164/8 = 20.5$
- Player B: $(15+18+20+22+25+20+28+12)/8 = 160/8 = 20.0$
- Player C: $(30+10+25+5+35+15+20+20)/8 = 160/8 = 20.0$

**(b) Standard deviations** (sample, dividing by $n - 1 = 7$):

- Player A: $s = \sqrt{18/7} \approx 1.60$
- Player B: $s = \sqrt{186/7} \approx 5.15$
- Player C: $s = \sqrt{700/7} = 10.00$

**(c)** Player A is most consistent (smallest SD = 1.60). Player C is least consistent (largest SD = 10.00).

**(d)** Sam should recommend **Player A**. All three players average about 20 points, but Player A's standard deviation of 1.60 means nearly every game will be between about 18 and 23 points, very close to the 20-point target. Player C might score 30+ or might score only 5, making the outcome unpredictable. This illustrates the key insight: the mean tells you the expected performance, but the standard deviation tells you how reliable that expectation is.

*Reference:* Sections 6.5, 6.11

17. Dr. Maya Chen collects blood pressure readings (systolic, mmHg) from 200 patients. The distribution is approximately bell-shaped with a mean of 120 mmHg and a standard deviation of 15 mmHg.
- (a) What range of blood pressures includes about 68% of patients?
- (b) What percentage of patients have systolic blood pressure above 150 mmHg?
- (c) A patient has a blood pressure of 165 mmHg. Calculate the z-score. How unusual is this reading?
- (d) If Maya considers any reading with $|z| > 2$ to be "clinically concerning," what blood pressure range does she consider normal?
Answer
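The four parts of this question all come down to z-scores and the Empirical Rule, which can be sketched in Python as follows. The function names and the `WITHIN` lookup table are illustrative, and the tail shares are Empirical Rule approximations that assume a symmetric, bell-shaped distribution.

```python
mean_bp, sd_bp = 120, 15

# Empirical Rule: approximate share of data within k SDs of the mean
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def z_score(x):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean_bp) / sd_bp

def interval(k):
    """Endpoints of mean +/- k standard deviations."""
    return (mean_bp - k * sd_bp, mean_bp + k * sd_bp)

def upper_tail(k):
    """Approximate share of data more than k SDs above the mean."""
    return (1 - WITHIN[k]) / 2

range_68 = interval(1)              # part (a)
share_above_150 = upper_tail(2)     # part (b): 150 is 2 SDs above the mean
z_165 = z_score(165)                # part (c)
share_above_165 = upper_tail(3)     # part (c): 165 is 3 SDs above the mean
normal_range = interval(2)          # part (d): |z| <= 2

print(f"(a) {range_68}")
print(f"(b) {share_above_150:.1%}")
print(f"(c) z = {z_165}, upper tail = {share_above_165:.2%}")
print(f"(d) {normal_range}")
```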
**(a)** $\bar{x} \pm 1s = 120 \pm 15$, so the range is **105 to 135 mmHg**.

**(b)** 150 mmHg is 2 standard deviations above the mean ($z = (150-120)/15 = 2.0$). By the Empirical Rule, about 95% of data falls within 2 SDs, so about 5% is beyond. Since the distribution is symmetric, about **2.5%** are above 150 mmHg.

**(c)** $z = (165 - 120)/15 = 45/15 = 3.0$. This is **3 standard deviations above the mean**, which is very unusual. By the Empirical Rule, only about 0.15% of patients would have readings this high or higher. This is almost certainly clinically significant.

**(d)** The "normal" range with $|z| \leq 2$ is $120 \pm 2(15) = 120 \pm 30$, or **90 to 150 mmHg**. Readings below 90 or above 150 would be flagged as clinically concerning.

*Reference:* Sections 6.7, 6.8

18. Alex Rivera's team collects data on user satisfaction scores (scale 1-100) for two versions of the StreamVibe app:
- Version A: Mean = 72, SD = 8, Min = 45, Q1 = 67, Median = 73, Q3 = 78, Max = 95
- Version B: Mean = 72, SD = 15, Min = 20, Q1 = 62, Median = 74, Q3 = 84, Max = 100
- (a) Both versions have the same mean. Are they equally good? Explain using the other statistics.
- (b) Which version's five-number summary suggests a more skewed distribution?
- (c) A user rates Version A at 55 and another rates Version B at 55. Calculate z-scores for each. Which rating is more unusual relative to its version?
- (d) Alex's team wants the most consistently positive experience. Which version should they choose?
Answer
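The comparisons in this question use only the summary statistics given, so they can be reproduced with a short Python sketch. The dictionary keys and the `describe` helper are hypothetical names chosen for illustration.

```python
versions = {
    "A": {"mean": 72, "sd": 8, "min": 45, "q1": 67, "median": 73, "q3": 78, "max": 95},
    "B": {"mean": 72, "sd": 15, "min": 20, "q1": 62, "median": 74, "q3": 84, "max": 100},
}

def describe(v, rating=55):
    """Derived quantities used to compare spread, shape, and one rating."""
    return {
        "iqr": v["q3"] - v["q1"],             # spread of the middle 50%
        "lower_tail": v["q1"] - v["min"],     # distance from Min to Q1
        "upper_tail": v["max"] - v["q3"],     # distance from Q3 to Max
        "q1_to_median": v["median"] - v["q1"],
        "median_to_q3": v["q3"] - v["median"],
        "z": (rating - v["mean"]) / v["sd"],  # z-score of the given rating
    }

results = {name: describe(v) for name, v in versions.items()}

for name, r in results.items():
    print(f"Version {name}: IQR = {r['iqr']}, "
          f"tails = ({r['lower_tail']}, {r['upper_tail']}), z(55) = {r['z']:.3f}")
```

A lower tail much longer than the upper tail, as in Version B, is one sign of left skew; comparing the two `z` values shows which version the same rating of 55 is more unusual for.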
**(a)** Despite identical means, the versions are quite different. Version A has a smaller standard deviation (8 vs. 15), meaning scores are more tightly clustered around the mean, so users have a more predictable experience. Version B has much wider spread: some users love it (100) but some are very dissatisfied (20).

**(b)** **Version B** shows more skew, and the skew is to the left. Its mean (72) is slightly below its median (74), the distance from Q1 to the median (12) is larger than from the median to Q3 (10), and the distance from the Min to Q1 (42) is much longer than from Q3 to the Max (16). Version A is more symmetric (median of 73 is very close to the mean of 72).

**(c)**

- Version A: $z = (55 - 72)/8 = -17/8 \approx -2.13$
- Version B: $z = (55 - 72)/15 = -17/15 \approx -1.13$

The rating of 55 is more unusual for **Version A** ($|z| \approx 2.13$, beyond 2 SD) than for Version B ($|z| \approx 1.13$, within 2 SD).

**(d)** **Version A.** Its smaller standard deviation (8 vs. 15) and narrower IQR (11 vs. 22) indicate users consistently have positive experiences. With Version A, most users score between 64 and 80 (within 1 SD of the mean). With Version B, the experience is a gamble.

*Reference:* Sections 6.5, 6.6, 6.8, 6.11

Section 4: True/False with Explanation (1 point each)
19. True or False: If a dataset has a standard deviation of 0, all values in the dataset are the same.
Answer
**True.** A standard deviation of 0 means every deviation from the mean is 0; that is, $x_i - \bar{x} = 0$ for every value. This is only possible if every value equals the mean, which means all values are identical. There is no variation whatsoever.

*Reference:* Section 6.5

20. True or False: The Empirical Rule can be applied to any dataset, regardless of its shape.