Chapter 19 Quiz: Descriptive Statistics

Contributors to Introduction to Data Science

Instructions: This quiz tests your understanding of Chapter 19. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 100.

Section 1: Multiple Choice (10 questions, 4 points each)

Question 1. A dataset of employee salaries has a mean of $85,000 and a median of $62,000. What does this tell you about the distribution's shape?

(A) The distribution is symmetric
(B) The distribution is left-skewed
(C) The distribution is right-skewed
(D) The distribution is bimodal

Answer

**Correct: (C)** When the mean is substantially higher than the median, the distribution is right-skewed. A few very high salaries (executives, senior leadership) are pulling the mean to the right while the median — unaffected by extreme values — reflects the "typical" employee's salary. This is the classic pattern for income data.
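As a small illustration of this pattern, the sketch below uses a made-up list of salaries (hypothetical values, not from the chapter) in which two large values pull the mean well above the median.

```python
from statistics import mean, median

# Hypothetical salaries in dollars: most are moderate, two are very high.
salaries = [48_000, 52_000, 55_000, 60_000, 62_000,
            64_000, 70_000, 250_000, 400_000]

salary_mean = mean(salaries)
salary_median = median(salaries)

print(f"Mean:   {salary_mean:,.0f}")
print(f"Median: {salary_median:,.0f}")

# A mean well above the median is the usual signature of right skew.
print("Mean > median:", salary_mean > salary_median)
```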

Question 2. Which of the following is the BEST description of standard deviation?

(A) The difference between the largest and smallest values
(B) The middle value of the dataset
(C) A measure of how far, on average, values are from the mean
(D) The value that appears most frequently in the data

Answer

**Correct: (C)** Standard deviation measures the average distance of values from the mean. Technically, it's the square root of the average squared deviation, but the intuition is "how far a typical value is from the center." (A) describes the range, (B) the median, and (D) the mode.
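To make the "square root of the average squared deviation" concrete, here is a minimal sketch on illustrative data (the list is an assumption chosen so the arithmetic is clean). It also computes the literal average absolute distance, which shows that the "average distance" intuition is close but not identical to the standard deviation.

```python
from math import sqrt
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from the chapter

center = mean(values)

# Population standard deviation, built up from its definition.
squared_deviations = [(x - center) ** 2 for x in values]
std_by_hand = sqrt(mean(squared_deviations))

# Literal "average distance from the mean" (mean absolute deviation).
mean_abs_dev = mean(abs(x - center) for x in values)

print(std_by_hand)      # 2.0
print(pstdev(values))   # same value from the standard library
print(mean_abs_dev)     # 1.5
```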

Question 3. A dataset has Q1 = 20, Q3 = 50. Using the IQR fence method, what are the lower and upper fences?

(A) Lower = -25, Upper = 95
(B) Lower = -10, Upper = 80
(C) Lower = 5, Upper = 65
(D) Lower = 20, Upper = 50

Answer

**Correct: (A)** IQR = Q3 - Q1 = 50 - 20 = 30. Lower fence = Q1 - 1.5 * IQR = 20 - 45 = -25. Upper fence = Q3 + 1.5 * IQR = 50 + 45 = 95. Any value below -25 or above 95 would be flagged as an outlier.
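The fence arithmetic can be sketched as a small helper. The function name `iqr_fences` and the parameter `k` are illustrative choices, not something defined in the chapter.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(20, 50)
print(lower, upper)  # -25.0 95.0
```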

Question 4. When is the median a better measure of center than the mean?

(A) When the data is symmetric and has no outliers
(B) When the data is categorical
(C) When the data is skewed or contains outliers
(D) When the sample size is very large

Answer

**Correct: (C)** The median is *robust* — it's not affected by extreme values. When data is skewed (like income, home prices, or response times) or contains outliers, the mean gets pulled toward the extreme values and no longer represents the "typical" value. The median stays anchored at the middle. (A) describes when the mean works fine. (B) is wrong — for categorical data, you'd use the mode, not the median (unless it's ordinal). (D) is irrelevant to the choice.

Question 5. A student's z-score on an exam is -1.5. This means:

(A) The student scored 1.5 points below the class average
(B) The student scored 1.5 standard deviations below the class mean
(C) The student scored in the bottom 1.5% of the class
(D) The student's score was 1.5 times the class average

Answer

**Correct: (B)** A z-score of -1.5 means the student's score is 1.5 standard deviations below the mean. Z-scores measure distance from the mean in units of standard deviations, not in raw points (A) or percentages (C). If the mean was 80 and the SD was 10, a z-score of -1.5 means the student scored 80 - 1.5(10) = 65.
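The conversion between z-scores and raw scores runs in both directions. The helper names below are hypothetical, and the mean of 80 and SD of 10 are the example values from the explanation above.

```python
def z_score(value, mean, sd):
    """Distance from the mean, in units of standard deviations."""
    return (value - mean) / sd

def raw_score(z, mean, sd):
    """Invert a z-score back to the original scale."""
    return mean + z * sd

print(raw_score(-1.5, mean=80, sd=10))  # 65.0
print(z_score(65, mean=80, sd=10))      # -1.5
```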

Question 6. What is the IQR?

(A) The range of the entire dataset (max - min)
(B) The range of the middle 50% of the data (Q3 - Q1)
(C) The average distance from the median
(D) The difference between the mean and the median

Answer

**Correct: (B)** The Interquartile Range (IQR) is Q3 - Q1 — the range of the middle 50% of the data. It's a robust measure of spread that ignores the top 25% and bottom 25%, making it resistant to outliers. (A) describes the range, (C) describes something like the mean absolute deviation, and (D) is not a standard measure.
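A rough sketch of that robustness, using made-up data: replacing the largest value with an extreme one changes the range dramatically but leaves the IQR alone. It assumes the standard library's `statistics.quantiles` with `method="inclusive"`, which uses linear interpolation between data points.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # same data, one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    data_range = max(data) - min(data)
    print(f"{name:>12}: range = {data_range}, IQR = {iqr(data)}")
```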

Question 7. You're analyzing a dataset and notice it has two distinct peaks (modes). This distribution is called:

(A) Uniform
(B) Normal
(C) Bimodal
(D) Skewed

Answer

**Correct: (C)** A bimodal distribution has two distinct peaks. This often indicates that the data contains two distinct subgroups mixed together. For example, measuring heights of both NBA players and average adults would produce a bimodal distribution. Reporting a single mean for bimodal data is usually misleading.

Question 8. Which pandas method gives you the five-number summary plus the mean and standard deviation?

(A) .info()
(B) .describe()
(C) .summary()
(D) .value_counts()

Answer

**Correct: (B)** The `.describe()` method returns count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), and max — which includes the five-number summary plus the mean and standard deviation. `.info()` shows column types and non-null counts. `.summary()` doesn't exist in pandas. `.value_counts()` shows frequency of each unique value.

Question 9. If you add a constant value (say, 10) to every data point in a dataset, which of the following changes?

(A) The mean changes, but the standard deviation does not
(B) Both the mean and standard deviation change
(C) Neither the mean nor the standard deviation changes
(D) The standard deviation changes, but the mean does not

Answer

**Correct: (A)** Adding a constant shifts every value (and therefore the mean) by that constant, but it doesn't change how spread out the values are — the distances between values stay the same. So the mean increases by 10, but the standard deviation, variance, and IQR remain unchanged. This is a key property of these measures.
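This shift property is easy to check numerically. The data below is an arbitrary illustrative list.

```python
from statistics import mean, pstdev

data = [12, 15, 11, 18, 14]          # illustrative values
shifted = [x + 10 for x in data]     # add the same constant to every point

print(mean(data), mean(shifted))     # the mean moves by exactly 10
print(pstdev(data), pstdev(shifted)) # the spread is unchanged
```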

Question 10. Anscombe's Quartet demonstrates that:

(A) The mean is always the best measure of center
(B) Summary statistics alone can miss important patterns in data — always plot
(C) Outliers should always be removed before analysis
(D) The normal distribution is the most common shape in nature

Answer

**Correct: (B)** Anscombe's Quartet consists of four datasets with nearly identical means, standard deviations, correlations, and regression lines — but completely different scatter plots. The lesson is that summary statistics can hide crucial structure (nonlinearity, outliers, clusters), and you should always visualize your data alongside computing statistics.
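As a partial illustration, the sketch below compares the first two datasets of the quartet, using the values as they are commonly reproduced in statistics references (both share the same x values). The means and standard deviations agree to about two decimal places, yet listing y in order of x shows one series rising roughly steadily while the other rises and then falls.

```python
from statistics import mean, stdev

# Anscombe's datasets I and II (shared x values), as commonly reproduced.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in [("I", y1), ("II", y2)]:
    print(f"Dataset {name}: mean = {mean(y):.2f}, std = {stdev(y):.2f}")

# Same summaries, different shapes: look at y ordered by x.
order = sorted(range(len(x)), key=lambda i: x[i])
print("I :", [y1[i] for i in order])
print("II:", [y2[i] for i in order])
```

A scatter plot would make the difference obvious at a glance, which is the point of the exercise.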

Section 2: True/False (3 questions, 5 points each)

Question 11. True or False: The variance is always non-negative (zero or positive).

Answer

**True.** Variance is calculated as the average of squared deviations from the mean. Since squaring any number produces a non-negative result, the average of squared values must also be non-negative. Variance equals zero only when all values are identical.
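A quick check of both claims, using illustrative lists: identical values give a variance of zero, and data made entirely of negative numbers still has a positive variance.

```python
from statistics import pvariance

identical = [7, 7, 7, 7]       # no spread at all
negatives = [-10, -20, -30]    # negative data, but squared deviations are positive

print(pvariance(identical))
print(pvariance(negatives))
```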

Question 12. True or False: For right-skewed data, the mean is typically less than the median.

Answer

**False.** For right-skewed data, the mean is typically *greater than* the median. The long tail to the right pulls the mean upward. The correct relationship is: left-skewed: mean < median; symmetric: mean ≈ median; right-skewed: mean > median.

Question 13. True or False: The mode is the only measure of center that can be used with categorical data.

Answer

**True**, for nominal categorical data. You can find the most frequent category (mode), but computing a mean or median of unordered categories like "red, blue, green" is meaningless. Ordinal data (like "low, medium, high") has a natural order, so a median can also be defined there, but the mode is the one measure that always applies. For nominal categorical data, the mode is the only meaningful measure of center.
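For example, the standard library's `statistics.mode` works directly on labels (the color list here is made up):

```python
from statistics import mode

colors = ["red", "blue", "green", "red", "blue", "red"]  # nominal data

most_common = mode(colors)
print(most_common)  # red
```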

Section 3: Short Answer (3 questions, 5 points each)

Question 14. Explain in your own words what the "five-number summary" is and why it's more informative than just reporting the mean and standard deviation.

Answer

The five-number summary consists of the minimum, Q1 (25th percentile), median, Q3 (75th percentile), and maximum. It's more informative than mean + standard deviation because: (1) it shows the full range of the data (min to max), (2) it identifies where the middle 50% falls (Q1 to Q3), (3) its quartiles (Q1, median, Q3) are robust to outliers, unlike the mean and SD, although the min and max themselves are not, and (4) it reveals asymmetry: if the median is closer to Q1 than to Q3, the distribution is likely right-skewed. The five-number summary gives you the skeleton of the distribution's shape, while mean + SD give only a center and a spread and say nothing about asymmetry.
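A minimal sketch of computing the summary with the standard library follows. The helper name `five_number_summary` is hypothetical, the data is the list from Question 19, and `method="inclusive"` is chosen because it interpolates linearly between data points.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

data = [10, 20, 20, 30, 40, 50, 50, 50, 60, 1000]
print(five_number_summary(data))  # (10, 22.5, 45.0, 50.0, 1000)
```

Here the three quartiles stay in the 20s to 50s even though the maximum is 1000, which illustrates which parts of the summary resist an outlier and which do not.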

Question 15. A colleague says: "This data point is an outlier, so I'm going to delete it." What questions should you ask before agreeing?

Answer

Key questions: (1) **Why is it an outlier?** Is it a data entry error, a measurement malfunction, or a genuine extreme observation? (2) **What caused it?** Is there a known explanation (e.g., a special event, a different population)? (3) **Does removing it change the conclusions?** Run the analysis with and without it. (4) **Is it interesting?** Sometimes outliers are the most important data points — they might reveal something that the "normal" data doesn't show. You should only remove outliers when you have a clear justification (error, different population) — not simply because they're inconvenient.
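Point (3) can be sketched as a simple with-and-without comparison. The values are borrowed from the Question 20 dataset, and treating 50 as the suspect point is an assumption for illustration.

```python
from statistics import mean, median

data = [12, 14, 11, 13, 15, 14, 13, 12, 14, 50]
suspect = 50
without = [x for x in data if x != suspect]

# Compare the summaries both ways before deciding anything.
print(f"With:    mean = {mean(data):.2f}, median = {median(data)}")
print(f"Without: mean = {mean(without):.2f}, median = {median(without)}")
```

The mean moves by several points while the median barely changes, so any conclusion that rests on the mean depends heavily on this one observation.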

Question 16. Explain what skewness measures and give one real-world example of a right-skewed distribution and one of a left-skewed distribution.

Answer

Skewness measures the asymmetry of a distribution — how lopsided it is. A skewness of zero means symmetric; positive means right-skewed (long tail to the right); negative means left-skewed (long tail to the left). Right-skewed example: Household income — most people earn moderate incomes, but a few earn extremely high amounts, creating a long right tail. Left-skewed example: Age at death in developed countries — most people live to old age (clustering at the right), but some die young (creating a left tail).
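One common way to compute skewness is the moment-based formula, the third central moment divided by the 1.5 power of the second. The sketch below implements that definition on made-up data. Library functions may apply a small-sample adjustment, so their results can differ slightly from this version.

```python
from statistics import mean

def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    center = mean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - center) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail to the right
left_tailed = [-x for x in right_tailed]        # mirror image
symmetric = [1, 2, 3, 4, 5]

print(f"{skewness(right_tailed):+.2f}")  # positive
print(f"{skewness(left_tailed):+.2f}")   # negative
print(f"{skewness(symmetric):+.2f}")     # zero
```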

Section 4: Applied Scenarios (2 questions, 7.5 points each)

Question 17. Priya is analyzing NBA player salaries. She computes the following statistics:

Mean salary: $8.2 million
Median salary: $4.1 million
Standard deviation: $9.5 million
IQR: $8.8 million
Skewness: +2.3

(a) What shape is the salary distribution? (b) Which measure of center should she report, and why? (c) Which measure of spread should she report, and why? (d) A rookie earns $1.2 million. Is this an outlier? How would you check?

Answer

(a) The distribution is strongly right-skewed (skewness = +2.3, and the mean is double the median). A few superstar players earn tens of millions, pulling the tail rightward. (b) She should report the **median** ($4.1M). The mean ($8.2M) is inflated by a few max-contract players and doesn't represent what a "typical" NBA player earns. (c) She should report the **IQR** ($8.8M). The standard deviation ($9.5M) is inflated by the same extreme values. The IQR gives a better sense of the spread of the middle 50% of salaries. (d) To check, use the IQR method: a low outlier is any value below Q1 - 1.5 * IQR. The exact Q1 isn't given, but Q1 cannot exceed the median ($4.1M), so the lower fence is at most 4.1 - 1.5(8.8) = 4.1 - 13.2 = -$9.1M. Salaries can't be negative, so no salary can be a low-end outlier by this rule. A $1.2M salary is not an outlier; many rookies and bench players earn near the league minimum.
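Part (d) can be checked without knowing Q1 exactly, because Q1 can never be larger than the median. The sketch below uses only the figures given in the question to put an upper bound on the lower fence; the variable names are illustrative.

```python
median_salary = 4.1  # $ millions, from the question
iqr = 8.8            # $ millions, from the question
rookie = 1.2         # $ millions

# Q1 <= median, so the lower fence (Q1 - 1.5 * IQR) is at most:
max_lower_fence = median_salary - 1.5 * iqr
print(f"Lower fence is at most {max_lower_fence:.1f}")  # -9.1

# The rookie salary is far above any possible lower fence.
is_low_outlier = rookie < max_lower_fence
print("Low-end outlier:", is_low_outlier)  # False
```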

Question 18. Elena has vaccination rate data for 50 countries. She computes .describe() and sees:

count     50.000000
mean      71.400000
std       18.200000
min       23.000000
25%       58.000000
50%       74.000000
75%       87.000000
max       98.000000

(a) Is the distribution likely symmetric, right-skewed, or left-skewed? How can you tell? (b) Compute the IQR and the lower fence. Are there likely outliers on the low end? (c) The mean (71.4) is below the median (74.0). What does this confirm about the shape? (d) Elena wants to tell her supervisor "the typical vaccination rate is about X%." What value should X be, and why?

Answer

(a) **Left-skewed.** The median (74) is greater than the mean (71.4), and the distance from min to median (51 points) is much greater than from median to max (24 points). The left tail is longer. (b) IQR = 87 - 58 = 29. Lower fence = 58 - 1.5(29) = 58 - 43.5 = 14.5. The minimum value is 23, which is above 14.5, so technically no outliers by this method. But 23% is quite far from the bulk of the data and worth investigating. (c) Mean < Median confirms left skew. The few countries with very low vaccination rates (like the 23% minimum) are pulling the mean downward. (d) Elena should report the **median of 74%**. Since the distribution is left-skewed, the median better represents the "typical" country. She could say: "The typical vaccination rate is about 74%, with half of countries between 58% and 87%."
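The arithmetic in parts (a) and (b) can be reproduced directly from the `.describe()` values shown in the question:

```python
# Values copied from the .describe() output in the question.
minimum, q1, med, q3, maximum = 23.0, 58.0, 74.0, 87.0, 98.0

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("Low-end outliers possible:", minimum < lower_fence)

# Tail lengths on each side of the median hint at the skew direction.
print("Left tail:", med - minimum, "Right tail:", maximum - med)
```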

Section 5: Code Analysis (2 questions, 5 points each)

Question 19. What is the output of the following code?

import numpy as np

data = [10, 20, 20, 30, 40, 50, 50, 50, 60, 1000]
print(f"Mean: {np.mean(data):.1f}")
print(f"Median: {np.median(data):.1f}")
print(f"Std (ddof=0): {np.std(data):.1f}")
print(f"Std (ddof=1): {np.std(data, ddof=1):.1f}")

Answer

Mean: 133.0
Median: 45.0
Std (ddof=0): 289.4
Std (ddof=1): 305.1

The mean (133.0) is heavily inflated by the outlier value of 1000 — it's nearly three times the median (45.0). The standard deviation is enormous because of the same outlier. Note that `ddof=1` gives a slightly larger value than `ddof=0` — this is Bessel's correction, which divides by n-1 instead of n. For this dataset, the median (45.0) is a much better summary of what's "typical."
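To see where the two standard deviations come from, the sketch below recomputes them from the definitions using only the standard library. The only difference between the two is the divisor, n versus n - 1.

```python
from math import sqrt

data = [10, 20, 20, 30, 40, 50, 50, 50, 60, 1000]
n = len(data)
center = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - center) ** 2 for x in data)

pop_std = sqrt(ss / n)           # ddof=0: divide by n
sample_std = sqrt(ss / (n - 1))  # ddof=1: divide by n - 1 (Bessel's correction)

print(f"Sum of squares: {ss:.0f}")        # 837610
print(f"Std (ddof=0): {pop_std:.1f}")     # 289.4
print(f"Std (ddof=1): {sample_std:.1f}")  # 305.1
```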

Question 20. The following code computes z-scores. What does it print, and which value(s) would be flagged as outliers using a z > 2 threshold?

import numpy as np

data = np.array([12, 14, 11, 13, 15, 14, 13, 12, 14, 50])
mean = np.mean(data)
std = np.std(data, ddof=1)
z_scores = (data - mean) / std

for val, z in zip(data, z_scores):
    flag = " <-- OUTLIER" if abs(z) > 2 else ""
    print(f"Value: {val:4d}, z-score: {z:+.2f}{flag}")

Answer

The mean is 16.8 and the standard deviation (ddof=1) is approximately 11.73. The z-scores for the nine values between 11 and 15 all fall between about -0.15 and -0.49. The value 50 has a z-score of approximately +2.83, which exceeds the threshold of 2.0, so it is the only value flagged as an outlier.

Value:   12, z-score: -0.41
Value:   14, z-score: -0.24
Value:   11, z-score: -0.49
Value:   13, z-score: -0.32
Value:   15, z-score: -0.15
Value:   14, z-score: -0.24
Value:   13, z-score: -0.32
Value:   12, z-score: -0.41
Value:   14, z-score: -0.24
Value:   50, z-score: +2.83 <-- OUTLIER