Chapter 25 Quiz: Descriptive Statistics for Business Decisions


Instructions: Choose the best answer for each multiple-choice question. For short-answer questions, write 1–3 sentences. The answer key is at the end.

Multiple Choice

Question 1

Your company's average (mean) customer lifetime value is reported as $8,400. However, when you look at the median customer lifetime value, it is $3,200. The most likely explanation is:

A) There was a calculation error in the mean
B) A small number of very high-value customers are pulling the mean upward
C) A large number of customers have values near $8,400
D) The data is left-skewed

Question 2

A sales manager is comparing two regions. Region A has a mean monthly revenue of $120,000 with a standard deviation of $8,000. Region B has a mean monthly revenue of $122,000 with a standard deviation of $51,000. Which of the following is the most accurate business interpretation?

A) Region B is clearly the better-performing region because its mean is higher
B) Region A is the better-performing region because its mean is higher
C) Region B has higher average revenue but far less consistent performance; the higher mean comes with much higher risk
D) The two regions are essentially equal because their means are close

Question 3

The IQR (Interquartile Range) of a dataset measures:

A) The distance between the minimum and maximum values
B) The average deviation of each value from the mean
C) The spread of the middle 50% of values
D) The most common range in which values appear

Question 4

You calculate the correlation between marketing spend and new customer acquisitions across 24 months and get r = 0.82. Which of the following conclusions is justified?

A) Increasing marketing spend causes customer acquisitions to increase
B) Marketing spend and new customer acquisitions have a strong positive linear relationship
C) If you increase marketing spend by $10,000, you will gain exactly 82% more customers
D) Marketing spend causes 82% of all new customer acquisitions

Question 5

Which measure of central tendency is most resistant to outliers?

A) Mean
B) Median
C) Mode
D) Standard deviation

Question 6

A retail analyst reports: "Our average transaction value is $47." This number is most useful when:

A) The distribution of transaction values is right-skewed with a few very large transactions
B) There are many extreme outliers in both directions
C) Transaction values are relatively consistent across customers (low standard deviation)
D) The analyst is trying to understand what the typical transaction looks like after a big sale

Question 7

You are looking at a histogram of employee salaries. The shape shows most values clustered on the left side with a long tail extending to the right. This describes:

A) A left-skewed distribution
B) A right-skewed distribution
C) A normal distribution
D) A bimodal distribution

Question 8

What does a z-score of -2.8 for a particular value in a dataset indicate?

A) The value is 2.8 times the mean
B) The value is 2.8 percentage points below the mean
C) The value is 2.8 standard deviations below the mean
D) The value is in the bottom 2.8th percentile

Question 9

Simpson's Paradox refers to:

A) The tendency for averages to be misleadingly high when outliers are present
B) A situation where a trend that appears in separate groups disappears or reverses when the groups are combined
C) The finding that correlation does not imply causation
D) The phenomenon where the mean and median diverge in skewed distributions

Question 10

The standard deviation of daily sales in a bakery is $240 with a mean of $1,800. The coefficient of variation is approximately:

A) 24%
B) 13.3%
C) 1,560%
D) 7.5%

Question 11

A box plot's box spans from the 25th to the 75th percentile. The line inside the box represents:

A) The mean
B) The mode
C) The median (50th percentile)
D) The range

Question 12

Your company analyzes only its currently active customers to understand what drives long-term loyalty. A colleague points out that this analysis might suffer from:

A) Simpson's Paradox
B) Survivorship bias
C) Correlation vs. causation confusion
D) Skewness error

Question 13

You run df["revenue"].corr(df["support_costs"]) and get r = -0.61. This means:

A) Support costs cause revenue to decline
B) When revenue is higher, support costs tend to be lower
C) Support costs are 61% lower than revenue
D) There is a moderate negative linear relationship between revenue and support costs

Question 14

When pandas describe() shows that the 50th percentile (median) equals $145,000 and the mean equals $145,300, this indicates the data distribution is:

A) Heavily right-skewed
B) Heavily left-skewed
C) Approximately symmetric
D) Bimodal

Question 15

For inventory planning, you want to understand the typical variation in weekly demand for a product. Which statistic would be MOST useful?

A) Mean demand
B) Maximum demand
C) Standard deviation of demand
D) Mode of demand

Question 16

The 90th percentile of customer support resolution time at your company is 8.5 hours. Your SLA promises 90% of tickets resolved within 6 hours. This means:

A) You are meeting your SLA because 90% of tickets resolve in 8.5 hours or less
B) You are NOT meeting your SLA: 10% of tickets take more than 8.5 hours, well above your 6-hour promise
C) You are NOT meeting your SLA: the 90th percentile resolution time (8.5 hours) exceeds your 6-hour promise
D) The SLA is irrelevant because the mean resolution time might be under 6 hours

Question 17

Which of the following is the best description of what a histogram reveals that a single summary statistic cannot?

A) The exact mean and median of the dataset
B) The shape of the distribution: how values are spread, and whether there are clusters, tails, or multiple peaks
C) The specific outlier values in the dataset
D) The correlation between two variables

Question 18

A product manager says: "Our users who use Feature X have a 40% higher retention rate, so we should push Feature X to all users." What critical error might this reasoning contain?

A) The manager is confusing median with mean
B) The manager may be confusing correlation with causation: perhaps users who are already engaged both use Feature X and retain at higher rates
C) The manager should have used standard deviation instead of retention rate
D) The manager is ignoring the IQR

Short Answer

Question 19

You are the analytics lead for a consulting firm. A partner tells you: "Our average project value this year was $285,000 — up 40% from last year's $204,000." You check the data and find that this year you landed two unusually large contracts worth $1.2M and $980K that you did not have last year.

a) What would you calculate to give a more accurate picture of how the "typical" project value changed year over year?

b) Write the one-sentence correction you would deliver to the partner before the board presentation.

Question 20

Explain the difference between standard deviation and IQR as measures of spread. When would you prefer to use IQR over standard deviation in a business context? Give one specific business example.

Answer Key

Multiple Choice:

1. B: The mean is the "balancing point" of the distribution and is pulled toward outliers. A median of $3,200 with a mean of $8,400 indicates a right-skewed distribution driven by a small number of very high-value customers.
2. C: Region B's slightly higher mean comes with a standard deviation more than six times larger than Region A's ($51,000 vs. $8,000). The "better" region depends on whether management values predictability (Region A) or is willing to accept high variance for a marginally higher average (Region B).
3. C: IQR = Q3 - Q1, which spans the middle 50% of values. It ignores the top 25% and bottom 25% entirely, making it robust to outliers.
4. B: A correlation of 0.82 tells us there is a strong positive linear relationship. It does not prove causation (Answer A), does not say how much acquisitions change for a given change in spend (Answer C), and does not express the percentage of customers attributable to spending (Answer D).
5. B: The median is the middle value, so extreme values at either end barely move it. The mean is pulled toward outliers. The mode is often uninformative for continuous data, and standard deviation measures spread, not central tendency.
6. C: The mean is most useful and representative when values are clustered closely together (low standard deviation). When the distribution is skewed or has outliers, the median is a better "typical" measure.
7. B: Right-skewed (also called positively skewed) distributions have most values on the left with a tail extending right. Left-skewed is the opposite.
8. C: A z-score measures how many standard deviations a value is from the mean. A z-score of -2.8 means the value is 2.8 standard deviations below the mean. This does not translate to a specific percentile without knowing the shape of the distribution.
9. B: Simpson's Paradox is specifically the phenomenon where a trend in segmented groups reverses or disappears when the groups are combined, often due to unequal group sizes and a lurking variable.
10. B: Coefficient of Variation = (Std Dev / Mean) × 100 = ($240 / $1,800) × 100 ≈ 13.3%
11. C: The line inside the box in a standard box plot represents the median (50th percentile). The box spans Q1 to Q3.
12. B: Survivorship bias occurs when you only analyze cases that "survived" a selection process (in this case, customers who are still active). Customers who churned are excluded, which can lead to incorrect conclusions about what drives loyalty.
13. D: The correlation coefficient measures a linear relationship. r = -0.61 indicates a moderate negative relationship. The negative direction means that as one variable increases, the other tends to decrease, but this is not a statement about causation (eliminating A), and -0.61 is not a percentage (eliminating C). Answer B correctly describes the direction but says nothing about the strength or the linear nature of the relationship, so D is the more complete answer.
14. C: When mean ≈ median, the distribution is approximately symmetric. A large gap between mean and median signals skewness.
15. C: Standard deviation of demand tells you how much demand varies from week to week, which is what you need to plan safety stock and buffer capacity. The mean tells you the average, but you need the variability to plan for uncertainty.
16. C: Your SLA promises 90% of tickets resolved within 6 hours. Your actual 90th percentile is 8.5 hours, meaning 90% of tickets resolve within 8.5 hours, which is worse than the promised 6 hours. You are not meeting the SLA.
17. B: A histogram reveals the shape of the distribution: whether it is symmetric, skewed, bimodal, or uniform; where values cluster; and whether there are gaps or unusual patterns. A single statistic like the mean cannot capture this.
18. B: This is the classic correlation vs. causation error in the context of product analytics. Highly engaged users are more likely to both discover and use Feature X and to retain. The feature may not be causing retention; both might be caused by underlying engagement. A proper test would require a randomized experiment.
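Several of the calculations behind the multiple-choice answers (the coefficient of variation in Question 10, z-scores in Question 8, and the percentile check in Question 16) can be sketched in pandas. Apart from the $240 and $1,800 figures taken from Question 10, all data below is hypothetical and chosen only for illustration.

```python
import pandas as pd

# Question 10's figures: standard deviation $240, mean $1,800.
quiz_cv_percent = 240 / 1800 * 100

# Hypothetical daily sales (illustrative only, not from the quiz).
sales = pd.Series([1500, 1650, 1800, 1950, 2100], dtype=float)
mean = sales.mean()
std = sales.std()  # pandas default is the sample standard deviation (ddof=1)

# Coefficient of variation: spread expressed as a percentage of the mean.
cv_percent = std / mean * 100

# z-score: how many standard deviations each value sits from the mean.
z_scores = (sales - mean) / std

# Hypothetical ticket resolution times in hours (illustrative only).
resolution_hours = pd.Series([1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 5.5, 7.0, 9.0])
sla_hours = 6

# An SLA of "90% within 6 hours" is met only if the 90th percentile is <= 6.
p90 = resolution_hours.quantile(0.90)
meets_sla = p90 <= sla_hours

# Equivalent view: the share of tickets actually resolved within the SLA.
share_within_sla = (resolution_hours <= sla_hours).mean()

print(f"Quiz CV: {quiz_cv_percent:.1f}%")
print(f"Sample CV: {cv_percent:.1f}%")
print(f"90th percentile: {p90:.1f} h, meets SLA: {meets_sla}")
print(f"Share within SLA: {share_within_sla:.0%}")
```

Checking both the 90th percentile and the share of tickets within the SLA is a useful cross-check: the two views should always agree on whether the promise is being kept.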

Short Answer:

19. a) Calculate the median project value for both years and compare the medians. Alternatively, calculate the mean with and without the two outlier contracts to understand how much they are inflating this year's mean. Also consider reporting the adjusted figure explicitly: "excluding the two enterprise contracts, average project value was $X."

b) Example correction: "Before we present this number, we should note that two unusually large contracts are driving most of that 40% increase. The median project value, which reflects our typical engagement, rose by only X%, which is a more honest representation of how our core business grew."
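The comparison described in part a) can be sketched in pandas. The project lists below are hypothetical: only the two large contracts ($1.2M and $980K) and the two means ($204,000 and $285,000) come from the question, and the remaining values were made up so that the means match. Real data would give different medians.

```python
import pandas as pd

# Hypothetical project values in dollars, constructed so the means match
# the $204,000 and $285,000 quoted in the question.
last_year = pd.Series([150_000] * 6 + [200_000] * 6 + [262_000] * 6)
this_year = pd.Series(
    [150_000] * 6 + [200_000] * 7 + [244_000] * 5 + [980_000, 1_200_000]
)


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Headline comparison: means, which the two large contracts inflate.
mean_change = pct_change(last_year.mean(), this_year.mean())

# Robust comparison: medians, which reflect the typical project.
median_change = pct_change(last_year.median(), this_year.median())

# Sensitivity check: this year's mean without the two large contracts.
# The $900,000 cut-off is an assumption made for this example.
core_projects = this_year[this_year < 900_000]
core_mean_change = pct_change(last_year.mean(), core_projects.mean())

print(f"Mean change:   {mean_change:.1f}%")
print(f"Median change: {median_change:.1f}%")
print(f"Mean change excluding large contracts: {core_mean_change:.1f}%")
```

With these made-up numbers the mean rises by about 40% while the median is flat and the mean excluding the two large contracts falls slightly, which is the kind of gap the correction in part b) is meant to surface.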

20. Standard deviation measures how far data points typically fall from the mean, using every value in the dataset. It is sensitive to extreme values (outliers) because each deviation is squared before averaging, so a few values far from the mean contribute disproportionately, and because those outliers also pull the mean itself.

IQR measures only the spread of the middle 50% of the data, ignoring the top and bottom 25%. A handful of extreme values therefore has little or no effect on it.

Prefer IQR when: You have a skewed distribution or known outliers that you do not want to distort your spread measure. A good business example: measuring the consistency of daily delivery times when occasional extreme-weather delays create a long right tail. The IQR would show how consistent deliveries are on a typical day, while the standard deviation would be inflated by the rare extreme-weather days.
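The delivery-time example can be sketched as follows. The delivery times are hypothetical, and the `spread` helper is an illustrative name rather than a pandas function.

```python
import pandas as pd


def spread(values):
    """Return the sample standard deviation and IQR of a Series."""
    q1, q3 = values.quantile([0.25, 0.75])
    return {"std": values.std(), "iqr": q3 - q1}


# Hypothetical delivery times in minutes on ordinary days.
typical_days = pd.Series([28, 30, 31, 29, 32, 30, 27, 33, 31, 29], dtype=float)

# The same days plus two hypothetical extreme-weather delays.
with_storms = pd.concat(
    [typical_days, pd.Series([95.0, 120.0])], ignore_index=True
)

typical = spread(typical_days)
stormy = spread(with_storms)

print(f"Typical days: std={typical['std']:.1f}, IQR={typical['iqr']:.2f}")
print(f"With storms:  std={stormy['std']:.1f}, IQR={stormy['iqr']:.2f}")
```

In this sketch the two storm days multiply the standard deviation many times over, while the IQR moves only slightly, so the IQR remains a usable description of day-to-day consistency.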