Chapter 5 Quiz: Exploratory Data Analysis
Question 1
John Tukey distinguished between two modes of statistical analysis. Which correctly describes Exploratory Data Analysis (EDA)?
A) Starting with a hypothesis and testing it against data using p-values and confidence intervals
B) Starting with no hypothesis and letting the data reveal patterns, anomalies, and relationships
C) Building a predictive model and using its performance metrics to understand the data
D) Applying standard statistical tests to confirm that the data meets modeling assumptions
Question 2
A dataset of employee salaries has a mean of $95,000 and a median of $68,000. What can you conclude?
A) The data contains errors that should be corrected
B) The distribution is left-skewed with a long tail of low earners
C) The distribution is right-skewed, with a small number of high salaries pulling the mean upward
D) The mean and median are both unreliable for this dataset
Question 3
Which of the following is the best definition of the Interquartile Range (IQR)?
A) The range from the minimum to the maximum value in the dataset
B) The range of the middle 50% of data, calculated as Q3 minus Q1
C) The standard deviation multiplied by 2
D) The difference between the 90th and 10th percentiles
Question 4
A company's API response time has the following percentiles: P50 = 120ms, P90 = 450ms, P95 = 800ms, P99 = 3,200ms. The SLA guarantees that 95% of requests complete within 1,000 milliseconds. Based on these numbers:
A) The SLA is being violated: P95 exceeds 1,000ms
B) The SLA is being met: P95 (800ms) is below the 1,000ms threshold
C) There is not enough information to determine SLA compliance
D) The SLA is irrelevant because the P99 is too high
Question 5
Edward Tufte's "data-ink ratio" principle states that:
A) Charts should use at least 5 colors to distinguish data series
B) The proportion of a chart's ink devoted to actual data representation should be maximized
C) All charts should include gridlines, borders, and legends for completeness
D) 3D effects improve data comprehension by adding depth cues
Question 6
Consider the following Python code:
import matplotlib.pyplot as plt
# categories and values are assumed to be defined earlier
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(categories, values, color='steelblue')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
What do the last two lines accomplish?
A) They remove the x-axis and y-axis labels
B) They remove the top and right borders of the plot area, reducing visual clutter
C) They make the bars transparent on the top and right sides
D) They hide the tick marks on the top and right axes
Question 7
In a correlation heatmap, a cell showing r = -0.72 between "employee satisfaction" and "turnover rate" means:
A) Satisfied employees cause low turnover
B) There is a strong negative linear association: as satisfaction increases, turnover tends to decrease
C) There is no meaningful relationship between the variables
D) Turnover causes dissatisfaction
Question 8
A data scientist removes all data points beyond 2 standard deviations from the mean before building a model. Professor Okonkwo would likely critique this approach because:
A) 2 standard deviations is too aggressive; 3 standard deviations is the standard threshold
B) The removed points might represent the most important observations for the business question
C) Outlier removal should only be done after the model is built
D) Standard deviation is not a valid measure for outlier detection
Question 9
Which type of missing data (MCAR, MAR, or MNAR) is described by the following scenario?
"High-income survey respondents are more likely to leave the income question blank because they don't want to reveal their high income."
A) MCAR: the missingness is completely random
B) MAR: the missingness depends on other observed variables like age
C) MNAR: the probability of missingness depends on the missing value itself
D) None of the above: this is not a missing data problem
Question 10
What is the primary advantage of using seaborn's violinplot() over a standard boxplot()?
A) Violin plots are faster to render than box plots
B) Violin plots show the full distribution shape (density), while box plots only show summary statistics
C) Violin plots can handle categorical data, while box plots cannot
D) Violin plots automatically remove outliers
Question 11
The SCQA framework for data storytelling stands for:
A) Statistics, Charts, Queries, Analysis
B) Situation, Complication, Question, Answer
C) Summary, Context, Quantification, Action
D) Source, Computation, Quality, Assessment
Question 12
You create a scatter plot of two variables and find r = 0.02 (nearly zero correlation). However, the scatter plot clearly shows a U-shaped pattern. What should you conclude?
A) There is no relationship between the variables
B) There is a strong nonlinear relationship that the correlation coefficient fails to capture
C) The data contains errors that created a false pattern
D) The correlation should be recalculated using a larger sample size
Question 13
Examine the following code from the EDAReport class:
self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(
    include=['object', 'category']
).columns.tolist()
Why does the class separate numeric and categorical columns during initialization?
A) To improve runtime performance by using parallel processing
B) Because different analysis techniques apply to different data types: numeric columns get statistics and histograms, categorical columns get value counts and frequency tables
C) Because pandas cannot compute statistics on mixed data types
D) To prevent errors when computing the correlation matrix
Question 14
A correlation matrix reveals that customer satisfaction scores have near-zero correlation with actual purchasing behavior. Which of the following is the most appropriate business response?
A) Discard satisfaction survey data entirely because it is worthless
B) Investigate whether the survey is measuring the right construct, and whether the relationship might be nonlinear or mediated by other variables
C) Assume the correlation will appear with more data and continue collecting surveys unchanged
D) Replace the satisfaction survey with a purchase prediction model
Question 15
In the EDAReport class, the executive_summary() method automatically flags columns with skewness above 1.5. Why is this threshold meaningful?
A) Skewness above 1.5 indicates data errors
B) Skewness above 1.5 indicates a highly asymmetric distribution where the mean is substantially misleading and outliers will disproportionately affect models that assume normality
C) Skewness above 1.5 means the data should be deleted
D) Skewness above 1.5 is the standard threshold for statistical significance
Question 16 (Short Answer)
A dataset contains 50,000 customer records. The annual_revenue column has the following properties: mean = $12,400, median = $3,200, skewness = 4.7, kurtosis = 28.3.
In 3-4 sentences, explain what these statistics tell you about the customer revenue distribution and what implications this has for business decision-making.
Question 17 (Short Answer)
Explain the difference between MAR (Missing at Random) and MNAR (Missing Not at Random) using a real-world business example for each. Why does the distinction matter for data analysis?
Question 18
Which of the following is an example of the "Lie Factor" Tufte warns about?
A) A bar chart where the y-axis starts at zero
B) A chart showing a 10% increase in sales using a pictorial icon that is 3x larger than the comparison icon
C) A chart with too many gridlines
D) A pie chart with more than 5 slices
Question 19
In the EDAReport.plot_distributions() method, mean is shown as a red dashed line and median as an orange solid line. When these two lines are far apart on a histogram, it indicates:
A) The data has many missing values
B) The histogram bin width is too large
C) The distribution is substantially skewed, with outliers pulling the mean away from the center of the data
D) The data was generated incorrectly
Question 20
A consulting team spent $200,000 building a churn prediction model that achieved 94% accuracy. Professor Okonkwo says the model is useless. Why?
A) 94% accuracy is below the industry standard of 99%
B) The dataset was 94% non-churners, so the model likely learned to predict "no churn" for everyone, achieving high accuracy while failing at its actual objective
C) The consulting team used the wrong programming language
D) Accuracy is never a valid metric for classification models
Question 21 (Short Answer)
You are conducting EDA on a new dataset and discover that the customer_age column contains values ranging from -5 to 847. Describe the three steps you would take to investigate and handle this data quality issue. Reference specific EDA techniques from this chapter.
Question 22
The EDAReport class stores self.df = df.copy() rather than self.df = df. Why?
A) copy() makes the code run faster
B) copy() creates an independent copy so that any modifications during analysis (like dropping rows or imputing values) don't alter the original DataFrame
C) copy() is required by pandas to access column names
D) There is no practical difference between the two approaches
Question 23
When Ravi Mehta's team at Athena Retail Group discovered that customer satisfaction scores had near-zero correlation with purchasing behavior, the appropriate next step was to:
A) Delete the satisfaction data and cancel the survey program
B) Build a more complex model that accounts for the low correlation
C) Question whether the survey is measuring the right thing and investigate whether the relationship is nonlinear or mediated by unmeasured variables
D) Report to the CEO that customers are satisfied and no action is needed
Question 24 (Short Answer)
Professor Okonkwo says: "A chart that no one reads is a chart that failed." In 3-4 sentences, explain how the "So What?" test and the practice of using insight-driven titles (rather than descriptive titles) address this problem. Give one example of a descriptive title and its improved insight-driven version.
Question 25
Which of the following correctly describes kurtosis?
A) A measure of how peaked or flat a distribution is
B) A measure of how heavy the tails of a distribution are, indicating the probability of extreme values
C) A measure of how skewed a distribution is
D) A measure of how many outliers exist in a dataset
Answer Key
1. B
2. C
3. B
4. B — P95 is 800ms, which is below the 1,000ms SLA threshold. While P99 is high, the SLA specifies 95%, not 99%.
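The reasoning in answer 4 can be checked directly against raw latencies. The sketch below is illustrative only: the function name sla_met and the sample latencies are hypothetical, not taken from the chapter, and the sample is built so that 98 of 100 requests finish within 1,000ms.

```python
import numpy as np

def sla_met(latencies_ms, threshold_ms=1000.0, required_fraction=0.95):
    """Return True if at least required_fraction of requests finish within threshold_ms."""
    latencies = np.asarray(latencies_ms, dtype=float)
    fraction_within = float(np.mean(latencies <= threshold_ms))
    return fraction_within >= required_fraction

# Hypothetical sample: 90 fast requests, 8 slower ones, and 2 very slow ones.
sample_latencies = [100] * 90 + [500] * 5 + [900] * 3 + [3000] * 2

print(sla_met(sample_latencies))                          # checks the 95% SLA
print(sla_met(sample_latencies, required_fraction=0.99))  # checks a stricter 99% SLA
```

The same data can satisfy a 95% guarantee and fail a 99% one, which is why the high P99 in the question does not by itself violate the SLA.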
5. B
6. B
7. B — Correlation measures association, not causation. The statement describes the direction and strength of the linear relationship without implying causality.
8. B
9. C
10. B
11. B
12. B — The correlation coefficient only captures linear relationships. A U-shaped pattern is a clear nonlinear relationship that r = 0.02 completely misses.
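A short numerical check of answer 12, assuming an idealized U-shaped relationship (y equal to x squared on a grid symmetric around zero). The data are synthetic and chosen only to make the effect easy to see.

```python
import numpy as np

# Synthetic U-shaped data: y depends entirely on x, but not linearly.
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

# Pearson r between x and y is essentially zero despite the perfect dependence.
r_linear = float(np.corrcoef(x, y)[0, 1])

# Correlating y with the squared term recovers the relationship.
r_squared_term = float(np.corrcoef(x ** 2, y)[0, 1])

print(f"r(x, y)   = {r_linear:.4f}")
print(f"r(x^2, y) = {r_squared_term:.4f}")
```

This is why a scatter plot should accompany any correlation coefficient: the coefficient alone cannot distinguish "no relationship" from "no linear relationship".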
13. B
14. B
15. B
16. Sample answer: This distribution is extremely right-skewed (skewness = 4.7), meaning a small number of customers generate vastly more revenue than the typical customer. The mean ($12,400) is nearly 4x the median ($3,200), indicating that "average revenue" severely overstates what a typical customer contributes. The very high kurtosis (28.3) signals fat tails — extreme outliers are far more common than a normal distribution would predict. For business purposes, median revenue and percentile breakdowns should be used for planning, and the company should investigate whether its high-value tail represents a strategically distinct customer segment.
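The statistics discussed in answer 16 can be reproduced in spirit with a few lines of NumPy. This is a sketch on a synthetic, seeded lognormal sample, which is an assumption standing in for the annual_revenue column; it will not match the exact figures in the question.

```python
import numpy as np

# Synthetic stand-in for annual_revenue (assumed lognormal, not the quiz dataset).
rng = np.random.default_rng(42)
revenue = rng.lognormal(mean=8.0, sigma=1.5, size=50_000)

mean_rev = revenue.mean()
median_rev = np.median(revenue)

# Population skewness: the average cubed z-score.
z = (revenue - mean_rev) / revenue.std()
skewness = float(np.mean(z ** 3))

# Percentile breakdown, which describes typical customers better than the mean alone.
p25, p50, p75, p95, p99 = np.percentile(revenue, [25, 50, 75, 95, 99])

# Share of total revenue contributed by the top 1% of customers.
top_1pct_share = float(revenue[revenue >= p99].sum() / revenue.sum())

print(f"mean={mean_rev:,.0f} median={median_rev:,.0f} skewness={skewness:.1f}")
print(f"top 1% of customers contribute {top_1pct_share:.0%} of revenue")
```

Reporting the median, the percentiles, and the revenue share of the top segment gives planners a more faithful picture than a single mean.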
17. Sample answer: MAR means the probability of a value being missing depends on other observed variables. For example, younger employees may be less likely to report salary on an internal survey — the missingness in salary relates to age (which is observed), not to salary itself. MNAR means the probability of a value being missing depends on the missing value itself. For example, customers with extremely high return rates may be less likely to complete post-purchase surveys because they are dissatisfied — the missingness relates directly to the satisfaction level we're trying to measure. The distinction matters because MAR can be addressed through imputation using the observed predictor variables, while MNAR creates systematic bias that no imputation method can fully correct.
18. B
19. C
20. B
21. Sample answer: First, I would use descriptive statistics (min, max, percentiles) and a histogram to understand the full distribution and see how many values fall outside a reasonable range (e.g., 0-120). Second, I would flag impossible values with a domain rule (negative ages, ages above 120), use the IQR outlier detection method to surface any remaining suspicious values, and examine the flagged records to check whether they are systematic errors (e.g., from a specific data source) or random typos. Third, I would treat the invalid values as missing, add a flag column marking those records, and decide whether to impute with the median, correct from source systems, or exclude them, documenting the decision and its impact on the dataset size.
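The three steps in answer 21 might look like this in pandas. The sample values, the 0 to 120 validity range, and the column names age_invalid and customer_age_clean are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the real customer_age column would come from the dataset.
df = pd.DataFrame({"customer_age": [34, 27, -5, 45, 847, 61, 38, 0, 120, 52]})

# Step 1: summary statistics expose the impossible minimum and maximum.
summary = df["customer_age"].describe()

# Step 2: flag values outside an assumed plausible range (0 to 120 inclusive).
valid = df["customer_age"].between(0, 120)
df["age_invalid"] = ~valid

# Step 3: treat invalid ages as missing, keeping the flag so the change is auditable.
df["customer_age_clean"] = df["customer_age"].where(valid)

n_invalid = int(df["age_invalid"].sum())
median_valid_age = df["customer_age_clean"].median()

print(summary[["min", "max"]])
print(f"{n_invalid} invalid ages; median of valid ages = {median_valid_age}")
```

Keeping the original column alongside the cleaned one makes it easy to revisit the decision if the source system later supplies corrected values.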
22. B
23. C
24. Sample answer: The "So What?" test requires that every chart immediately suggest an action or decision — not just display information. A chart titled "Revenue by Quarter" describes data; a chart titled "Q3 Revenue Dropped 12%, Driven by European Markets" communicates an insight that demands attention. Insight-driven titles do the interpretive work for the reader, which is crucial when presenting to executives who spend seconds, not minutes, on each visual. This practice ensures that charts are communication tools, not decoration.
25. B — Kurtosis measures tail heaviness (the probability of extreme values), not peakedness. This is a common textbook misconception. High kurtosis means extreme events are more likely than a normal distribution would predict.
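To make the tail interpretation in answer 25 concrete, the sketch below compares a normal sample with a heavier-tailed Laplace sample. The helper names excess_kurtosis and tail_fraction are hypothetical, and the Laplace distribution is just one convenient heavy-tailed example.

```python
import numpy as np

def excess_kurtosis(x):
    """Population excess kurtosis: mean fourth-power z-score minus 3 (0 for a normal)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def tail_fraction(x, k=3.0):
    """Fraction of observations more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(np.abs(z) > k))

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=200_000)
laplace_sample = rng.laplace(size=200_000)

for name, sample in [("normal", normal_sample), ("laplace", laplace_sample)]:
    print(f"{name}: excess kurtosis={excess_kurtosis(sample):.2f}, "
          f"share beyond 3 sd={tail_fraction(sample):.4f}")
```

The sample with the higher kurtosis also has a larger share of observations beyond three standard deviations, which is the practical meaning of "heavy tails".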