Chapter 2 Quiz: Thinking Like a Data Scientist
Multiple Choice
Question 1. Which phase of CRISP-DM is most commonly associated with data science project failures?
- (a) Data Preparation
- (b) Modeling
- (c) Business Understanding
- (d) Deployment
Question 2. A supermarket finds that sales of umbrellas and sales of hot soup are positively correlated. Which of the following is the most likely explanation?
- (a) Buying umbrellas makes people want soup
- (b) Buying soup makes people want umbrellas
- (c) Both are driven by a confounding variable (rainy/cold weather)
- (d) This is a coincidence with no underlying connection
Question 3. A customer survey uses a 5-point scale (1 = Very Dissatisfied to 5 = Very Satisfied). This data is best classified as:
- (a) Nominal
- (b) Ordinal
- (c) Interval
- (d) Ratio
Question 4. What percentage of total project time do most data science practitioners estimate is spent on data preparation?
- (a) 10–20%
- (b) 30–40%
- (c) 60–80%
- (d) 90–95%
Question 5. An analyst examines 200 potential correlations in a dataset and finds 12 that are "statistically significant" at the p < 0.05 level. Which of the following is the most appropriate interpretation?
- (a) All 12 correlations represent genuine relationships
- (b) Roughly 10 of the 12 might be expected by chance alone
- (c) The analyst should report only the strongest correlation
- (d) Statistical significance at p < 0.05 guarantees practical importance
Question 6. "Which of our at-risk customers should receive a personalized discount offer, and what discount amount maximizes the probability of retention while minimizing revenue loss?" This is an example of:
- (a) Descriptive analytics
- (b) Diagnostic analytics
- (c) Predictive analytics
- (d) Prescriptive analytics
Question 7. Which of the following is NOT a characteristic of the data science mindset as described in this chapter?
- (a) Skepticism before certainty
- (b) Comfort with uncertainty
- (c) Reliance on intuition over process
- (d) Emphasis on reproducibility
Question 8. A company finds that employees who participate in its mentoring program receive higher performance ratings. The CEO wants to require all employees to participate. What is the primary analytical concern with this plan?
- (a) The mentoring program may be too expensive to scale
- (b) Self-selection bias — high-performing employees may disproportionately choose to participate
- (c) Performance ratings are not measured on a ratio scale
- (d) The data is unstructured
Question 9. Semi-structured data is best described as:
- (a) Data that has been partially cleaned
- (b) Data with some organizational properties (tags, hierarchies) but not in rigid row-and-column format
- (c) Data that is half structured and half unstructured
- (d) Data stored in a data lake
Question 10. An organization reports that its "average customer lifetime value is $8,500." Which of the following scenarios would make this single number most misleading?
- (a) The data follows a normal distribution with a small standard deviation
- (b) All customers have lifetime values between $7,000 and $10,000
- (c) The distribution is heavily right-skewed, with most customers worth $2,000 and a few worth $50,000+
- (d) The data was collected from a large, representative sample
Question 11. In the CRISP-DM framework, what distinguishes the Evaluation phase from the Modeling phase?
- (a) Evaluation uses more advanced algorithms
- (b) Evaluation assesses both technical performance and business alignment
- (c) Evaluation is only done once, while modeling is iterative
- (d) Evaluation uses different data than modeling
Question 12. The term "dark data" refers to:
- (a) Data obtained through illegal or unethical means
- (b) Data that has been corrupted or lost
- (c) Data that organizations collect and store but never analyze
- (d) Unstructured data that cannot be processed by current technology
Question 13. A sales team's worst-performing member improves significantly the following quarter without any intervention. This is most likely an example of:
- (a) The Hawthorne effect
- (b) Regression to the mean
- (c) The law of large numbers
- (d) Survivorship bias
Question 14. Which of the following correctly describes the relationship between batch processing and stream processing?
- (a) Batch processing is always superior for business applications
- (b) Stream processing is always superior for business applications
- (c) Batch processing handles data in periodic chunks; stream processing handles data continuously in real time
- (d) They are different names for the same process
Question 15. A product team encodes product categories as numeric values (Electronics = 1, Clothing = 2, Home & Garden = 3) and feeds them into a regression model. What problem does this create?
- (a) The model will ignore these variables
- (b) The model will treat the categories as having a meaningful numeric order and distance
- (c) The model will crash due to non-numeric input
- (d) No problem — this is the correct approach
Short Answer
Question 16. Explain the difference between statistical significance and practical significance. Give a business example where a result might be statistically significant but practically meaningless.
Question 17. A health insurance company discovers that customers who download its mobile wellness app have 28% lower claims costs than customers who don't. The CMO proposes a campaign to drive app downloads among all members, projecting that this will reduce total claims costs by 28%.
Identify the analytical flaw in this reasoning. What type of causal error is being made? What would you recommend instead?
Question 18. Describe the "last mile" problem of analytics. Name two specific practices an organization can adopt to improve the translation of analytical insights into business action.
Question 19. You're evaluating two data science projects for funding:
- Project A proposes a sophisticated deep learning model that achieves 97% accuracy on a test dataset but requires dedicated GPU infrastructure and a team of three ML engineers to maintain.
- Project B proposes a logistic regression model that achieves 89% accuracy and can be deployed as a simple API endpoint maintained by the existing IT team.
Both projects address the same business problem. Assuming the 8-percentage-point accuracy gap translates to approximately $200,000 per year in additional value, but Project A costs $400,000 more per year to operate, which project would you recommend and why? Connect your answer to concepts from this chapter.
Question 20. Explain what a confounding variable is. Then consider this finding: "Cities with more Starbucks locations have higher average housing prices." Identify the most likely confounding variable(s) and explain why acting on this correlation (e.g., opening Starbucks to increase housing values) would be misguided.
Answer Key
1. (c) Business Understanding. The chapter emphasizes that most failed data science projects fail because the business problem was poorly defined, not because of technical issues.
2. (c) Both are driven by a confounding variable (rainy/cold weather). Rainy or cold weather drives both umbrella purchases and soup purchases.
3. (b) Ordinal. The scale has a meaningful order (Very Dissatisfied < Dissatisfied < Neutral < Satisfied < Very Satisfied) but the intervals between points are not necessarily equal.
4. (c) 60–80%. This is the widely cited estimate for time spent on data preparation in data science projects.
5. (b) Roughly 10 of the 12 might be expected by chance alone. At a threshold of p < 0.05, testing 200 hypotheses yields about 10 false positives on average even when no real relationships exist (200 × 0.05 = 10). This is the multiple comparisons problem.
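The arithmetic behind answer 5 can be checked with a few lines of Python. This is a minimal sketch: the independence of the 200 tests and the Bonferroni correction are assumptions added for illustration and are not part of the quiz question.

```python
# Figures taken from Question 5.
n_tests = 200   # correlations examined
alpha = 0.05    # significance threshold

# Expected number of "significant" results if every true correlation is zero.
expected_false_positives = n_tests * alpha

# Chance of at least one false positive, assuming the tests are independent
# (an assumption; correlated tests would change this figure).
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni-adjusted threshold: one common, conservative correction.
bonferroni_alpha = alpha / n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_at_least_one:.5f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.5f}")
```

Under these assumptions, a per-test threshold of 0.00025 rather than 0.05 would be needed to keep the overall false-positive risk near 5%.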
6. (d) Prescriptive analytics. The question asks what action to take and optimizes across multiple objectives (retention probability vs. revenue loss).
7. (c) Reliance on intuition over process. The data science mindset emphasizes process orientation and reproducibility, not intuition.
8. (b) Self-selection bias. High-performing or ambitious employees are more likely to self-select into the mentoring program, so the correlation between participation and performance may not be causal. Mandating participation for all employees would not necessarily produce the same effect.
9. (b) Data with some organizational properties but not in rigid row-and-column format. JSON, XML, email headers, and HTML are common examples.
10. (c) A heavily right-skewed distribution. When most customers are worth $2,000 but a few are worth $50,000+, the average of $8,500 describes neither the typical customer nor the high-value outliers. The median would be far more informative.
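To make answer 10 concrete, the sketch below builds a hypothetical customer base (87 customers worth $2,000 and 13 worth $52,000, numbers chosen only so that the mean comes out to $8,500) and compares the mean with the median.

```python
from statistics import mean, median

# Hypothetical data, not from the chapter: most customers are worth $2,000
# and a few are worth more than $50,000.
lifetime_values = [2_000] * 87 + [52_000] * 13

avg_ltv = mean(lifetime_values)
median_ltv = median(lifetime_values)

# Share of customers whose value falls below the reported average.
share_below_avg = sum(v < avg_ltv for v in lifetime_values) / len(lifetime_values)

print(f"Mean:   ${avg_ltv:,.0f}")
print(f"Median: ${median_ltv:,.0f}")
print(f"Customers below the mean: {share_below_avg:.0%}")
```

In this illustration no customer is actually worth $8,500, and 87% of customers fall below the reported average.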
11. (b) Evaluation assesses both technical performance and business alignment. The Modeling phase focuses on building and optimizing models; the Evaluation phase asks whether the model actually solves the business problem defined in Phase 1.
12. (c) Data that organizations collect and store but never analyze. A widely cited industry estimate, usually attributed to Forrester, is that 60–73% of enterprise data goes unused for analytics.
13. (b) Regression to the mean. Extreme performance in one period, whether unusually poor or unusually good, is likely to be followed by less extreme performance regardless of intervention, because the extreme result included a component of luck or randomness.
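Regression to the mean in answer 13 can be illustrated with a small simulation. Everything here is a hypothetical model for illustration: each salesperson's quarterly result is a stable skill level plus independent random luck, and the team size, spreads, and seed are arbitrary assumptions.

```python
import random

def worst_performer_improves(rng, n_reps=20, skill_sd=10.0, luck_sd=20.0):
    """Simulate two quarters; report whether Q1's worst rep scores higher in Q2."""
    skills = [rng.gauss(100, skill_sd) for _ in range(n_reps)]
    q1 = [s + rng.gauss(0, luck_sd) for s in skills]  # skill + luck in quarter 1
    q2 = [s + rng.gauss(0, luck_sd) for s in skills]  # same skill, fresh luck
    worst = min(range(n_reps), key=lambda i: q1[i])   # index of worst Q1 result
    return q2[worst] > q1[worst]

rng = random.Random(42)  # fixed seed so the run is repeatable
n_trials = 2_000
improved_share = sum(worst_performer_improves(rng) for _ in range(n_trials)) / n_trials

print(f"Worst Q1 performer improved in Q2 in {improved_share:.0%} of simulated teams")
```

No intervention exists anywhere in this model, yet the worst performer improves in the large majority of simulated teams, simply because an extreme quarter tends to include unusually bad luck.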
14. (c) Batch processing handles data in periodic chunks; stream processing handles data continuously in real time. The choice depends on business requirements for timeliness.
15. (b) The model will treat the categories as having a meaningful numeric order and distance. It will interpret "Clothing" as between "Electronics" and "Home & Garden," and treat the difference between 1 and 2 as equivalent to the difference between 2 and 3. Nominal data should be one-hot encoded instead.
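A minimal sketch of the one-hot encoding mentioned in answer 15, written in plain Python so that it is self-contained. The helper `one_hot` and its `drop_first` flag are hypothetical names introduced here for illustration.

```python
def one_hot(values, drop_first=False):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # Drop one category as the reference level. With an intercept in a
        # regression, keeping every column makes the indicators collinear.
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

products = ["Electronics", "Clothing", "Home & Garden", "Clothing"]
encoded = one_hot(products)

for label, row in zip(products, encoded):
    print(label, row)
```

Each category becomes its own indicator column, so the model no longer assumes that "Clothing" lies halfway between "Electronics" and "Home & Garden". Libraries such as pandas offer an equivalent through `get_dummies`.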
16. Statistical significance indicates that a result would be unlikely to occur by chance alone if there were no real effect (typically p < 0.05). Practical significance indicates that the result is large enough to matter for business purposes. Example: with a large enough sample (on the order of a billion visitors per variant), an A/B test might find that Button A converts at 2.501% and Button B at 2.503%. The result is statistically significant, but the difference is only 0.002 percentage points, which has no practical business value.
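Whether a gap as small as 2.501% versus 2.503% reaches statistical significance depends heavily on sample size. The sketch below applies a standard pooled two-proportion z-test at two illustrative sample sizes. Both sample sizes, the equal split between variants, and the helper name `two_proportion_z_test` are assumptions made for this example.

```python
from math import erfc, sqrt

def two_proportion_z_test(p_a, p_b, n_per_arm):
    """Pooled two-proportion z-test with equal arm sizes.

    Returns (z, two-sided p-value) under a normal approximation.
    """
    pooled = (p_a + p_b) / 2                      # pooled rate for equal arms
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))              # equals 2 * (1 - Phi(|z|))
    return z, p_value

rate_a, rate_b = 0.02501, 0.02503  # conversion rates from the example

for n in (5_000_000, 1_000_000_000):  # hypothetical visitors per variant
    z, p = two_proportion_z_test(rate_a, rate_b, n)
    print(f"n per variant = {n:>13,}: z = {z:.2f}, p = {p:.4f}")
```

The observed difference is the same in both runs. Only the sample size changes the verdict, which is why a significant p-value says nothing by itself about whether the effect is large enough to act on.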
17. The flaw is confusing correlation with causation due to self-selection bias. People who download a wellness app are already more health-conscious — they would likely have lower claims regardless of the app. The 28% difference reflects who chooses to download the app, not the app's causal effect. A randomized experiment (offering the app to a randomly selected group while withholding it from a control group) would be needed to estimate the app's true causal impact. The projected 28% reduction is almost certainly a dramatic overestimate.
18. The last mile problem is the gap between generating an analytical insight and translating it into changed behavior or improved decisions. Two practices to address it: (1) Embed analytics directly into decision processes — integrate insights into regular decision cadences (pricing reviews, planning cycles) rather than producing standalone reports. (2) Design for action before beginning analysis — define who will act on the findings, what decision the analysis informs, and what format the output needs to be in for the decision-maker.
19. Project B is the stronger choice. Project A generates $200,000 in additional annual value but costs $400,000 more per year to operate, a net loss of $200,000 per year relative to Project B. This connects to the chapter's principle that "the best model is not the most technically sophisticated one — it's the one that best solves the business problem within the constraints of the operating environment." Project B delivers about 92% of Project A's accuracy (89% vs. 97%) at a sustainable cost. This reflects the Evaluation phase of CRISP-DM, which requires assessing business viability alongside technical performance.
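The comparison in answer 19 reduces to simple arithmetic, sketched below using only the figures given in Question 19. The variable names are illustrative.

```python
# Figures from Question 19 (annual, Project A relative to Project B).
extra_value_a = 200_000   # additional value from the 8-point accuracy gain
extra_cost_a = 400_000    # additional operating cost

net_benefit_a_vs_b = extra_value_a - extra_cost_a
value_per_cost_dollar = extra_value_a / extra_cost_a

# Share of Project A's accuracy that Project B retains.
accuracy_ratio = 0.89 / 0.97

print(f"Net annual benefit of A over B: ${net_benefit_a_vs_b:,}")
print(f"Value returned per extra dollar spent on A: ${value_per_cost_dollar:.2f}")
print(f"B's accuracy as a share of A's: {accuracy_ratio:.1%}")
```

On these numbers, every additional dollar spent on Project A returns 50 cents, so the accuracy gain would need to be worth at least twice the stated estimate before Project A breaks even.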
20. A confounding variable is one that influences both the independent and dependent variables, creating a spurious association. For the Starbucks/housing example, the key confounders are population density, urban development, and affluence. Wealthy, dense, urban areas attract both Starbucks locations and high housing prices. Starbucks selects locations based on foot traffic and purchasing power — factors that also correlate with housing costs. Opening a Starbucks in a low-income area would not cause housing prices to rise because the correlation is driven by underlying economic factors, not by Starbucks itself.