Chapter 2 Exercises: Thinking Like a Data Scientist
Section A: Recall and Comprehension
Exercise 2.1 — List the six phases of CRISP-DM in order. For each phase, write one sentence describing its primary purpose.
Exercise 2.2 — Define the following terms in your own words, providing a business example for each:
- (a) Confounding variable
- (b) Spurious correlation
- (c) Confirmation bias
- (d) Regression to the mean
- (e) Dark data
Exercise 2.3 — Explain the difference between structured and unstructured data. Give three examples of each that might exist in a retail company's data ecosystem.
Exercise 2.4 — Name the four types of business questions in the analytics maturity framework. For each type, provide an example question that a hospital administrator might ask.
Exercise 2.5 — Describe the four measurement scales (nominal, ordinal, interval, ratio). For each scale, identify one variable from an employee database that would be measured at that scale.
Exercise 2.6 — What is the "last mile" problem in analytics? Identify three reasons why analytical insights often fail to drive organizational action.
Section B: Application
Exercise 2.7 — A marketing director presents the following finding: "Customers who read our email newsletter purchase 40% more than customers who don't." She recommends investing $500,000 in expanding the newsletter program to all customers.
- (a) Identify at least two confounding variables that could explain this correlation without a causal relationship.
- (b) Describe a reverse-causation explanation for this finding.
- (c) Propose an experiment that could help establish whether the newsletter causes increased purchasing.
Exercise 2.8 — You are a data scientist at a mid-size software company. The CEO tells you: "We need to use AI to improve sales." Apply the Business Understanding phase of CRISP-DM to transform this vague directive into a well-defined data science problem. Write:
- (a) A specific, measurable business objective.
- (b) The data science problem it translates to.
- (c) Success criteria, both technical and business.
- (d) At least three questions you would ask stakeholders before beginning the project.
Exercise 2.9 — Classify each of the following business questions as descriptive, diagnostic, predictive, or prescriptive:
- (a) What was our customer acquisition cost by channel last quarter?
- (b) Why did renewal rates decline among enterprise clients?
- (c) Which of our current customers are most likely to upgrade to the premium tier?
- (d) What discount level should we offer each at-risk customer to maximize the probability of renewal while minimizing revenue loss?
- (e) How many support tickets were resolved within our SLA target this month?
- (f) What combination of pricing, bundling, and promotions will maximize revenue per user over the next 12 months?
- (g) What factors explain the difference in conversion rates between our mobile and desktop experiences?
- (h) Which job applicants are most likely to accept an offer if extended one?
Exercise 2.10 — For each of the following variables, identify the measurement scale (nominal, ordinal, interval, ratio) and explain why you chose that classification:
- (a) A customer's response to the Net Promoter Score survey question (0–10 scale)
- (b) Product category (Electronics, Clothing, Home & Garden, etc.)
- (c) Temperature in a warehouse (measured in Fahrenheit)
- (d) Annual revenue
- (e) Employee satisfaction ranking (1st, 2nd, 3rd most satisfied department)
- (f) Number of customer service calls
- (g) ZIP code
- (h) Credit rating (AAA, AA, A, BBB, BB, B, etc.)
Exercise 2.11 — A SaaS company reports: "Our average customer lifetime value (CLV) is $14,200." Describe a scenario in which this single number could be highly misleading. What additional information about the distribution would you want, and how would it change your strategic recommendations?
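As a possible starting point for Exercise 2.11, the sketch below builds a small customer base whose mean CLV is exactly $14,200 and compares the mean with the median. The customer counts and dollar amounts are invented for illustration only; they do not come from the chapter or from any real company.

```python
from statistics import mean, median

# Hypothetical customer base (invented for illustration):
# 95 small accounts and 5 very large enterprise accounts.
small_accounts = [2_000] * 95
large_accounts = [246_000] * 5
clv = small_accounts + large_accounts

average_clv = mean(clv)    # the single number the company reports
typical_clv = median(clv)  # the value at the middle of the distribution

# Share of total lifetime value contributed by the 5 largest accounts
top_share = sum(large_accounts) / sum(clv)

print(f"Mean CLV:   ${average_clv:,.0f}")
print(f"Median CLV: ${typical_clv:,.0f}")
print(f"Share of total value from the largest 5% of customers: {top_share:.1%}")
```

In this made-up scenario the mean matches the reported figure while the median customer is worth far less, which is one reason to ask for the median, the spread, and the concentration of value before drawing strategic conclusions.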
Exercise 2.12 — Map the five stages of the data pipeline to a specific business scenario of your choosing (e.g., a ride-sharing app, an online retailer, a hospital system). For each stage, identify one thing that could go wrong and describe the downstream consequences.
Section C: Analysis and Critical Thinking
Exercise 2.13 — Read the following claim and evaluate it critically, using concepts from this chapter:
"A study of 500 companies found that those with a Chief Data Officer (CDO) had 23% higher revenue growth over three years compared to companies without a CDO."
- (a) Identify at least three confounding variables that could explain this relationship.
- (b) Could reverse causation be at work? Explain.
- (c) What additional information would you need to assess whether the CDO position causes revenue growth?
- (d) If you were a CEO deciding whether to create a CDO position, how should you weight this evidence?
Exercise 2.14 — The Athena Retail Group case in this chapter illustrated how three departments each had data supporting their preferred explanation for a customer satisfaction decline. This is sometimes called the "advocacy trap" — using data to advocate for a predetermined conclusion rather than to investigate objectively.
- (a) Why is the advocacy trap so common in organizations?
- (b) What structural or process changes could an organization implement to reduce the advocacy trap?
- (c) How does hypothesis-driven analysis help mitigate this problem?
- (d) Can you think of a real-world example (from your own experience or from news coverage) where the advocacy trap led to a poor business decision?
Exercise 2.15 — A product manager presents the following A/B test results:
- Version A (control): 10,000 visitors, 312 conversions (3.12%)
- Version B (variant): 10,000 visitors, 347 conversions (3.47%)
- p-value: 0.18
The product manager says: "Version B is clearly better — it has a higher conversion rate. Let's roll it out." Evaluate this recommendation using the concepts of statistical significance, practical significance, and uncertainty from Section 2.8.
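For readers who want to check the numbers in Exercise 2.15 themselves, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The exercise does not say which test produced the reported p-value, so the choice of test, the optional continuity correction, and the function name `two_proportion_p_value` are all assumptions made for illustration.

```python
from math import erfc, sqrt


def two_proportion_p_value(x_a, n_a, x_b, n_b, continuity=False):
    """Two-sided p-value for H0: both versions share one conversion rate.

    Assumes a pooled two-proportion z-test (normal approximation).
    """
    rate_a, rate_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    diff = abs(rate_b - rate_a)
    if continuity:
        # Yates-style continuity correction: shrink the observed gap slightly
        diff = max(0.0, diff - 0.5 * (1 / n_a + 1 / n_b))
    z = diff / std_err
    # Two-sided tail area of the standard normal distribution
    return erfc(z / sqrt(2))


# Figures from the exercise
p_plain = two_proportion_p_value(312, 10_000, 347, 10_000)
p_corrected = two_proportion_p_value(312, 10_000, 347, 10_000, continuity=True)

absolute_lift = 347 / 10_000 - 312 / 10_000  # difference in conversion rate
relative_lift = 347 / 312 - 1                # proportional improvement

print(f"p-value without correction: {p_plain:.3f}")
print(f"p-value with correction:    {p_corrected:.3f}")
print(f"absolute lift: {absolute_lift:.2%}, relative lift: {relative_lift:.1%}")
```

Under these assumptions both p-values sit well above the conventional 0.05 threshold, and the continuity-corrected value comes out close to the 0.18 reported in the exercise. Whether the observed lift would matter commercially if it were real is a separate question of practical significance.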
Exercise 2.16 — Consider this scenario: A city government notices that neighborhoods with more police officers have higher crime rates. A city council member proposes reducing police presence in high-crime neighborhoods, arguing that "the data clearly shows police presence causes crime."
- (a) Identify the flaw in this reasoning.
- (b) Name the type of causal error being made.
- (c) Propose a better analytical approach to understanding the relationship between police presence and crime.
Exercise 2.17 — A venture capital firm uses the following model to evaluate startup investments: they score each company on 50 different metrics, then invest in companies that score in the top 10% across the most metrics. After two years, they notice that their portfolio companies perform no better than the market average, despite having strong scores at the time of investment.
- (a) Explain, using the concept of regression to the mean, why this outcome is predictable.
- (b) How might the multiple comparisons problem (testing 50 metrics) contribute to poor selection?
- (c) Propose a more rigorous approach to startup evaluation that addresses these issues.
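To build intuition for part (a) of Exercise 2.17, the following sketch simulates regression to the mean. It assumes, purely for illustration, that each startup's observed result is the sum of a stable "quality" component and an independent dose of luck, both drawn from standard normal distributions. The model, the number of startups, and all variable names are hypothetical and are not taken from the chapter.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is repeatable

N_STARTUPS = 10_000

# Assumption: observed performance = stable quality + transient luck
quality = [random.gauss(0, 1) for _ in range(N_STARTUPS)]
score_at_investment = [q + random.gauss(0, 1) for q in quality]
later_performance = [q + random.gauss(0, 1) for q in quality]

# Select the top 10% of startups by their score at the time of investment
cutoff = sorted(score_at_investment, reverse=True)[N_STARTUPS // 10 - 1]
selected = [i for i, s in enumerate(score_at_investment) if s >= cutoff]

selected_score = mean(score_at_investment[i] for i in selected)
selected_later = mean(later_performance[i] for i in selected)

print(f"Selected startups, mean score at investment: {selected_score:.2f}")
print(f"Selected startups, mean later performance:   {selected_later:.2f}")
print(f"All startups, mean later performance:        {mean(later_performance):.2f}")
```

In this toy model the selected group's later performance falls back toward the overall average because part of what earned the high score was luck that does not repeat. The more a score reflects luck rather than stable quality, the closer the selected group's later results drift to the average.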
Section D: Research and Exploration
Exercise 2.18 — Research Tyler Vigen's "Spurious Correlations" project (tylervigen.com). Find three correlations that you find particularly amusing or instructive. For each, identify the most likely explanation (coincidence, shared confounder, or methodological artifact).
Exercise 2.19 — Find a real business case study where an organization's failure can be traced to a correlation/causation error. Write a 500-word analysis describing: (a) what the organization believed, (b) what the data actually showed, (c) what went wrong in their reasoning, and (d) what they should have done differently.
Exercise 2.20 — Research one of the following alternative data science methodologies and compare it to CRISP-DM:
- (a) TDSP (Team Data Science Process) by Microsoft
- (b) OSEMN (Obtain, Scrub, Explore, Model, Interpret)
- (c) KDD (Knowledge Discovery in Databases)
In your comparison, address: How many phases does it define? Where does it agree with CRISP-DM? Where does it differ? What does it add that CRISP-DM lacks? What does CRISP-DM include that it doesn't?
Exercise 2.21 — Investigate how a company you admire (or your current employer) handles the "last mile" problem. Through publicly available information (case studies, blog posts, interviews, annual reports) or your own experience, describe: What practices do they use to translate analytical insights into action? What organizational structures support data-driven decision-making? What challenges remain?
Section E: Discussion and Debate
Exercise 2.22 — "In business, acting on correlation is often rational even when causation hasn't been established." Argue both sides of this statement. Under what circumstances is acting on correlation justified? When is it dangerous? How should the stakes of the decision influence the standard of evidence required?
Exercise 2.23 — The chapter argues that the data science mindset emphasizes comfort with uncertainty, while business culture prizes confidence and clear answers. Is this tension resolvable? How should a data scientist communicate probabilistic findings to an executive who demands a yes-or-no recommendation?
Exercise 2.24 — Consider the statement: "Data preparation consumes 60–80% of data science project time." Some argue this is a problem to be solved through automation. Others argue it's actually where the most important intellectual work happens. Take a position and defend it.
Exercise 2.25 — A colleague argues: "We don't need a formal process like CRISP-DM. Our best insights come from smart people playing with data — exploring freely, following their intuition, and seeing what emerges." How would you respond? Is there a role for unstructured exploration in data science? How would you balance the benefits of formal methodology against the potential cost of stifling creativity?
Section F: Integrated Application
Exercise 2.26 — You are the newly hired head of analytics at a regional grocery chain with 85 stores. Customer complaints have risen 20% over the past six months, and the CEO wants answers. Apply the complete CRISP-DM framework to this problem:
- (a) Business Understanding: Define the problem precisely. What is the business objective? What would success look like?
- (b) Data Understanding: List the data sources you would examine. What data might be available? What data might be missing?
- (c) Data Preparation: What data quality issues would you anticipate? What transformations might be needed?
- (d) Modeling: What analytical approach would you use? (Note: you don't need to build an actual model; describe your approach conceptually.)
- (e) Evaluation: How would you evaluate whether your analysis correctly identified the root cause?
- (f) Deployment: How would you communicate your findings? Who needs to act on them? What organizational changes might be required?
Exercise 2.27 — Select an industry you're interested in (healthcare, financial services, manufacturing, retail, education, etc.). For that industry:
- (a) Identify five examples of structured data commonly collected.
- (b) Identify five examples of unstructured data that exist but may be underutilized.
- (c) For each unstructured data source, propose a specific business question it could help answer.
- (d) Classify each business question as descriptive, diagnostic, predictive, or prescriptive.
- (e) Identify the biggest barrier to extracting value from each unstructured source.
Exercise 2.28 — Design a "data literacy assessment" for non-technical managers at your organization (or a hypothetical one). Create 10 questions that test the concepts covered in this chapter — not memorization, but applied understanding. Include an answer key with explanations for why the correct answer is correct and why the incorrect answers are wrong.