Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
How to use these exercises: Work through the sections in order. Each section builds on the previous one, moving from recall toward original thinking. There are no wrong answers to the open-ended questions — the goal is to practice thinking like a data scientist, even before you write a single line of code.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
These questions check whether you absorbed the core ideas from the chapter. Aim for clear, concise answers — a few sentences each.
Exercise 1.1 — Defining the field
In your own words, write a two-to-three sentence definition of data science. Then write a one-sentence definition that you could explain to a family member who has never heard the term. How do your two versions differ, and why?
Guidance
Your "family-friendly" version will likely drop jargon and emphasize outcomes (understanding the world, making better decisions) rather than methods. That shift is itself instructive: data science is ultimately about *answering questions with evidence*, not about any particular tool. A strong definition touches on three pillars: (1) extracting knowledge from data, (2) using a combination of statistics, computation, and domain expertise, and (3) communicating findings to inform decisions.
Exercise 1.2 — Drawing boundaries
The chapter distinguishes data science from several neighboring fields. Complete the following table from memory, then check your answers against the chapter text.
| Field | Primary focus | Overlaps with data science in... | Key difference from data science |
|---|---|---|---|
| Statistics | |||
| Machine Learning | |||
| Software Engineering | |||
| Business Intelligence | | | |
Guidance
- **Statistics** focuses on inference and uncertainty quantification under formal mathematical frameworks. It overlaps with data science in modeling and hypothesis testing. The key difference is that statistics historically emphasizes theory and proof, while data science emphasizes end-to-end problem solving including data wrangling and communication.
- **Machine Learning** focuses on algorithms that learn patterns from data and make predictions. It overlaps in the modeling stage. The difference is that ML is one *tool* within the data science lifecycle — data science also includes question formulation, data collection, cleaning, and storytelling.
- **Software Engineering** focuses on building reliable, maintainable software systems. It overlaps in writing code and building data pipelines. The difference is that software engineering optimizes for system reliability, while data science optimizes for insight and decision quality.
- **Business Intelligence** focuses on dashboards, reports, and descriptive summaries of historical data. It overlaps in data visualization and communication. The difference is that BI is primarily *backward-looking* (what happened?), while data science also tackles predictive and causal questions.
Exercise 1.3 — The lifecycle, from memory
Without looking back at the chapter, list the six stages of the data science lifecycle in order. For each stage, write one sentence describing what happens during that stage.
Guidance
1. **Question formulation** — You define a clear, answerable question tied to a real-world problem.
2. **Data collection** — You identify and gather the data needed to address the question, whether from databases, surveys, APIs, or other sources.
3. **Data cleaning** — You inspect, repair, and transform the raw data so it is consistent, complete enough, and suitable for analysis.
4. **Exploratory analysis** — You summarize and visualize the data to uncover patterns, spot anomalies, and refine your question.
5. **Modeling** — You apply statistical or machine learning techniques to answer the question formally — making predictions, estimating effects, or classifying observations.
6. **Communication** — You translate your findings into a form your audience can understand and act on, including visualizations, reports, or interactive tools.
The key insight: these stages are not strictly linear. You will often loop back — for example, modeling may reveal that your data needs more cleaning, or communication may surface a new question.
Exercise 1.4 — Three flavors of questions
Classify each of the following as descriptive, predictive, or causal. Briefly justify each classification.
- "How many customers churned last quarter?"
- "Will this patient be readmitted to the hospital within 30 days?"
- "Did the new checkout flow cause an increase in completed purchases?"
- "What is the average commute time for employees at our company?"
- "Which students are at risk of failing this course?"
- "Does offering free shipping lead to higher customer lifetime value?"
Guidance
1. **Descriptive** — It asks "what happened?" using historical data. No prediction or causal claim involved.
2. **Predictive** — It asks about a future outcome for a specific patient. The goal is forecasting, not explaining *why*.
3. **Causal** — The word "cause" is a giveaway, but even rephrased ("Did the new flow increase purchases?"), this is asking whether one thing *produced* another. Answering it properly requires experimental or quasi-experimental design.
4. **Descriptive** — A summary statistic about the current state of the world.
5. **Predictive** — It identifies students likely to experience a future event (failing). Note: it does *not* ask why they might fail.
6. **Causal** — "Does X lead to Y?" is a causal question. You would need a controlled experiment or careful observational methodology to answer it convincingly.
Exercise 1.5 — Structured vs. unstructured
For each data source below, state whether it is primarily structured, unstructured, or semi-structured. Then identify one challenge you would face in working with it.
- A hospital's electronic health records stored in a relational database
- A collection of 50,000 customer reviews scraped from an e-commerce site
- Server log files recording every HTTP request to a website
- A folder of 10,000 photographs from a wildlife camera trap
- A spreadsheet of monthly sales figures for 200 retail stores
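Before checking the guidance, it may help to see why "semi-structured" data still requires real parsing work. Here is a minimal sketch of extracting fields from one server log entry; the log line follows the Common Log Format, but the specific entry and field names are invented for illustration.

```python
import re

# One hypothetical log entry in the Common Log Format.
line = '203.0.113.7 - - [12/Mar/2025:14:32:01 +0000] "GET /index.html HTTP/1.1" 200 5321'

# Named groups for the recognizable parts: host, timestamp, request, status, size.
pattern = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

match = pattern.match(line)
record = match.groupdict()
print(record["host"])    # 203.0.113.7
print(record["status"])  # 200
```

The pattern works because log lines are *patterned* but not tabular: a real pipeline would also need to handle entries that break the expected format, which is exactly the challenge the exercise asks about.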
Guidance
1. **Structured** — Rows and columns with defined types. Challenge: missing values, inconsistent coding across departments (e.g., "M" vs. "Male"), and privacy restrictions.
2. **Unstructured** — Free-form text with no fixed schema. Challenge: extracting meaning from natural language — sarcasm, misspellings, varying length, multiple languages.
3. **Semi-structured** — Log files follow patterns (timestamps, status codes) but are not stored in neat tables. Challenge: parsing irregular formats, handling entries that break the expected pattern, and sheer volume.
4. **Unstructured** — Raw image data. Challenge: images vary in lighting, angle, and resolution; identifying species requires either manual labeling or a trained computer vision model.
5. **Structured** — Classic tabular data. Challenge: potential inconsistencies across stores (different fiscal calendars, missing months, currency differences if international).
Exercise 1.6 — Domain knowledge matters
The chapter emphasizes that data science is not just about algorithms — domain knowledge is essential. For each of the four anchor characters, identify one piece of domain knowledge they would need that a generic data scientist might lack.
- Elena (public health epidemiologist)
- Marcus (business analyst at a retail company)
- Priya (sports journalist covering basketball)
- Jordan (education researcher studying grading bias)
Guidance
Strong answers are specific, not vague. Examples:
- **Elena** needs to understand how diseases spread (incubation periods, transmission routes), how public health surveillance systems collect data, and what reporting delays are typical — otherwise she might mistake a reporting lag for a genuine decline in cases.
- **Marcus** needs to understand retail seasonality (holiday spikes, back-to-school), inventory management constraints, and how promotions interact with customer behavior — a model that ignores Black Friday will produce baffling results every November.
- **Priya** needs to understand basketball strategy (pick-and-roll efficiency, pace-adjusted statistics), the difference between regular season and playoff dynamics, and how rule changes affect statistical comparisons across eras.
- **Jordan** needs to understand how grading works in practice (rubrics vs. holistic scoring, grade inflation trends, the role of participation grades), the demographics of the student population, and the institutional context — a pattern that looks like bias might reflect course selection differences.
Exercise 1.7 — Data literacy for everyone
The chapter argues that data literacy is becoming a basic skill, not just for data scientists. Write a short paragraph (4-6 sentences) arguing for or against this claim. Support your position with at least one concrete example from everyday life.
Guidance
There is no single correct answer, but strong responses engage with specifics. For example, a "for" argument might point out that during a pandemic, every citizen encounters case-rate charts, vaccine efficacy numbers, and risk estimates — misinterpreting "95% effective" as "5% of vaccinated people will get sick" is a real and consequential data literacy failure. An "against" argument might note that expecting everyone to evaluate regression output is unrealistic and that better data *communication* by experts is a more practical solution. Either position can be well-argued.
Exercise 1.8 — Lifecycle in action
Choose one of the four anchor examples (Elena, Marcus, Priya, or Jordan) and map their project onto the six-stage data science lifecycle. For each stage, write one specific sentence describing what they would do — not a generic description, but something tied to their particular problem.
Guidance
Here is an example using **Jordan** (grading bias):
1. **Question** — "Do students from underrepresented racial groups receive systematically lower grades than peers with similar academic performance, after controlling for prior achievement and course difficulty?"
2. **Data collection** — Gather five years of anonymized transcript data, including grades, demographic information, standardized test scores, and course enrollment records from the university registrar.
3. **Data cleaning** — Handle missing demographic fields, reconcile different grading scales across departments (some use +/- and some do not), and remove courses with fewer than 10 enrolled students to protect anonymity.
4. **Exploratory analysis** — Compute average GPAs by demographic group and department, visualize grade distributions, and check whether apparent gaps persist after accounting for course difficulty.
5. **Modeling** — Fit a regression model predicting course grade from prior GPA, course difficulty, and demographic variables, looking at whether demographic coefficients are statistically and practically significant.
6. **Communication** — Present findings to the faculty senate with clear visualizations, confidence intervals, and a discussion of limitations (e.g., unobserved confounders like study time).
Your answer for a different character should be comparably specific.
Part B: Applied Analysis ⭐⭐
These problems give you a scenario and ask you to apply the chapter's frameworks. Think carefully before answering — many of these have subtleties.
Exercise 1.9 — Is this data science?
For each scenario below, decide whether it qualifies as a data science project. If yes, identify the primary lifecycle stage being performed. If no, explain what it is instead (e.g., software engineering, business intelligence, pure statistics).
- A team builds an automated pipeline that moves data from a payment processor into a data warehouse every night.
- A researcher publishes a mathematical proof that a new estimator converges faster than existing alternatives under certain conditions.
- A marketing analyst creates a dashboard showing last month's website traffic by region and device type.
- A nonprofit analyst investigates whether its after-school tutoring program actually improved students' test scores, using a matched comparison group.
- A data engineer optimizes a Spark cluster to reduce query processing time from 4 hours to 20 minutes.
Guidance
1. **Not data science** — This is **data engineering**. Building pipelines is essential infrastructure, but it is not itself asking or answering a question. (It supports stage 2, data collection, but it is not *doing* data science.)
2. **Not data science** — This is **mathematical statistics / statistical theory**. There is no data, no applied question, and no lifecycle. The proof may eventually *inform* data science methodology, but the activity itself is theoretical.
3. **Borderline — closer to business intelligence**. A descriptive dashboard summarizes what happened. It becomes data science if the analyst goes further: investigating *why* traffic dipped, or predicting next month's numbers. As described, it is standard BI/reporting.
4. **Yes — data science.** This hits multiple lifecycle stages: a clear causal question, data collection (test scores and program participation), analysis, and modeling (matched comparison). The primary stage being performed is modeling/analysis.
5. **Not data science** — This is **data engineering / infrastructure optimization**. Important work, but the goal is system performance, not insight.
Exercise 1.10 — Reformulating vague questions
Each question below is too vague to be actionable. Rewrite each as a specific, answerable data science question. State what data you would need to answer your revised question and classify it as descriptive, predictive, or causal.
- "Is our marketing working?"
- "Are schools getting worse?"
- "What's the deal with crime in this city?"
Guidance
Strong reformulations are *specific*, *measurable*, and tied to *available data*. Examples:
1. **Vague:** "Is our marketing working?"
   **Revised:** "Among customers who received our email campaign in Q3 2025, was the 30-day purchase rate higher than among a comparable group who did not receive it?" (Causal)
   **Data needed:** Email send logs, purchase records, customer demographics for matching.
2. **Vague:** "Are schools getting worse?"
   **Revised:** "How have average 8th-grade math proficiency rates on the NAEP changed in U.S. public schools between 2010 and 2025, and do trends differ by school funding level?" (Descriptive, with a comparative element)
   **Data needed:** NAEP scores by year and school, school funding data from NCES.
3. **Vague:** "What's the deal with crime in this city?"
   **Revised:** "Which three neighborhoods experienced the largest year-over-year increase in reported property crimes in 2025, and what demographic or economic factors distinguish them from neighborhoods where property crime declined?" (Descriptive + exploratory)
   **Data needed:** Geocoded crime reports from the police department, census demographic data by neighborhood.
Exercise 1.11 — The wrong question
Marcus's manager asks him: "Use data science to prove that our new loyalty program is increasing revenue." Identify at least two problems with this request from a data science perspective. Then suggest how Marcus could reframe the project to make it scientifically sound.
Guidance
Problems include:
1. **The conclusion is predetermined.** "Prove that it's increasing revenue" is advocacy, not inquiry. Good data science starts with a question, not a desired answer. Marcus should investigate whether the program is working, not set out to confirm that it is.
2. **Causal claims require causal methods.** Even if revenue went up after the loyalty program launched, that does not mean the program caused the increase. Seasonality, economic trends, or a simultaneous price change could explain the increase. Marcus would need a control group or a quasi-experimental design (e.g., comparing loyalty members to similar non-members).
3. **"Revenue" is underspecified.** Total revenue? Revenue per customer? Revenue from loyalty members specifically? The metric matters.
**Better framing:** "Among customers eligible for the loyalty program, did those who enrolled spend more over the following six months than comparable customers who did not enroll, after adjusting for pre-enrollment spending patterns and demographics?"
Exercise 1.12 — Matching lifecycle stages to activities
Below is a scrambled list of activities from Elena's public health project. Assign each to the correct lifecycle stage.
| Activity | Lifecycle stage? |
|---|---|
| Elena creates a choropleth map showing infection rates by county to present to city council. | |
| Elena downloads three years of hospital admission records from the state health department portal. | |
| Elena notices that 12% of records are missing the patient's zip code and decides to impute them using hospital location. | |
| Elena wonders whether a recent factory closure is connected to a spike in respiratory illness. | |
| Elena fits a time-series model to estimate how many ER visits to expect next month. | |
| Elena calculates the average age of patients and plots the distribution of admission dates to look for seasonal patterns. |
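One activity above, imputing missing zip codes from hospital location, can be sketched concretely. This is a simplified illustration with invented column names and values, and it assumes patients live near the hospital that treated them, an assumption a real analysis would need to justify.

```python
import pandas as pd

# Toy admission records; hospitals, zip codes, and column names are invented.
admissions = pd.DataFrame({
    "hospital": ["Mercy", "Mercy", "Lakeside", "Lakeside"],
    "patient_zip": ["60601", None, "60640", None],
})

# Each hospital's own zip code, used as a stand-in when the patient's is missing.
hospital_zip = {"Mercy": "60601", "Lakeside": "60640"}

admissions["patient_zip"] = admissions["patient_zip"].fillna(
    admissions["hospital"].map(hospital_zip)
)
print(admissions["patient_zip"].tolist())  # ['60601', '60601', '60640', '60640']
```

Note the judgment call baked into this one line of cleaning: imputed values look just like observed ones downstream, which is why documenting imputation decisions matters.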
Guidance
1. Choropleth map for city council → **Communication**
2. Downloading hospital records → **Data collection**
3. Imputing missing zip codes → **Data cleaning**
4. Wondering about the factory closure → **Question formulation**
5. Time-series model for next month → **Modeling**
6. Average age and seasonal pattern plots → **Exploratory analysis**
Exercise 1.13 — Stakeholder translation
Priya has finished analyzing player efficiency data and found that a particular basketball team's bench players contribute significantly more per minute than the starters in the fourth quarter. She needs to communicate this to three different audiences. For each audience below, describe in 2-3 sentences how she should present the finding differently.
- Her editor at the sports news outlet (who needs a story angle)
- The team's head coach (who needs actionable information)
- A general audience on social media (who wants entertainment)
Guidance
1. **Editor:** Lead with the narrative hook — "This team has a secret weapon hiding on its bench." Emphasize the surprising contrast between starter and bench performance. Provide enough statistical context to be credible but focus on the story.
2. **Coach:** Lead with the actionable insight — specific players, specific game situations, and magnitude of the effect. Include confidence intervals or caveats (small sample size? only in blowouts?). Skip the narrative framing.
3. **Social media:** Lead with a single striking statistic or visualization (e.g., "Their bench outscored their starters by 8 points per 36 minutes in crunch time"). Keep it short, visual, and shareable. Save the methodology for a thread or linked article.
The deeper point: communication is not one-size-fits-all. The same finding requires different framing for different audiences.
Part C: Real-World Application ⭐⭐-⭐⭐⭐
These exercises connect chapter concepts to the messy real world. You may need to think beyond the chapter text.
Exercise 1.14 — Data science in the headlines
Find a recent news article (from the past year) that describes a data-driven finding or decision. It could be about public health, sports, business, politics, climate, or anything else. Then answer:
- What question were the data scientists (or analysts, or researchers) trying to answer?
- Classify the question as descriptive, predictive, or causal.
- What data did they use? Was it structured or unstructured?
- Can you identify which lifecycle stage the article focuses on? Which stages are invisible in the reporting?
- What is one limitation or caveat that the article either mentions or should have mentioned?
Guidance
This is an open-ended exercise with no single correct answer. The goal is to practice applying the chapter's framework to real examples. Common pitfalls to avoid:
- Choosing an article that is about *technology* (e.g., "Company launches AI chatbot") rather than about *data-driven findings*. Look for articles that report a specific result or insight.
- Identifying the question too broadly. "They studied climate change" is not specific enough. "They estimated that Arctic sea ice volume declined 13% per decade between 1979 and 2024" is better.
- Most articles focus on the *findings* (modeling + communication stages) and make the data collection and cleaning invisible. Noting this is a good observation.
Exercise 1.15 — When data science goes wrong
Each scenario describes a real-world situation where a data science project produced misleading or harmful results. For each, identify the root cause of the failure and which lifecycle stage it most closely relates to.
- A predictive policing algorithm directed more patrols to historically over-policed neighborhoods, leading to more arrests there, which in turn made the algorithm even more confident those neighborhoods were high-crime areas.
- A hospital developed a model to predict which patients were most likely to need extra care. The model used total healthcare spending as a proxy for health needs — but because of systemic disparities in access, Black patients with the same severity of illness had historically spent less, so the model systematically underestimated their needs.
- A social media company reported that its new feature "increased engagement by 20%." Later investigation revealed that "engagement" included accidental clicks, rage-clicks on controversial content, and bots.
Guidance
1. **Root cause:** A feedback loop where biased historical data reinforced itself. **Lifecycle stage:** Data collection (the training data reflected enforcement patterns, not true crime rates) and Question formulation (the question "where will crime occur?" was operationalized as "where have arrests occurred?", which is a different question).
2. **Root cause:** A flawed proxy variable. Healthcare spending does not mean the same thing for all demographic groups. **Lifecycle stage:** Data cleaning / feature engineering — the team used a variable without interrogating whether it measured what they assumed it measured. Also a failure of domain knowledge.
3. **Root cause:** A poorly defined metric. **Lifecycle stage:** Question formulation — "Did engagement increase?" is only meaningful if "engagement" is defined in a way that captures genuine user value. This is also a communication failure if the 20% figure was presented without the definitional caveats.
Exercise 1.16 — The data you don't have
For each scenario, identify what important data is missing and explain how its absence could distort the analysis.
- Elena is studying whether air quality affects asthma hospitalization rates. She has hospital admission data and EPA air quality readings — but only from monitoring stations, which are unevenly distributed.
- Marcus is analyzing customer satisfaction using online reviews. His dataset contains 50,000 reviews from customers who voluntarily left feedback.
- Jordan is investigating grading bias by comparing grades across demographic groups. The university's records do not include information about students' socioeconomic background.
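Marcus's review scenario can be made concrete with a tiny simulation. The participation rule below is invented purely to illustrate the mechanism: if strongly opinionated customers are more likely to leave reviews, the review average drifts away from the population average even with tens of thousands of reviews.

```python
import random

random.seed(42)

# Simulate 50,000 customers with true satisfaction scores clipped to [1, 5].
population = [min(5.0, max(1.0, random.gauss(3.5, 1.0))) for _ in range(50_000)]

# Invented rule: the farther a score is from neutral (3), the more likely
# that customer is to leave a review.
def leaves_review(score):
    return random.random() < abs(score - 3.0) / 2.0

reviews = [s for s in population if leaves_review(s)]

def mean(xs):
    return sum(xs) / len(xs)

true_mean = mean(population)
review_mean = mean(reviews)
print(round(true_mean, 2), round(review_mean, 2))  # review mean is biased upward here
```

The gap between the two means is the selection bias: the 50,000 reviews are a large sample, but not a representative one, and no amount of volume fixes that.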
Guidance
1. **Missing:** Air quality data for areas far from monitoring stations — often lower-income or rural communities. **Distortion:** Elena might underestimate pollution exposure in areas without monitors, making air quality look less harmful than it actually is. This is an example of *measurement bias* driven by infrastructure gaps.
2. **Missing:** The experiences of customers who *did not* leave reviews. **Distortion:** People who leave reviews are disproportionately very satisfied or very dissatisfied (a phenomenon called *voluntary response bias* or *selection bias*). The 50,000 reviews may not represent the experience of the typical customer at all.
3. **Missing:** Socioeconomic status (family income, first-generation status, work hours). **Distortion:** If grading differences across racial groups are partly explained by socioeconomic factors (e.g., students working full-time have less study time), Jordan cannot distinguish racial bias from socioeconomic effects. This is an *omitted variable* problem — a confound that cannot be controlled for because it was never measured.
Exercise 1.17 — Structured and unstructured in the same project
Priya is writing a feature article about whether NBA referees make different foul calls depending on the score differential in the fourth quarter. Describe at least three different data sources she might use, classify each as structured or unstructured, and explain how she would need to combine them.
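As a concrete sketch of how two structured sources combine, here is a join on a shared game identifier; all IDs, column names, and values below are invented for illustration rather than taken from any real NBA dataset.

```python
import pandas as pd

# Toy play-by-play events; each row is one event with game context.
plays = pd.DataFrame({
    "game_id": [101, 101, 102],
    "event": ["foul", "shot", "foul"],
    "period": [4, 4, 4],
    "score_margin": [3, 3, 12],
})

# Toy referee assignments, one crew per game.
referees = pd.DataFrame({
    "game_id": [101, 102],
    "crew_chief": ["Referee A", "Referee B"],
})

# Keep fourth-quarter fouls and attach the officiating crew by game ID.
fouls = plays[plays["event"] == "foul"].merge(referees, on="game_id")
print(fouls[["game_id", "score_margin", "crew_chief"]])
```

The shared key (`game_id`) is what makes the combination possible; unstructured sources like video or transcripts cannot be joined this way and instead provide context and validation.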
Guidance
Possible data sources:
1. **Play-by-play data from the NBA's stats API** — Structured. Each row is an event (foul, shot, turnover) with timestamps, player IDs, and score at the time of the event.
2. **Referee assignment records** — Structured. Which referees officiated each game. Available from official NBA data.
3. **Game video footage** — Unstructured. Priya might review footage of controversial calls to provide narrative examples for her article, or to verify that the play-by-play data accurately categorizes fouls.
4. **Post-game press conference transcripts or referee pool reports** — Unstructured text. Might contain referee or coach comments about officiating standards.
5. **Social media commentary** — Unstructured. Fan and analyst reactions could indicate which games had perceived officiating issues, providing leads for investigation.
To combine them, Priya would use the structured play-by-play data as her primary analytical dataset, join it with referee assignments by game ID, and use the unstructured sources (video, transcripts) for context, validation, and storytelling. The structured data answers "what happened statistically" while the unstructured data answers "what does it look like and what do people think about it."
Exercise 1.18 — Ethics preview
The chapter mentions Jordan's investigation into grading bias. Even though ethics is covered in depth later in the book, consider: what are two ethical considerations Jordan should think about before beginning data collection? Why is it important to think about ethics at the question formulation stage, not just after results are in?
Guidance
Two ethical considerations (among many possible answers):
1. **Privacy and re-identification risk.** Even with anonymized data, small class sizes or unique demographic combinations could allow someone to identify individual students or instructors. Jordan needs to plan for this *before* accessing the data — for example, by suppressing results for groups smaller than a threshold.
2. **Potential for harm regardless of outcome.** If the study finds evidence of bias, it could be used to punish individual instructors without context. If it finds no evidence, it could be used to dismiss legitimate student concerns. Jordan should think about how the results will be interpreted and by whom.
Why think about ethics early? Because the *question you ask* shapes everything downstream. If Jordan frames the question as "Which professors are biased?" rather than "Are there systemic patterns that suggest structural inequity?", the analysis, findings, and impact will all be different. Ethical data science begins at stage one.
Part D: Synthesis & Critical Thinking ⭐⭐⭐
These problems require you to connect ideas across sections, challenge assumptions, or construct original arguments.
Exercise 1.19 — The lifecycle is a lie (sort of)
The chapter presents the data science lifecycle as six sequential stages. But the chapter also notes that real projects are rarely linear. Write a short essay (one to two paragraphs) arguing that the "lifecycle" metaphor is misleading. What would be a better metaphor? Then write a counter-argument: why is the sequential model still useful even if it's imperfect?
Guidance
**Argument against linearity:** Real projects involve constant backtracking. You start modeling and realize your data is dirtier than you thought — back to cleaning. You present results and a stakeholder asks a new question — back to formulation. You collect data and discover it doesn't exist in the form you imagined — back to question refinement. A better metaphor might be a *web*, a *spiral*, or a *conversation* — something that captures the iterative, non-linear nature of the work.
**Counter-argument:** The sequential model is a *pedagogical scaffold*. Beginners need a mental map before they can navigate the territory. Saying "it's all connected and iterative" is true but unhelpful to someone who doesn't yet know what the stages *are*. The linear model teaches you the vocabulary and the components; experience teaches you that they interleave. Think of it like learning to drive: you first learn the steps (mirrors, signal, maneuver) in sequence, even though experienced drivers do them fluidly and simultaneously.
Exercise 1.20 — Cross-domain transfer
Elena (public health) and Marcus (business) seem to work in completely different domains. Identify three specific techniques, concepts, or challenges that they share despite their different fields. For each, explain why the commonality exists.
Guidance
Possible shared elements (many valid answers exist):
1. **Dealing with missing data.** Elena's hospital records have missing zip codes; Marcus's customer database has missing purchase histories for customers who use cash. Both must decide whether to discard incomplete records, impute missing values, or adjust their analysis — and both must worry about whether the data is missing *at random* or for a systematic reason.
2. **Distinguishing correlation from causation.** Elena wants to know if a factory closure *caused* a health spike; Marcus wants to know if a loyalty program *caused* revenue growth. Both face the challenge that other variables could explain the observed patterns. Both need either experimental designs or careful statistical adjustments.
3. **Communicating uncertainty to non-technical stakeholders.** Elena presents to city council; Marcus presents to executives. Both must convey not just their best estimate but how confident they are — and both face audiences that may prefer certainty to nuance.
The commonality exists because the *structure* of data science problems is similar across domains, even when the *content* differs. This is precisely why data science is a discipline, not just "statistics applied to business" or "statistics applied to health."
Exercise 1.21 — Building a flawed argument
A news headline reads: "Study shows people who eat chocolate daily live 3 years longer." A friend cites this as proof that they should eat more chocolate. Using concepts from this chapter, construct a careful critique. Your critique should:
- Identify the type of question being asked (and whether the headline matches it)
- Identify at least two plausible alternative explanations
- Suggest what kind of study would actually support the causal claim
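The confounding at the heart of this critique can be demonstrated with a simulation. In the toy world below, a hidden variable (wealth) drives both daily chocolate eating and lifespan, and chocolate itself has zero causal effect, yet a naive group comparison still shows chocolate eaters living several years longer. All numbers are invented.

```python
import random

random.seed(0)

# Hidden confounder: wealth influences both habits and health.
n = 10_000
wealth = [random.gauss(0, 1) for _ in range(n)]

# Wealthier people are more likely to eat chocolate daily (no health effect).
eats_chocolate = [w + random.gauss(0, 1) > 0.5 for w in wealth]

# Lifespan depends on wealth plus noise; chocolate does not appear at all.
lifespan = [78 + 4 * w + random.gauss(0, 5) for w in wealth]

def mean(xs):
    return sum(xs) / len(xs)

with_choc = [l for l, c in zip(lifespan, eats_chocolate) if c]
without = [l for l, c in zip(lifespan, eats_chocolate) if not c]

# Chocolate eaters "live longer" only because they are wealthier.
print(round(mean(with_choc) - mean(without), 1))
```

Because the data-generating process is visible here, we know the gap is pure confounding; with observational data in the real world, that is exactly what cannot be seen, which is why randomization is the gold standard for causal claims.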
Guidance
**Type of question:** The headline implies a *causal* claim ("eating chocolate leads to longer life"), but the underlying study is almost certainly *observational* and at best supports a *descriptive* or *correlational* finding ("people who eat chocolate daily also tend to live longer").
**Alternative explanations:**
1. **Wealth as a confound.** People who eat chocolate daily may be wealthier (chocolate is a discretionary purchase), and wealth is strongly correlated with longevity through better healthcare, lower stress, and healthier environments. The chocolate is incidental.
2. **Healthy-user bias.** People who are already health-conscious may allow themselves a small daily indulgence like dark chocolate as part of an overall healthy lifestyle that includes exercise and good nutrition. The lifestyle — not the chocolate — drives longevity.
3. **Survivorship bias.** If the study only surveyed living people about their habits, it cannot account for chocolate-eaters who already died.
**Better study design:** A randomized controlled trial in which participants are randomly assigned to eat chocolate daily or not, with long-term follow-up. This is practically difficult (you cannot easily randomize and monitor chocolate consumption for decades), which is precisely why the causal claim is hard to establish.
Exercise 1.22 — Designing a question for each type
Choose a single topic that interests you — climate, music, health, education, anything. Write three data science questions about that topic: one descriptive, one predictive, and one causal. For each question, identify one specific dataset or data source that could help answer it.
Guidance
Here is an example using **urban transportation**:

1. **Descriptive:** "What percentage of commuters in Chicago used public transit vs. personal vehicles vs. bicycles in 2025?" Data source: American Community Survey (Census Bureau).
2. **Predictive:** "Based on current trends, how many daily Divvy bike-share rides should Chicago expect during July 2026?" Data source: Divvy historical trip data (publicly available).
3. **Causal:** "Did the introduction of a protected bike lane on Milwaukee Avenue cause an increase in cycling commuters on that corridor?" Data source: Before-and-after bike counter data on Milwaukee Avenue and on a comparable corridor without a new lane (as a control).

Your answers should be comparably specific. Vague questions like "What's happening with transportation?" do not count.

Part M: Mixed Practice
This section is intentionally omitted for Chapter 1. In later chapters, Part M will include problems that blend current and previous material to build cumulative fluency. Since this is the first chapter, there is no prior material to mix in yet. Starting in Chapter 2, expect 4-6 mixed practice problems here.
Part E: Research & Extension ⭐⭐⭐⭐
These are open-ended projects that go beyond the chapter. They have no single correct answer. Spend 30-60 minutes on one, or return to them later in the course.
Exercise 1.23 — Data science origin story
Research the history of the term "data science." When did it first appear? Who popularized it? How has its meaning shifted over time? Write a 300-500 word summary that traces the term from its origins to its current usage. Your summary should reference at least three specific milestones or publications.
Guidance
Key milestones to look for (your research may surface others):

- **1962:** John Tukey's paper "The Future of Data Analysis" argued that statistics should be more empirical and computational — a vision that anticipates modern data science.
- **1996:** The International Federation of Classification Societies conference used the term "Data Science" in its title.
- **1997:** C.F. Jeff Wu proposed renaming statistics as "data science" in his inaugural lecture.
- **2001:** William S. Cleveland published "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics."
- **2008-2012:** The explosion of "big data," Hadoop, and the Harvard Business Review calling data scientist "the sexiest job of the 21st century" (2012, Davenport and Patil) brought the term into mainstream usage.
- **2010s-present:** The meaning broadened to encompass machine learning, AI, and analytics, leading to the ongoing debate about what "data science" actually means.

A strong summary does not just list events — it tells a *story* about how the field's identity has evolved.

Exercise 1.24 — Interview a data practitioner
Find someone who works with data professionally — a data scientist, data analyst, business analyst, data engineer, statistician, or researcher. (They do not need the job title "data scientist.") Ask them the following questions and write up their responses in 1-2 pages:
- How would you describe what you do to someone outside your field?
- What does a typical project look like for you? Does it follow a lifecycle like the one described in this chapter?
- What is the hardest part of your job? (Most people do not say "building models.")
- What do you wish you had known before entering this field?
Guidance
This exercise has no "correct" answer — its purpose is to connect textbook concepts to lived experience. Common themes you may hear:

- Most practitioners spend far more time on data cleaning and communication than on modeling.
- The lifecycle exists in practice, but it is messier and more iterative than any diagram suggests.
- Domain knowledge and communication skills are consistently undervalued by newcomers and consistently cited as critical by experienced practitioners.
- Many data professionals did not follow a linear career path into the field.

If you cannot find someone to interview in person, look for published interviews, podcast episodes, or "day in the life" blog posts by data professionals.

Exercise 1.25 — Design your own anchor example
The chapter uses four anchor characters (Elena, Marcus, Priya, Jordan) whose projects recur throughout the book. Create a fifth anchor character for a domain not already covered. Your character description should include:
- A name and professional context (who are they, where do they work?)
- A specific question they want to answer using data
- What kind of data they would need (structured? unstructured? both?)
- Which lifecycle stage they would find most challenging, and why
- One ethical consideration relevant to their project
Write your character description in 200-300 words, formatted as a short narrative paragraph followed by a bulleted breakdown.
Guidance
A strong answer picks a domain with interesting data science challenges and creates a character whose problem is specific enough to be realistic. Avoid domains that are too similar to the existing four (health, business, sports, education). Good candidates include:

- **Environmental science** (e.g., a marine biologist tracking ocean plastic)
- **Urban planning** (e.g., a city analyst optimizing bus routes)
- **Criminal justice reform** (e.g., a policy analyst evaluating bail reform)
- **Agriculture** (e.g., a farmer using sensor data to optimize irrigation)
- **Music industry** (e.g., a label analyst predicting which songs will chart)
- **Humanitarian aid** (e.g., an NGO worker allocating disaster relief resources)

The character should feel like a real person with a specific problem — not a generic archetype.

End of Chapter 1 Exercises. If you found Parts A and B comfortable and Parts C and D challenging, you are in exactly the right place. The goal is not to get everything right — it is to start building the habit of thinking critically about data.