Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Instructions: This quiz tests your understanding of Chapter 1. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. Which of the following best describes data science?
- (A) Building machine learning models to make predictions
- (B) An interdisciplinary field that uses data to answer questions and inform decisions through a process spanning collection, analysis, modeling, and communication
- (C) The statistical analysis of large datasets using computers
- (D) Creating visualizations and dashboards to summarize business metrics
Answer
**Correct: (B)**
- **(A)** is too narrow — machine learning is one tool within data science, not the whole field. A project that never builds an ML model can still be data science (e.g., a descriptive analysis or a controlled experiment).
- **(B)** captures the interdisciplinary nature, the focus on answering questions, and the end-to-end process from data to decisions. This matches the chapter's definition.
- **(C)** is closer to computational statistics. It misses the emphasis on domain knowledge, communication, and the full lifecycle.
- **(D)** describes business intelligence, which is one activity that overlaps with data science but is not synonymous with it.
Question 2. A hospital administrator says: "Patients readmitted within 30 days cost us $2.1 million last year." What type of question does this statement answer?
- (A) Predictive
- (B) Causal
- (C) Descriptive
- (D) Prescriptive
Answer
**Correct: (C)** This statement summarizes *what happened* using historical data — a classic descriptive finding. It does not predict future readmissions (predictive), claim that something *caused* the readmissions (causal), or recommend a course of action (prescriptive). Descriptive questions are the foundation: you need to know what happened before you can predict or explain why.
Question 3. Which lifecycle stage do most practicing data scientists report spending the most time on?
- (A) Building and tuning models
- (B) Collecting and cleaning data
- (C) Communicating results to stakeholders
- (D) Formulating the initial question
Answer
**Correct: (B)** Surveys consistently show that data scientists spend 60-80% of their time on data collection and cleaning. This surprises many beginners, who assume modeling is the main event. In reality, real-world data is messy — missing values, inconsistent formats, duplicates, errors — and getting it into analyzable shape is the bulk of the work. As the chapter notes, this is the least glamorous but most critical stage.
Question 4. Elena, the public health researcher, wants to know whether a factory closure is connected to a spike in respiratory illness. This is an example of a:
- (A) Descriptive question
- (B) Predictive question
- (C) Causal question
- (D) Classification question
Answer
**Correct: (C)** The word "connected" might suggest correlation, but Elena's underlying question is causal: did the factory closure (or the conditions surrounding it) *cause* the respiratory illness spike? Answering this requires more than showing the two events coincide — she would need to rule out alternative explanations (seasonal flu, other pollution sources, changes in reporting). Causal questions are the hardest type to answer convincingly.
Question 5. Which of the following is the best example of unstructured data?
- (A) A CSV file containing monthly temperatures for 50 cities
- (B) A relational database table of employee records
- (C) A collection of 10,000 handwritten customer complaint letters that have been scanned as images
- (D) A JSON file containing product catalog information with nested categories
Answer
**Correct: (C)**
- **(A)** is structured — rows, columns, numeric values.
- **(B)** is structured — a relational table with a defined schema.
- **(C)** is unstructured — scanned images of handwritten text have no predefined schema, no rows or columns. Extracting information requires OCR and possibly handwriting recognition.
- **(D)** is semi-structured — JSON has some organization (keys and values) but is not a flat table with a fixed schema.
Question 6. Marcus discovers that customers who buy premium coffee also tend to buy organic milk. He concludes that stocking premium coffee causes people to buy organic milk. What error is Marcus making?
- (A) Survivorship bias
- (B) Confusing correlation with causation
- (C) Selection bias
- (D) Overfitting his model
Answer
**Correct: (B)** The fact that two purchases co-occur (correlation) does not mean one causes the other. A likely explanation is that a *third variable* — such as income level or health-consciousness — drives both purchases. Wealthier or health-conscious customers may independently prefer both premium coffee and organic milk. This is one of the most common errors in data interpretation, and the chapter emphasizes it as a key reason the descriptive-predictive-causal distinction matters.
Question 7. Why does the chapter emphasize domain knowledge as a core component of data science?
- (A) Because data scientists need to be experts in every field they work in
- (B) Because domain knowledge helps you ask better questions, choose appropriate methods, interpret results correctly, and avoid misleading conclusions
- (C) Because statistical methods do not work without domain-specific modifications
- (D) Because employers prefer to hire people with industry experience over people with technical skills
Answer
**Correct: (B)**
- **(A)** is too strong — data scientists do not need to be *experts* in every domain, but they need enough understanding to collaborate effectively with domain experts.
- **(B)** captures the chapter's argument precisely. Domain knowledge affects every lifecycle stage: formulating meaningful questions, recognizing data quality issues, selecting appropriate models, and interpreting results in context.
- **(C)** is false — statistical methods are general, though their *application* benefits from domain knowledge.
- **(D)** may be true in some job markets but is not the chapter's argument. The point is about analytical quality, not hiring preferences.
Question 8. Priya wants to predict which NBA players will be named All-Stars next season. She has ten years of player statistics. Which lifecycle stage is she about to enter?
- (A) Question formulation
- (B) Data collection
- (C) Modeling
- (D) Communication
Answer
**Correct: (C)** Priya already has her question ("Who will be named All-Star?") and her data (ten years of statistics). Assuming the data has been cleaned and explored (stages 3 and 4), she is now ready to build a predictive model — the modeling stage. She has not yet produced results to communicate, so (D) is premature.
Question 9. Which statement best captures the relationship between data science and machine learning?
- (A) They are different names for the same field
- (B) Machine learning is a broader field that includes data science as a subfield
- (C) Machine learning is one set of tools used within the data science lifecycle, primarily at the modeling stage
- (D) Data science is the theoretical foundation, and machine learning is the practical application
Answer
**Correct: (C)** Data science encompasses the entire lifecycle from question to communication. Machine learning provides powerful algorithms for the modeling stage — but a data science project might use simple summary statistics instead of ML and still be valid data science. Conversely, ML research (developing new algorithms, proving convergence properties) can exist independently of the data science lifecycle. They overlap significantly but are not identical, and neither contains the other entirely.
Question 10. Jordan discovers that students in 8 a.m. classes receive lower average grades than students in afternoon classes. Before concluding anything, Jordan should first:
- (A) Build a more complex model with more variables
- (B) Consider whether the students who enroll in 8 a.m. classes differ systematically from those who choose afternoon classes
- (C) Collect more data from additional semesters
- (D) Present the finding to the faculty senate immediately
Answer
**Correct: (B)** Before drawing conclusions, Jordan needs to consider *confounding variables*. Students who enroll in early classes may differ in important ways — freshmen with less schedule flexibility, students who work evening jobs, or students who registered late and got leftover slots. The grade difference might reflect *who takes early classes*, not *what the early time slot does to grades*. This is a core data science habit: always ask "what else could explain this pattern?" before making claims.
- **(A)** is premature — adding complexity without understanding the data can make things worse.
- **(C)** might help eventually but does not address the immediate issue of confounding.
- **(D)** is irresponsible without further investigation.
Section 2: True/False with Justification (4 questions, 5 points each)
For each statement, indicate whether it is True or False, then write 1-2 sentences justifying your answer. A correct True/False answer without justification earns only 2 of 5 points.
Question 11. "A project that uses only simple bar charts and averages — no machine learning, no regression — cannot be considered data science."
Answer
**False.** Data science is defined by the *process* (question, data, analysis, communication) and the *purpose* (extracting insight to inform decisions), not by the sophistication of the techniques used. A well-executed descriptive analysis with simple visualizations that answers a meaningful question and communicates findings effectively is absolutely data science. Some of the most impactful data science work — John Snow's cholera map, Florence Nightingale's mortality diagrams — used only basic methods.
Question 12. "The question formulation stage is the most important stage of the data science lifecycle because every subsequent stage depends on it."
Answer
**True (with nuance).** If you ask the wrong question, even perfect data and brilliant models will produce irrelevant or misleading answers. The question shapes what data you collect, how you clean it, what methods you use, and how you interpret results. That said, "most important" is somewhat subjective — a perfectly formulated question is useless if the data collection or communication fails catastrophically. The stronger claim is that question formulation has the highest *leverage*: getting it wrong cascades through everything downstream.
Question 13. "Structured data is always more useful for data science than unstructured data."
Answer
**False.** The usefulness of data depends on the question being asked, not on whether the data is structured. If you are trying to understand customer sentiment, a collection of free-text reviews (unstructured) may be far more informative than a table of purchase amounts (structured). Many of the richest data sources in the world — text, images, audio, video — are unstructured. The challenge is that unstructured data requires more processing to analyze, but "harder to work with" does not mean "less useful."
Question 14. "If a predictive model achieves 95% accuracy, it is ready to be deployed in a real-world decision-making system."
Answer
**False.** High accuracy alone does not mean a model is ready for deployment. Several critical considerations remain: (1) *What does it get wrong?* In a medical screening context, 95% accuracy could mean 5% of sick patients are missed — potentially catastrophic. (2) *Is the accuracy consistent across subgroups?* The model might be 99% accurate for one demographic group and 70% for another. (3) *Was it evaluated on truly representative, held-out data?* If accuracy was measured on the training set, it may be wildly optimistic. (4) *Are the consequences of errors acceptable?* Accuracy is a single number that hides many important details. Responsible deployment requires understanding error types, fairness, and the decision context.
Section 3: Short Answer (3 questions, 6 points each)
Answer in 2-4 complete sentences. Clarity and precision matter more than length.
Question 15. Explain the difference between a predictive question and a causal question using a single concrete example. Why does the distinction matter in practice?
Answer
**Sample answer:** A predictive question about employee attrition might ask, "Which employees are most likely to leave in the next six months?" while a causal question would ask, "Does offering flexible work schedules reduce employee turnover?" The predictive question seeks to *forecast* an outcome — it does not care *why* people leave, only *who* is likely to. The causal question asks whether a specific intervention *produces* a specific effect. The distinction matters because the methods differ: prediction can work with correlations and patterns, while causal claims require controlled experiments or careful quasi-experimental designs to rule out alternative explanations. Treating a predictive finding as causal (e.g., "our model says unhappy employees leave, so making them happy will fix retention") can lead to misguided interventions.
Question 16. The chapter presents four anchor characters working in different domains. Pick any two and explain one challenge they share despite working in different fields.
Answer
**Sample answer:** Elena (public health) and Jordan (education) both face the challenge of *missing or biased data that is not missing at random*. Elena's hospital records may underrepresent populations that avoid hospitals due to cost or immigration status, meaning her data systematically excludes the people most affected by the health issue she studies. Similarly, Jordan's grading data may not capture students who dropped courses before receiving a grade — and if dropping out is correlated with the bias Jordan is investigating, the remaining data presents an incomplete and potentially misleading picture. In both cases, the absence of data is itself informative, and ignoring it would bias the conclusions.
Question 17. Why is the communication stage of the data science lifecycle considered just as important as the modeling stage? Give a specific example of what can go wrong when communication fails.
Answer
**Sample answer:** The modeling stage produces findings, but the communication stage is what makes those findings *actionable*. A brilliant model that nobody understands, trusts, or acts on has zero real-world impact. For example, a data science team at a hospital might build a model that accurately predicts which patients are at risk of sepsis, but if the results are presented as raw probability scores in a cluttered dashboard that nurses cannot interpret during a busy shift, the model will be ignored and patients will not benefit. Effective communication means translating technical results into the language, format, and level of detail appropriate for the audience — and it often determines whether a data science project succeeds or fails in practice.
Section 4: Applied Scenario (2 questions, 22 points total)
These problems present realistic situations. There is no single "correct" answer — you are graded on the quality of your reasoning, not on reaching a specific conclusion.
Question 18. The School Board Scenario
A school board member reads a report showing that schools with more computers per student have higher average test scores. She proposes spending $5 million to buy laptops for every student, arguing: "The data clearly shows that more computers lead to better scores."
(a) What type of question is the school board member treating the data as answering? What type of question does the data actually answer? (2 points)
(b) Identify two plausible alternative explanations for the observed relationship between computers and test scores. (4 points)
(c) Describe a study design that could more convincingly test whether providing laptops actually improves test scores. Specify what data you would collect and what comparison you would make. (4 points)
(d) How should the data scientist presenting this information to the school board frame the findings responsibly? (2 points)
Answer
**(a)** The school board member is treating the data as answering a **causal** question ("more computers *lead to* better scores"). The data actually answers a **descriptive** question ("schools with more computers *tend to have* higher scores"). The gap between correlation and causation is the central issue.
**(b)** Two plausible alternative explanations:
1. **Wealth as a confound.** Schools with more computers per student are likely in wealthier districts. Wealthier districts also have higher-paid teachers, more parental involvement, better facilities, and students with more resources at home — any of which could drive higher test scores independently of the computers.
2. **Self-selection.** Schools that invest in technology may also be schools with more innovative or motivated leadership that *also* invests in teacher training, curriculum development, and other factors that improve scores. The computers are a *marker* of a well-run school, not the *cause* of good outcomes.
**(c)** A **randomized controlled trial**: randomly select 50 schools from a pool of similar schools (matched on income, demographics, and baseline test scores). Give 25 of them the laptop program; the other 25 serve as a control group. Compare test score changes over one to two years between the two groups. Random assignment ensures that, on average, the two groups are similar in all ways except the laptop intervention, so differences in outcomes can be more confidently attributed to the laptops.
**(d)** The data scientist should present the correlation honestly, *explicitly state that it does not establish causation*, and name the most likely confounding factors. They should frame the finding as "worth investigating further" rather than "proof that laptops work." If the board is considering a $5 million investment, the data scientist might recommend a smaller pilot study (see part c) before committing the full budget.
**Rubric:**
- (a) 2 points: correctly identifies the causal vs. descriptive gap.
- (b) 4 points: 2 points per plausible, well-explained alternative.
- (c) 4 points: describes a study with a comparison group and random assignment (or a strong quasi-experimental design). Loses points for vague designs ("just study it more") or designs that do not address confounding.
- (d) 2 points: emphasizes honest framing and the distinction between correlation and causation.
Question 19. The Hiring Algorithm Scenario
A tech company builds a model to screen job applications. The model is trained on data from the past five years of hiring decisions: it learns which resume features are associated with candidates who were hired and rated as successful. The model achieves 88% accuracy on a test set. The company plans to use it to automatically reject the bottom 50% of applicants.
(a) Map this project onto the data science lifecycle. Identify which stages have been completed and which appear to have been skipped or done poorly. (4 points)
(b) The training data reflects five years of human hiring decisions. Why is this a problem? Identify a specific way the model could perpetuate or amplify existing biases. (4 points)
(c) A colleague says, "88% accuracy means the model is fair — it's objective because it's math." Write a 2-3 sentence response explaining why this reasoning is flawed. (2 points)
Answer
**(a)** Lifecycle stages:
- **Question formulation:** Partially done — "Which candidates should we hire?" is a question, but it has been operationalized narrowly as "Which candidates *look like people we hired before*?" This is a subtly different (and more problematic) question.
- **Data collection:** Completed — five years of hiring data.
- **Data cleaning:** Unknown from the description, but presumably done to some degree.
- **Exploratory analysis:** Appears to have been skipped or at least not mentioned. The team should have examined whether past hiring decisions show patterns of bias before using them as training labels.
- **Modeling:** Completed — the model was built and evaluated on accuracy.
- **Communication:** Poorly handled — the plan to auto-reject 50% of applicants is a high-stakes deployment decision being made without adequately communicating limitations to stakeholders.
Key missing stage: the team does not appear to have critically examined whether their *historical hiring data* is a reliable ground truth. If past hiring was biased, the model learns to replicate that bias.
**(b)** If the company's past hiring decisions were biased — for example, if human reviewers unconsciously favored candidates from prestigious universities, penalized resume gaps (which disproportionately affect women who took parental leave), or preferred names that sound like the dominant cultural group — then the model will learn these patterns as "features of successful candidates." Specific example: if few women were hired into engineering roles historically, the model may learn that female-associated resume features (women's colleges, women-in-tech organizations) are negative predictors, systematically screening out qualified women. The model does not *create* bias — it *automates and scales* existing bias.
**(c)** Sample response: "Accuracy measures how often the model matches historical decisions, but if those historical decisions were biased, then high accuracy means the model is *accurately reproducing bias*. A model that perfectly replicated discriminatory hiring would score 100% accuracy against that data. Fairness requires examining *who* the model gets wrong and *how*, not just the overall error rate — an 88% accurate model that rejects 30% of qualified women but only 5% of qualified men is neither fair nor objective, regardless of its aggregate accuracy."
**Rubric:**
- (a) 4 points: identifies at least 4 stages, correctly notes which were done and which were missing or weak. Strongest answers note that the *question itself* was poorly formulated.
- (b) 4 points: clearly explains the feedback loop from biased history to biased model. Gives a specific, concrete example. Loses points for vague answers ("it could be biased somehow") without mechanisms.
- (c) 2 points: directly challenges the accuracy-equals-fairness assumption. Explains that accuracy is measured against potentially biased labels. Strongest answers note that aggregate accuracy can hide disparate impact on subgroups.
Section 5: Code / Analysis
This section is intentionally omitted for Chapter 1. No programming concepts have been introduced yet. Starting in Chapter 3 (when Python is introduced), this section will include code-reading, debugging, and short coding tasks. For now, your analytical muscles are getting a workout in Sections 1-4 — the code will come soon enough.
Scoring & Next Steps
| Section | Questions | Points | Your Score |
|---|---|---|---|
| 1. Multiple Choice | 10 | 40 | ___ / 40 |
| 2. True/False with Justification | 4 | 20 | ___ / 20 |
| 3. Short Answer | 3 | 18 | ___ / 18 |
| 4. Applied Scenario | 2 | 22 | ___ / 22 |
| Total | 19 | 100 | ___ / 100 |
Note
The quiz contains 19 scored questions (Questions 1-19). The total of 20 listed in the frontmatter includes the omitted Section 5, which will be active in later chapters.
| Score | Assessment | Recommended Action |
|---|---|---|
| 90-100 | Excellent | You have a strong grasp of the foundational concepts. Proceed to Chapter 2. Consider tackling the Extension exercises (Part E) from the exercises file for additional depth. |
| 70-89 | Proficient | You understand the core ideas. Review any questions you missed, paying special attention to the explanations. If you struggled with Section 4 (scenarios), revisit the chapter sections on question types and the correlation-causation distinction before moving on. |
| 50-69 | Developing | Revisit the chapter, focusing on: the data science lifecycle, the three types of questions (descriptive/predictive/causal), and the distinction between data science and neighboring fields. Then retake the quiz. The concepts in Chapter 1 are the foundation for everything that follows. |
| Below 50 | Needs review | Re-read Chapter 1 carefully, taking notes. Work through Part A of the exercises file before retaking this quiz. Consider discussing the material with a study partner or instructor. There is no shame in taking extra time here — these ideas are genuinely new, and getting them right now pays dividends throughout the course. |
Remember: this quiz measures understanding of concepts, not memorization of definitions. If you can explain the ideas in your own words and apply them to new situations, you are ready to move forward — even if you did not get every multiple-choice question right.