Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field

"The goal is to turn data into information, and information into insight." — Carly Fiorina, former CEO of Hewlett-Packard


Chapter Overview

Somewhere right now, a public health analyst is staring at a spreadsheet of vaccination records, trying to figure out why one neighborhood's rates are half of another's. A small business owner is looking at six months of receipts, wondering whether they should stock up for a seasonal rush or brace for a slump. A sports journalist is pulling play-by-play data from an NBA database, trying to settle a bar argument about whether three-point shooting has really changed basketball. And a college student is downloading their university's grade distributions, suspicious that something unfair might be hiding in the numbers.

None of these people would call themselves data scientists. Not yet. But every single one of them is doing what data scientists do: taking a question they care about, finding data that might hold the answer, and trying to make sense of what the data says.

That's data science. And if you've ever Googled a question, compared prices across websites, or checked the weather forecast and decided to bring an umbrella, you've already done a version of it. This chapter is about making that instinct intentional, structured, and powerful.

In this chapter, you will learn to:

  1. Define data science and distinguish it from statistics, machine learning, and software engineering (all paths)
  2. Identify the six stages of the data science lifecycle and explain what happens at each stage (all paths)
  3. Classify real-world problems as data science questions, distinguishing descriptive, predictive, and causal questions (all paths)
  4. Evaluate a news article's data claims by applying basic critical thinking about sources, samples, and conclusions (standard + deep dive paths)
  5. Articulate a personally meaningful question that data science could help answer (all paths)

📝 Note — Learning path annotations: Objectives marked (all paths) are essential for every reader. Those marked (standard + deep dive) can be skimmed on the Fast Track but are important for deeper understanding. See "How to Use This Book" for full path descriptions.


1.1 Four People, Four Questions

Before we define anything, let me introduce you to four people. They're going to be with us throughout this entire book, and their stories will make every concept we learn feel grounded in something real.

📝 Note: Elena, Marcus, Priya, and Jordan are composite characters — illustrative examples based on common real-world data science applications. Their specific projects, datasets, and findings are constructed for pedagogical purposes, but the kinds of work they do represent genuine practices in their respective fields.

Elena: The Public Health Analyst

Elena works for a county public health department. In early 2021, her director dropped a question on her desk that would consume the next two years of her career: Why are COVID-19 vaccination rates so different across neighborhoods in our county — and what can we do about it?

Elena had the data. The state health department published vaccination records by ZIP code. The Census Bureau had demographic information. She could cross-reference vaccination rates with income levels, racial composition, distance to the nearest clinic, and dozens of other factors.

But having data wasn't the same as having answers. The data was messy — some ZIP codes were missing, some records were duplicated, and the demographic categories didn't match between the two sources. Even after she cleaned it up, the patterns were complicated. Low-income neighborhoods had lower vaccination rates, yes, but not always. Neighborhoods with large immigrant populations sometimes had very high rates and sometimes very low rates, depending on factors Elena couldn't see in the numbers alone.

Elena's work isn't just about crunching numbers. It's about understanding people through data, and then translating what she finds into actions that actually help — like opening a new mobile vaccination clinic in a specific neighborhood, or partnering with a trusted community organization to run outreach in a specific language.

Marcus: The Small Business Owner

Marcus owns a bakery called Rise & Shine that he started three years ago. He doesn't have a tech background — he went to culinary school, worked in restaurants for a decade, and scraped together enough savings to open his own place. But Marcus is drowning in data he doesn't know what to do with.

His point-of-sale system generates daily sales reports. His Instagram page shows engagement metrics. His supplier sends invoices that track ingredient costs. His Square terminal logs every transaction with timestamps, item names, and payment methods.

Marcus wants to answer seemingly simple questions: Which items should I promote next month? Should I hire another part-time baker for the holiday season? Are my catering orders growing or was that big November just a fluke?

These are data science questions. Marcus doesn't know it yet, but the skills he'll learn in this book — loading data, cleaning it, making charts, spotting trends — will help him make better business decisions. He doesn't need to become a programmer. He needs to become someone who thinks with data.

Priya: The Sports Journalist

Priya covers the NBA for an online sports publication. She's been in journalism for eight years, and she's watched the industry shift under her feet. The stories that get the most engagement now aren't opinion pieces or game recaps. They're data-driven analyses: "The Golden State Warriors Changed Basketball Forever: Here's the Statistical Proof" gets more clicks than "Warriors Win Game 4."

Priya wants to answer a question that fans argue about constantly: Has the three-point revolution actually made the NBA better, or just different? To answer it, she needs historical shooting data going back decades. She needs to define what "better" even means — more exciting? higher scoring? more competitive? — and then measure it.

Priya is a strong writer and a sharp thinker. She's not a programmer. But she's realized that the journalists who can pull their own data, run their own analyses, and create their own visualizations have a massive competitive advantage. She doesn't need to become a data scientist — she needs enough data science to be a better journalist.

Jordan: The College Student

Jordan is a junior majoring in sociology. Last semester, they took a statistics class and got a B+. This semester, they noticed something that's been nagging at them: two friends in different sections of the same course, taught by different professors, got very different grades despite doing similar quality work. Jordan started asking around and heard the same story over and over — Professor X is an easy grader, Professor Y is tough, and it all seems sort of random.

But is it actually random? Or is there a pattern? Jordan found that their university publishes grade distributions by department and course. They want to investigate: Are there systematic differences in grading across departments, courses, or professors? And if so, does it affect some students more than others?

Jordan's question is personal — their GPA matters for grad school applications, and if grading is systematically unfair, that affects their future. But it's also a bigger question about equity, transparency, and institutional accountability. It's exactly the kind of question that data science is built to answer.

What These Four Have in Common

Elena, Marcus, Priya, and Jordan are different people with different backgrounds, skills, and goals. But they all share three things:

  1. They have a question they care about. Not a vague curiosity, but a specific thing they want to understand.
  2. There's data that might hold the answer. Not perfect data, not complete data, but data that's relevant.
  3. They need a systematic way to get from question to answer. Gut feelings and anecdotes aren't enough.

That systematic way of getting from a question to a data-informed answer? That's data science.

🔄 Check Your Understanding

  1. What do Elena, Marcus, Priya, and Jordan all have in common, even though their fields are completely different?
  2. Pick one of the four characters. In one sentence, what is their core question?
  3. Why do you think "having data" isn't the same as "having answers"?

1.2 So What Is Data Science, Exactly?

Here's the honest truth: if you ask ten data scientists to define "data science," you'll get twelve answers. It's a young field, and people argue about its boundaries constantly. But after reading dozens of definitions and watching thousands of people actually do data science, here's the definition we'll use in this book:

Data science is the practice of using data to answer questions about the world. It draws on statistics, computer programming, and domain expertise — combined with critical thinking and ethical awareness — to extract meaningful insights from data.

Let's break that down piece by piece.

"Using data to answer questions about the world." This is the core. Data science starts with a question. Always. Before you write a single line of code, before you open a single spreadsheet, you need a question. Why are vaccination rates lower in this neighborhood? Which products should I promote next month? Has three-point shooting changed basketball? The question comes first. The data comes second.

"Statistics, computer programming, and domain expertise." Data science lives at the intersection of three skills. You need statistics to understand what data can and can't tell you — to know when a pattern is meaningful and when it's just noise. You need computer programming to work with data that's too large or too complex for a spreadsheet. And you need domain knowledge — expertise in whatever field you're working in — to ask the right questions, interpret the results correctly, and know when something doesn't make sense.

"Critical thinking and ethical awareness." This is the part that often gets left out of definitions, but it might be the most important part. Data can lie. Charts can mislead. Models can discriminate. A good data scientist doesn't just analyze data — they question it. Where did this data come from? Who collected it? Who's missing from it? Who benefits from my analysis, and who could be harmed?

📊 Real-World Application

When Netflix recommends a show you might like, that's data science. When your car insurance company sets your premium based on your driving history, that's data science. When a political campaign decides which neighborhoods to canvass, that's data science. When a hospital predicts which patients are most likely to be readmitted within 30 days, that's data science.

But data science is also Elena figuring out where to put a vaccination clinic. It's Marcus deciding whether to hire for the holidays. It's Priya settling an argument about basketball. It's Jordan investigating whether grading is fair.

Data science isn't just for tech companies with billion-dollar budgets. It's for anyone who has a question and data that might help answer it.

The Venn Diagram You'll See Everywhere

If you've googled "what is data science" before picking up this book, you've almost certainly encountered a Venn diagram showing three overlapping circles: statistics (or math), computer science (or programming), and domain expertise (or business knowledge). Data science sits in the center, where all three overlap.

This diagram is attributed to Drew Conway, who published a version of it around 2010, and it's become one of the most widely reproduced images in data science education.

It's useful, but it's incomplete. Here's what it gets right and what it misses:

What it gets right: You genuinely do need all three. Statistics without programming means you can analyze a small sample but can't handle a million-row dataset. Programming without statistics means you can manipulate data efficiently but don't know whether your results are meaningful. And both without domain expertise means you might build a technically perfect model that answers the wrong question — or worse, one that gets the right answer for the wrong reasons.

What it misses: The diagram doesn't show communication, ethics, or curiosity. But in practice, a data scientist who can't explain their findings to a non-technical audience isn't particularly useful. A data scientist who doesn't think about the ethical implications of their work can cause real harm. And a data scientist who isn't genuinely curious — who doesn't care about the question behind the data — will produce mediocre work.

Think of it this way: statistics, programming, and domain expertise are the skills of data science. But curiosity, communication, and ethics are the character of data science. You need both.

What Data Science Is NOT

Sometimes the clearest way to understand something is to understand what it isn't. Let's clear up some common confusions.

Data science is not statistics. Statistics is a branch of mathematics with roots going back centuries. It provides the theoretical foundations for much of what data scientists do — probability, hypothesis testing, regression, sampling theory. But data science is broader. It includes the practical work of collecting data, cleaning it, storing it, visualizing it, and communicating results. A statistician might prove a theorem about the properties of an estimator. A data scientist is more likely to be debugging why their data pipeline broke at 3 AM.

Data science is not machine learning. Machine learning is a set of techniques where computers learn patterns from data without being explicitly programmed for every scenario. It's a tool that data scientists use, but it's not the whole toolbox. Plenty of valuable data science involves no machine learning at all — sometimes a well-made bar chart answers the question better than the most sophisticated neural network. We'll learn machine learning in Part V of this book, but we'll spend the first four parts building the thinking skills that make machine learning meaningful.

Data science is not software engineering. Software engineers build systems. They write code that runs reliably, handles millions of users, and doesn't crash. Data scientists write code too, but their code is usually exploratory and analytical rather than production-ready. A software engineer might build the recommendation engine that powers Netflix. A data scientist figures out what the recommendation engine should recommend. The skills overlap, but the goals are different.

Data science is not "just" having big data. The term big data refers to datasets so large or complex that traditional methods can't process them — think billions of social media posts, petabytes of sensor readings, or real-time streams from millions of devices. Big data is important, but most data science doesn't require it. Elena's vaccination dataset has a few thousand rows. Marcus's sales data might have a few hundred. These aren't "big" by any tech-company standard, but they're perfectly good for data science. The size of your data matters less than the quality of your questions.

Data science is not magic. Data can't tell you things it doesn't contain. If you have sales data but no customer demographics, you can't learn about your customers' ages. If your survey only reached English speakers, you can't draw conclusions about non-English speakers. Data science is powerful, but it's constrained by what's in the data — and what's not in the data often matters just as much.

⚠️ Common Pitfall

A very common misconception among beginners is that data science is primarily about machine learning and AI. Social media and news coverage tend to focus on the flashiest applications — self-driving cars, chatbots, image recognition — because they make for exciting stories. This creates the impression that "real" data science requires deep learning and massive computing power.

In reality, the vast majority of data science work is descriptive analysis, data cleaning, and visualization. Research suggests that data professionals spend roughly 60-80% of their time on data preparation and exploration. The machine learning part, while important, is often a small fraction of the overall workflow. Don't feel like you need to build a neural network to be "doing" data science. If you're asking a question and using data to answer it systematically, you're doing data science.

🔄 Check Your Understanding

  1. In your own words, what are the three main skills that overlap in data science?
  2. Name one thing that data science is commonly confused with, and explain why they're different.
  3. Why does domain knowledge matter? What could go wrong without it?

1.3 A Brief History: How Did We Get Here?

📜 Historical Context

Understanding where data science came from helps explain why it looks the way it does today — and why you're hearing about it everywhere. This isn't a comprehensive history (that would fill its own book), but rather the key moments that shaped the field you're about to enter.

The Long Prologue: Statistics and Computing

Humans have been collecting data for as long as we've had civilizations. The ancient Egyptians conducted censuses to track their population for tax purposes. The Roman Empire counted its citizens to allocate military resources. In the 1600s and 1700s, European governments began systematically collecting data on births, deaths, and trade — the word "statistics" comes from the German Statistik, the science of the state.

Modern statistics as a mathematical discipline took shape in the late 1800s and early 1900s, with figures like Karl Pearson (who developed the correlation coefficient) and Ronald Fisher (who pioneered experimental design and analysis of variance). These statistical methods — ways of drawing conclusions from imperfect data — form the theoretical backbone of data science.

Meanwhile, computers were transforming what was possible with data. In the 1950s and 1960s, computers were room-sized machines that could crunch numbers faster than any human. By the 1980s, personal computers put computing power on individual desks. By the 2000s, the internet was generating data at a rate no previous generation could have imagined.

The Birth of "Data Science"

The term "data science" has been used in various forms since the 1960s, but its modern meaning crystallized in the 2000s and early 2010s. Several developments came together:

The data explosion. The internet, social media, smartphones, and sensors created a firehose of data. Companies that had been collecting megabytes of data were suddenly collecting terabytes, then petabytes. Traditional statistical methods, designed for small, carefully collected samples, weren't built for this.

Open-source tools. The Python and R programming languages, along with libraries like NumPy, pandas, and scikit-learn, made sophisticated data analysis accessible to anyone with a laptop. You no longer needed expensive proprietary software to do serious analytical work.

The "sexiest job" moment. In 2012, Harvard Business Review published an article by Thomas H. Davenport and DJ Patil titled "Data Scientist: The Sexiest Job of the 21st Century." It argued that organizations were drowning in data but starving for people who could make sense of it. The article didn't create the field, but it gave it a name and a moment in the public spotlight that drew waves of new practitioners.

Industry demand. Companies like Google, Facebook, Amazon, and Netflix had been hiring people who could analyze massive datasets for years, but they hadn't agreed on what to call them. "Data scientist" became the umbrella term, and job postings with that title grew exponentially through the 2010s.

Where We Are Now

Today, data science is a broad, interdisciplinary field that spans almost every industry. Healthcare organizations use it to predict patient outcomes. Governments use it to allocate resources. Newsrooms use it to investigate stories. Sports teams use it to evaluate players. Non-profits use it to measure the impact of their programs.

The tools have gotten more powerful and more accessible. Python has become the dominant language in data science education. Jupyter notebooks have become the standard environment for exploratory analysis. Cloud computing means you can rent supercomputer-level processing power by the hour.

But the fundamental challenge hasn't changed: you have data, you have a question, and you need a systematic way to get from one to the other. That's the challenge this book prepares you to meet.

🔍 Why Does This Work?

Why has data science become so important now, rather than twenty or fifty years ago? Three forces converged simultaneously: (1) the volume of available data exploded thanks to the internet and digital sensors, (2) computing power became cheap enough that individuals could process large datasets on personal laptops, and (3) open-source tools democratized the methods, so you no longer needed a statistics PhD or a corporate software license to do serious analysis. Remove any one of these three forces, and data science as we know it doesn't exist. This convergence is why a bakery owner like Marcus can now do analyses that would have required a team of consultants just fifteen years ago.


1.4 The Data Science Lifecycle: From Question to Insight

Every data science project, whether it's Elena investigating vaccination disparities or Marcus tracking bakery sales, follows a similar pattern. This pattern is called the data science lifecycle, and it has six stages. Understanding this lifecycle is one of the most valuable things you'll take from this chapter, because it gives you a mental framework for every project you'll ever do.

Let's walk through each stage.

Stage 1: Ask a Question

This is where everything begins, and it's the stage that beginners most often skip. The temptation is to jump straight to the data — to open a dataset and start poking around. But without a clear question, your analysis has no direction. You'll find patterns that might not mean anything. You'll make charts that look nice but don't answer anything useful. You'll spend hours going down rabbit holes.

A good data science question is:

  • Specific. Not "What's happening with vaccinations?" but "How do vaccination rates differ between neighborhoods in our county, and what demographic factors are associated with those differences?"
  • Answerable with data. You need to be able to imagine what data would help answer the question. If no data exists or could reasonably be collected, you might have a great philosophical question, but not a data science question.
  • Meaningful. The answer should matter to someone. It should inform a decision, settle a debate, reveal an injustice, or satisfy a genuine curiosity.

Elena's question: Why do vaccination rates differ across neighborhoods, and what can we do about it? Marcus's question: Should I hire extra staff for the holiday season, or was last year's spike a one-time thing? Priya's question: Has the three-point shooting revolution made the NBA statistically different from previous eras? Jordan's question: Do grading patterns at my university show systematic bias across departments or professors?

Notice that none of these questions mention Python, statistics, or machine learning. They're human questions. Data science is the method for answering them, but the question always comes first.

🚪 Threshold Concept: Data Science Is a Way of Thinking, Not a Set of Tools

This is the single most important idea in this entire chapter, and possibly this entire book. It's what learning scientists call a threshold concept — an idea that, once you truly internalize it, permanently changes how you see the field.

Here it is: Data science is a way of thinking, not a set of tools.

Python is a tool. Pandas is a tool. Machine learning is a tool. They're all useful, and we'll learn them all. But they're not data science any more than a hammer and nails are architecture. Architecture is about designing spaces where people can live and work. The hammer is just how you build it.

Data science is about asking good questions, thinking critically about evidence, understanding uncertainty, and communicating insights clearly. If you can do those things with a pencil and a calculator, you're a data scientist. If you can run a TensorFlow neural network but can't explain what question it's answering or why anyone should care, you're not.

This means something encouraging: you can start thinking like a data scientist today, right now, before you write a single line of code. By the end of this chapter, you will have practiced it.

It also means something challenging: the hardest parts of data science aren't the technical parts. They're the thinking parts. Choosing the right question. Recognizing when data is misleading you. Knowing what your analysis can't tell you. These are harder than learning Python syntax, and they're what separate good data scientists from merely competent ones.

Stage 2: Get Data

Once you have a question, you need data to answer it. This stage is about figuring out what data exists, where to find it, and how to get it into a form you can work with.

Data comes from many sources:

  • Public datasets from governments, international organizations, and research institutions (Census Bureau, WHO, CDC, World Bank)
  • Internal organizational data like sales records, patient files, or student grades
  • APIs (Application Programming Interfaces) that let you request data from online services
  • Web scraping — extracting data from websites programmatically
  • Surveys and experiments that you design and conduct yourself

Two important concepts to understand here are structured data and unstructured data.

Structured data is data that's organized into rows and columns — like a spreadsheet. Each row represents one observation (a patient, a sale, a game, a student), and each column represents one characteristic of that observation (name, date, score, grade). Structured data is what most data science tools are built to work with, and it's what we'll focus on for most of this book.

Unstructured data is everything else: text documents, images, audio recordings, social media posts, emails, PDF reports. Unstructured data is more abundant in the real world — estimates suggest that over 80% of the world's data is unstructured — but it's harder to analyze because it doesn't fit neatly into rows and columns.

Elena gets structured data from the state health department (rows = ZIP codes, columns = vaccination rates by age group) and from the Census Bureau (rows = ZIP codes, columns = demographic measures).

Marcus gets structured data from his point-of-sale system (rows = transactions, columns = date, item, price, payment method).

Priya gets structured data from basketball statistics databases (rows = games or players, columns = shooting statistics by season).

Jordan gets structured data from publicly posted grade distributions (rows = courses, columns = number of A's, B's, C's, etc. by semester).
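To make "rows and columns" concrete, here is a minimal sketch of what loading and inspecting structured data might look like in pandas, the Python library introduced later in this book. The CSV text and every value in it are hypothetical stand-ins for a file exported from Marcus's point-of-sale system:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file exported from Marcus's
# point-of-sale system: each row is one transaction (an observation),
# each column is one characteristic of that transaction.
csv_text = """date,item,price,payment
2024-11-01,croissant,3.50,card
2024-11-01,sourdough loaf,7.00,cash
2024-11-02,latte,4.25,card
"""

# read_csv turns the text into a DataFrame: pandas's table of rows and columns
transactions = pd.read_csv(io.StringIO(csv_text))

print(transactions.shape)            # (number of rows, number of columns)
print(list(transactions.columns))    # the characteristics recorded per row
```

In real use, `pd.read_csv("sales.csv")` would read the file directly; the `io.StringIO` wrapper just keeps this sketch self-contained.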

Stage 3: Clean the Data

Here's a truth that every experienced data scientist knows and every beginner is surprised by: the data you get is almost never ready to analyze. Real-world data is messy. It's always messy. This isn't a sign that something went wrong — it's the normal state of things.

Common data problems include:

  • Missing values. Some patients didn't report their age. Some transactions are missing timestamps. Some countries didn't report vaccination data for certain months.
  • Inconsistent formatting. One column says "New York" and another says "NY" and a third says "new york, ny."
  • Duplicates. The same record appears twice because of a database glitch.
  • Wrong data types. A column of numbers is stored as text, so you can't calculate an average.
  • Errors. Someone's age is recorded as 999. A transaction amount is negative when it shouldn't be.

Data cleaning — also called data wrangling or data munging — is the process of detecting and fixing these problems. It's not glamorous. It won't make you famous on Twitter. But it's estimated to consume 60-80% of a typical data scientist's time. It's the part of the job that nobody warns you about, and it's the part you must learn to do well.

We'll dedicate all of Part II to data cleaning and wrangling. For now, just know that it exists, it's important, and it's completely normal to spend more time cleaning data than analyzing it.
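As a preview of Part II, the kinds of fixes listed above might look like this in pandas. The records, labels, and cutoff value are all hypothetical, chosen only to illustrate duplicates, inconsistent formatting, wrong data types, and impossible values:

```python
import pandas as pd

# Hypothetical messy records showing four of the problems above: a duplicated
# row, inconsistent city labels, ages stored as text, and an impossible age.
raw = pd.DataFrame({
    "city": ["New York", "new york, ny", "NY", "NY"],
    "age":  ["34", "29", "999", "999"],
})

clean = raw.drop_duplicates().copy()    # remove the duplicated row

# Standardize inconsistent labels to one canonical form
clean["city"] = clean["city"].replace(
    {"new york, ny": "New York", "NY": "New York"})

# Convert age from text to numbers so averages become possible
clean["age"] = pd.to_numeric(clean["age"])

# Treat impossible values as missing rather than as real ages
clean["age"] = clean["age"].mask(clean["age"] > 120)

print(clean)
```

Each line handles one problem; real cleaning jobs chain dozens of steps like these, which is where the 60-80% of the time goes.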

Stage 4: Explore the Data

Once your data is clean (or clean enough — it's never perfect), you start exploring it. This stage is called exploratory data analysis, or EDA. The goal is to understand what's in the data before you try to answer your question formally.

Exploration typically involves:

  • Computing summary statistics (averages, counts, ranges, distributions)
  • Making lots of charts and visualizations
  • Looking for patterns, trends, and anomalies
  • Identifying relationships between different variables
  • Generating hypotheses about what might explain the patterns you see
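The first two items on that list take only a few lines in pandas. Here is a hypothetical sketch of early EDA on Marcus's revenue data (all figures invented for illustration):

```python
import pandas as pd

# Hypothetical daily revenue for Marcus's bakery across three months
sales = pd.DataFrame({
    "month":   ["Oct", "Oct", "Nov", "Nov", "Dec", "Dec"],
    "revenue": [410.0, 385.0, 520.0, 605.0, 890.0, 940.0],
})

# Summary statistics: count, mean, spread, min/max in one call
print(sales["revenue"].describe())

# A pattern worth a chart: average revenue by month, in order of appearance
monthly = sales.groupby("month", sort=False)["revenue"].mean()
print(monthly)
```

Even this tiny example surfaces a hypothesis — December revenue looks much higher — which is exactly what exploration is for: generating questions to test, not final answers.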

This is the stage where data literacy — the ability to read, interpret, and reason with data — becomes essential. Data literacy is for data what reading comprehension is for text. It's not about technical skills; it's about understanding what numbers and charts are actually saying.

Exploration is also where you often discover that your original question needs refining. Elena might start exploring vaccination rates and discover that the biggest disparity isn't between neighborhoods with different income levels — it's between neighborhoods with and without a pharmacy within walking distance. That discovery changes her question and, ultimately, her recommendations.

Stage 5: Model or Analyze

This is the stage most people think of when they think of data science. Depending on your question, you might:

  • Describe what happened (descriptive analysis): What are the average vaccination rates by region? What were total sales by month?
  • Predict what will happen (predictive analysis): Based on a country's economic indicators, can we predict its vaccination rate? Based on past sales, what will December look like?
  • Explain why something happened (causal inference): Did the new vaccination outreach program cause rates to increase, or would they have increased anyway?

We'll explore these three types of analysis in detail in Section 1.5. For now, notice that not every data science project involves prediction. Sometimes the answer to your question is a well-made chart. Sometimes it's a single number with a measure of uncertainty around it. The "right" analysis depends entirely on the question you're trying to answer.

Stage 6: Communicate Results

This stage is just as important as all the others, and it's the one that gets the least attention in most technical education.

Data science that lives in a notebook nobody reads might as well not exist. The point of doing all this work — asking questions, gathering data, cleaning it, exploring it, analyzing it — is to produce insight that changes how someone thinks or acts. And that only happens if you can communicate your findings clearly, honestly, and persuasively.

Communication means different things depending on your audience:

  • Elena writes a report for the county health director with clear charts showing which neighborhoods need more resources and why. She doesn't include technical jargon.
  • Marcus makes a simple summary for himself: "Holiday sales increased 40% over baseline last year. Historical data suggests this is consistent, not a fluke. Recommendation: hire one additional part-time baker from November 15 through January 5."
  • Priya publishes a data-driven article with interactive charts that readers can explore.
  • Jordan prepares a presentation for the student government with grade distribution comparisons across departments.

In every case, the data scientist must translate technical findings into language and visuals that their audience can understand and act on.

📊 Real-World Application

The lifecycle in action — Elena's vaccination analysis:

  1. Question: Why do vaccination rates differ across our county's neighborhoods?
  2. Data: State vaccination records by ZIP code; Census demographic data; county GIS data showing pharmacy and clinic locations.
  3. Clean: Merge datasets by ZIP code; handle ZIP codes that appear in one source but not the other; standardize demographic categories.
  4. Explore: Map vaccination rates geographically; compute rates by income quartile, racial composition, and distance to nearest clinic; look for patterns.
  5. Analyze: Find that distance to nearest vaccination site explains more variation than income or demographics alone.
  6. Communicate: Recommend to the health director that three mobile vaccination clinics be deployed to specific underserved areas, with supporting data.

Notice that Elena's "model" wasn't a machine learning algorithm. It was a careful descriptive analysis with a geographic component. That's enough. The analysis was valuable because it answered a real question and led to a real decision.
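
Step 3 of Elena's lifecycle — merging sources by ZIP code and handling ZIPs that appear in only one of them — might look like the following sketch. The ZIP codes, column names, and values are all invented stand-ins:

```python
import pandas as pd

# Stand-ins for two of Elena's sources (all values invented)
vax = pd.DataFrame({"zip": ["10001", "10002", "10003"],
                    "vax_rate": [0.71, 0.54, 0.63]})
clinics = pd.DataFrame({"zip": ["10001", "10002", "10004"],
                        "miles_to_clinic": [0.4, 2.1, 3.8]})

# An outer merge keeps every ZIP; the indicator column records
# which source(s) each row came from
merged = vax.merge(clinics, on="zip", how="outer", indicator=True)

# ZIPs present in only one source need an explicit decision:
# drop them, fill in the gap, or investigate why they're missing
mismatched = sorted(merged.loc[merged["_merge"] != "both", "zip"])
```

The `indicator=True` flag is exactly the kind of small tool that makes "handle ZIP codes that appear in one source but not the other" a deliberate decision rather than a silent data loss.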

🔄 Check Your Understanding

  1. What are the six stages of the data science lifecycle? Can you name them in order?
  2. Which stage typically takes the most time? (Hint: it's not the one most people expect.)
  3. Why is the "communicate results" stage so important? What happens to insights that don't get communicated?
  4. Thinking about Marcus's bakery: describe what each stage of the lifecycle might look like for his question about holiday staffing.

1.5 Three Types of Questions: Descriptive, Predictive, and Causal

Not all data science questions are the same. Understanding what type of question you're asking is crucial, because the type determines what methods you use, what conclusions you can draw, and what mistakes you need to watch out for. There are three fundamental types.

Descriptive Questions: "What happened?"

Descriptive analysis is about summarizing and understanding what the data shows. No predictions, no claims about cause and effect — just a clear, accurate picture of what is (or was).

Examples:

  • What was the average vaccination rate in each region last year?
  • How many sourdough loaves did Marcus sell each month?
  • What percentage of NBA shots are three-pointers, and how has that changed over the past 20 years?
  • What's the average GPA in the Biology department versus the English department?

Descriptive analysis is the foundation of all data science. It might sound simple, but doing it well is surprisingly hard. You need to choose the right summary measures (averages can be misleading if the data is skewed). You need to make visualizations that honestly represent the patterns. And you need to resist the temptation to interpret description as explanation — just because two things happen together doesn't mean one caused the other.

Here's a concrete example. Priya finds that in 1990, about 5% of NBA shot attempts were three-pointers. In 2024, it's over 40%. That's a descriptive finding — it tells you what happened. It doesn't tell you why it happened (coaching strategies? rule changes? player development?) or whether it's good for the game. Those are different types of questions.
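
Priya's finding is a per-season proportion. Here is a sketch of that computation; the attempt counts are invented, scaled only to match the rough 5% → 40% shift described above:

```python
import pandas as pd

# Invented shot-attempt counts (not real NBA data)
shots = pd.DataFrame({
    "season": [1990, 1990, 2024, 2024],
    "shot_type": ["3PT", "2PT", "3PT", "2PT"],
    "attempts": [5_000, 95_000, 84_000, 126_000],
})

# Descriptive finding: share of attempts that were three-pointers, by season
totals = shots.groupby("season")["attempts"].sum()
threes = shots[shots["shot_type"] == "3PT"].set_index("season")["attempts"]
three_pt_share = threes / totals
```

The output is a description of *what* changed. Nothing in this computation speaks to *why* it changed.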

Predictive Questions: "What will happen?"

Predictive analysis is about using patterns in existing data to forecast future outcomes.

Examples:

  • Based on economic indicators, what vaccination rate would we predict for a given country?
  • Based on past sales patterns, how many pumpkin spice muffins should Marcus bake next October?
  • Based on a player's college statistics, how many points per game will they score in their first NBA season?
  • Based on a student's first-semester grades, what's the probability they'll graduate in four years?

Prediction is powerful, but it comes with a crucial caveat: a prediction is not an explanation. A model might predict that countries with higher GDP tend to have higher vaccination rates. That doesn't mean giving a country more money will automatically increase its vaccination rate. The prediction tells you what is associated with what, but not why.

This distinction matters enormously in practice. Marcus's sales prediction model might tell him that December sales spike every year. But if he asks why they spike, the prediction model can't answer that. Maybe it's holiday parties. Maybe it's gift baskets. Maybe it's cold weather making people crave pastries. The "why" requires a different kind of analysis.
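
A toy NumPy sketch makes the distinction tangible. The sales figures are invented and perfectly linear so the arithmetic is transparent; the fitted line can extrapolate next month's sales, but nothing in it says *why* sales rise:

```python
import numpy as np

# Invented, perfectly linear monthly sales (toy data)
months = np.arange(1, 13)        # Jan..Dec of last year
sales = 8000 + 150 * months      # revenue grows $150/month in this toy world

# Fit a straight line and extrapolate one month ahead
slope, intercept = np.polyfit(months, sales, 1)
forecast_next = slope * 13 + intercept

# The fit captures an association between month and sales.
# It cannot tell you whether the driver is holidays, weather, or gift baskets.
```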

Causal Questions: "Did X cause Y?"

Causal inference is about determining whether one thing actually caused another. This is the hardest type of analysis, and it's the one where the most mistakes are made.

Examples:

  • Did the mobile vaccination clinic cause vaccination rates to increase in that neighborhood?
  • Did Marcus's Instagram ad campaign cause sales to go up, or would they have gone up anyway?
  • Did the NBA rule change allowing more contact cause more three-point shooting?
  • Did switching from curved grading to fixed grading cause average grades to increase?

The key challenge with causal questions is that you almost always need to know what would have happened without the intervention — what researchers call the counterfactual. If Elena opens a vaccination clinic in a neighborhood and vaccination rates go up, she can't automatically credit the clinic. Maybe rates were going up everywhere because of a new public awareness campaign. Maybe there was a COVID variant in the news that scared people into getting vaccinated. To establish causation, you need to carefully rule out alternative explanations.

The gold standard for causal inference is the randomized controlled experiment — the same design used in medical trials. You randomly assign some people to get the treatment and others to get a placebo, and you compare the outcomes. But in data science, experiments aren't always possible. You can't randomly assign countries to have high GDP to see if it increases vaccination rates. You can't randomly assign students to easy and hard graders to see if it affects their career outcomes.

When experiments aren't possible, data scientists use clever observational methods to approximate causal reasoning. We'll learn about these in Chapter 24. For now, the key lesson is simple: just because two things are correlated doesn't mean one caused the other. This is such an important principle that it has its own catchphrase: correlation does not imply causation.
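
You can watch correlation-without-causation happen in a simulation. In this invented world, motivation (a confounder) drives both front-row seating and grades, while the seat itself has zero effect on the grade — yet seat and grade still correlate:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Unobserved confounder: motivation drives BOTH variables below
motivation = rng.normal(size=n)

# Seat choice depends on motivation (plus noise); grade depends on
# motivation only -- the seat has no causal effect in this simulation
front_row = (motivation + rng.normal(size=n)) > 0.5
grade = 3.0 + 0.4 * motivation + rng.normal(scale=0.3, size=n)

# Yet the two are clearly correlated
corr = np.corrcoef(front_row.astype(float), grade)[0, 1]
```

An analyst who sees only `front_row` and `grade` would find a strong positive correlation — and would be wrong to conclude that moving students to the front row raises grades.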

A Framework for Classifying Questions

Here's a simple way to figure out what type of question you're dealing with:

| Question Type | Key Phrase | What You're Doing | Example |
|---|---|---|---|
| Descriptive | "What happened?" or "What does this look like?" | Summarizing patterns in data | "What's the average vaccination rate by region?" |
| Predictive | "What will happen?" or "What would we expect?" | Forecasting future or unseen outcomes | "What vaccination rate would we predict for this country?" |
| Causal | "Did X cause Y?" or "What would happen if...?" | Determining cause-and-effect relationships | "Did the outreach program cause rates to increase?" |

Most data science projects involve a mix of all three types. Elena starts with description (what do vaccination rates look like across neighborhoods?), moves to prediction (can we predict which neighborhoods will have low rates?), and ultimately wants to address causation (if we put a clinic here, will rates go up?).

🧩 Productive Struggle

Classify each of the following as a descriptive, predictive, or causal question. Some might be tricky — they might look like one type at first glance but actually be another. Take at least five minutes to think through these before checking the discussion below.

  1. "Students who sit in the front row get higher grades. Does sitting in the front row improve your grade?"
  2. "Based on a neighborhood's demographics and distance to the nearest hospital, can we estimate its likely emergency room visit rate?"
  3. "What percentage of our customers are repeat buyers?"
  4. "Did the new company wellness program reduce employee sick days?"
  5. "If we increase our advertising budget by 20%, how much will sales increase?"

Discussion:

  1. Causal — it's asking whether sitting in the front row causes better grades. (It probably doesn't — motivated students choose to sit up front, and motivation, not location, drives the grades. This is a classic confounding variable situation.)
  2. Predictive — it's asking what we'd expect for an unobserved outcome based on known characteristics.
  3. Descriptive — it's asking for a summary statistic about what's already happened.
  4. Causal — it's asking whether the program caused the reduction, not just whether the reduction happened.
  5. This one's tricky. It sounds causal ("if we do X, Y will happen"), but in practice, it's often answered with a predictive model. A truly causal answer would require an experiment — running ads in some markets but not others and comparing results. Most companies treat this as a prediction question, which means they should be cautious about how confidently they interpret the answer.

🔄 Check Your Understanding

  1. Explain the difference between a descriptive and a predictive question in your own words.
  2. Why is it harder to answer a causal question than a descriptive one?
  3. Come up with one example of each type of question from your own life or interests.

1.6 What Does a Data Scientist Actually Do All Day?

If you're considering data science as a career — or if you're just curious about what the day-to-day reality looks like — it's worth painting an honest picture. The reality is quite different from what job postings and social media suggest.

The Actual Job

A data scientist is someone who uses data science methods professionally — but what that looks like varies enormously by organization, industry, and role. Here's a rough breakdown of how a typical data scientist might spend their week:

30-40% — Data wrangling and cleaning. Finding, loading, merging, and cleaning data. Dealing with missing values, inconsistent formats, and data quality issues. This is the largest chunk of time, and it's the part that rarely makes it into job descriptions or LinkedIn posts.

20-30% — Exploratory analysis and visualization. Making charts. Computing summaries. Looking for patterns. Iterating between hypotheses and evidence. This is the detective work — the part that feels most like "doing data science."

10-20% — Modeling and analysis. Building statistical models, running hypothesis tests, or training machine learning models. This is the part that gets the most attention in courses and media coverage, but it's a smaller fraction of the job than most people expect.

15-25% — Communication and collaboration. Writing reports. Making presentations. Meeting with stakeholders to understand their questions. Explaining results to people who don't speak "data." This is the part that turns analysis into impact.

5-10% — Learning and maintenance. Keeping up with new tools, debugging existing analyses, and improving workflows.

The Many Flavors of "Data Scientist"

The title "data scientist" covers a wide range of roles. Here are some common variations:

  • Data Analyst: Focuses on descriptive analysis, reporting, and visualization. Often works with business stakeholders to track metrics and identify trends. Tools: SQL, Excel, Tableau, Python/R.
  • Machine Learning Engineer: Focuses on building and deploying predictive models in production. More engineering-oriented than a typical data scientist. Tools: Python, TensorFlow/PyTorch, cloud platforms.
  • Research Scientist: Focuses on developing new methods or applying advanced techniques to complex problems. Often found in academia, big tech, and pharmaceuticals.
  • Data Engineer: Focuses on building the infrastructure that makes data science possible — data pipelines, storage systems, and processing frameworks. More software engineering than analysis.
  • Business Intelligence Analyst: Focuses on creating dashboards, reports, and monitoring systems that help organizations track their performance.

These roles overlap significantly, and many organizations use the titles inconsistently. Don't get too hung up on labels. The skills you learn in this book — asking good questions, working with data, creating visualizations, thinking statistically, communicating results — are valuable in all of these roles.

📊 Real-World Application

A week in Marcus's (future) life as a data-savvy business owner:

  • Monday: Downloads last week's sales data from Square. Notices that croissant sales dropped 20%. Wonders if it's the weather (cold snap kept people home) or the product (did the new recipe not land?).
  • Tuesday: Makes a chart comparing croissant sales over the past 3 months, overlaid with daily temperature data. Sees that the drop correlates with a snowstorm. Tentative conclusion: weather, not product quality.
  • Wednesday: Looks at holiday sales data from the past 2 years. Calculates that December-January sales are consistently 35-45% higher than the annual average. Updates his staffing plan.
  • Thursday: A customer survey mentions that people want more gluten-free options. Marcus counts how many gluten-free requests he's logged this month (12) versus last month (4). Thinks about whether to add a gluten-free muffin.
  • Friday: Prepares a simple one-page summary for his business partner: key metrics for the week, one chart, two recommendations.

Marcus didn't use machine learning. He didn't write a single line of code (though later in this book, he'll learn to). But he thought with data all week long. That's data science.
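
Marcus's Wednesday calculation — holiday sales versus the rest of the year — is a few lines once the data is in a table. Here is a sketch with invented numbers, with December bumped exactly 40% above baseline so the arithmetic is checkable:

```python
import pandas as pd

# Two invented years of monthly sales; December is 40% above baseline
months = list(range(1, 13)) * 2
df = pd.DataFrame({
    "month": months,
    "sales": [140.0 if m == 12 else 100.0 for m in months],
})

# Compare the holiday month to the rest of the year
december_avg = df.loc[df["month"] == 12, "sales"].mean()
baseline_avg = df.loc[df["month"] != 12, "sales"].mean()
uplift_pct = 100 * (december_avg - baseline_avg) / baseline_avg
```

A grouped mean and a percentage — no machine learning required, exactly as the story says.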


1.7 Data-Driven Decision Making — And Its Limits

You'll hear the phrase data-driven decision making a lot in this field. It means using evidence from data — rather than intuition, authority, or tradition — to guide choices. It's the backbone of modern data science practice, and it's genuinely powerful.

Elena uses data to decide where to place mobile vaccination clinics. Marcus uses data to decide whether to hire for the holidays. A hospital uses data to decide which patients need extra monitoring after discharge.

But data-driven decision making has limits, and understanding those limits is part of becoming a good data scientist rather than a naive one.

Limit 1: Data Reflects the Past

All data is historical. It tells you what has happened, not necessarily what will happen. Marcus's sales data shows patterns from the past two years, but the future might be different. A new competitor might open nearby. A pandemic might change customer behavior. A viral TikTok might suddenly make his sourdough famous. Data-driven predictions are extrapolations, and extrapolations break when the world changes.

Limit 2: Data Reflects Who Was Counted

Every dataset represents someone's decisions about what to count and how to count it. If the vaccination data only tracks people who go to official vaccination sites, it misses people who got vaccinated at pop-up events. If the sales data only tracks credit card purchases, it misses cash customers. If Jordan's grade data only includes departments that publish distributions, departments that don't publish are invisible.

The phrase "every dataset has a human story" means that behind every row and column, there are decisions — about what to include, what to exclude, how to categorize, and how to measure. Those decisions shape what the data can tell you and, just as importantly, what it can't.

💡 Intuition — Thinking about who's missing

Imagine you run a survey about customer satisfaction at a restaurant. You put the survey on the receipt, and customers fill it out voluntarily before they leave. Who fills out the survey? Probably people who had a really great experience (they want to rave) and people who had a really terrible experience (they want to complain). The average customer — who had a fine, forgettable meal — probably doesn't bother.

So your survey data is biased. It overrepresents extreme opinions and underrepresents the middle. Any analysis based on this data will overestimate both how happy and how unhappy your customers are.

This isn't a flaw in your analysis. It's a flaw in your data collection process. But if you don't think about it, you'll draw the wrong conclusions. Data scientists think about this stuff constantly.
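
A quick simulation makes the distortion visible. The satisfaction distribution and response rates below are invented assumptions, but the mechanism is the point: extremes answer, the middle doesn't, and the survey's picture of the restaurant shifts accordingly:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# True satisfaction, 1-5: most diners had a fine, forgettable meal (score 3)
scores = rng.choice([1, 2, 3, 4, 5], size=n, p=[0.05, 0.10, 0.50, 0.25, 0.10])

# Assumed response rates: extremes (1s and 5s) respond 60% of the time, others 5%
respond_prob = np.where(np.isin(scores, [1, 5]), 0.60, 0.05)
responded = rng.random(n) < respond_prob

# Extremes are a small slice of diners but dominate the survey responses
extreme_share_pop = np.isin(scores, [1, 5]).mean()
extreme_share_survey = np.isin(scores[responded], [1, 5]).mean()
```

In this toy world, roughly 15% of diners hold extreme opinions, yet they make up well over half of the survey responses — the bias is baked in before any analysis begins.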

Limit 3: Data Doesn't Make Decisions — People Do

Data informs decisions, but it doesn't make them. Elena's data might show that a particular neighborhood needs a vaccination clinic. But the decision to fund that clinic involves politics, budgets, logistics, and community priorities that aren't in the data. Marcus's data might suggest hiring for the holidays, but the decision depends on whether he can afford it and whether he can find good candidates.

A good data scientist understands that their job is to provide the best possible evidence, clearly communicated, to the people who make decisions. The decision itself involves values, priorities, and context that go beyond what data can capture.

🗣️ Debate/Discussion Framework

"Should all important decisions be data-driven?"

Consider this scenario: A school district is deciding whether to close a small neighborhood elementary school. The data shows that the school is underperforming on test scores, has declining enrollment, and costs 40% more per student than the district average. The data says: close the school.

But the school is the heart of its neighborhood. It's where families gather, where kids walk rather than riding buses for an hour, and where the community's sense of identity lives. Parents are devastated at the thought of losing it.

Position A: Data should drive the decision. Resources are limited, and we owe it to students across the district to allocate them where they'll do the most good. Keeping an expensive, underperforming school open because of emotion is a disservice to students at other schools.

Position B: Data is one input, not the only input. Test scores don't capture everything a school provides to its community. A purely data-driven decision would miss the human cost of the closure and the intangible value of neighborhood schools.

Position C: The real problem is that the data is incomplete. If we measured community impact, parent engagement, walking-distance access, and neighborhood stability, the data might tell a different story. The issue isn't using data versus not using it — it's which data we choose to measure.

Where do you stand? There's no single right answer, and that's exactly the point. Data science gives you evidence, but the hardest decisions involve weighing evidence against values. Becoming comfortable with that tension is part of becoming a mature data scientist.

🔄 Check Your Understanding

  1. In your own words, what is "data-driven decision making"?
  2. Name one limitation of relying on data to make decisions.
  3. Why does it matter who is (and isn't) represented in a dataset?

1.8 Evaluating Data Claims in the Wild

You encounter data claims every day, whether you realize it or not. "A new study finds that coffee drinkers live longer." "Our product has a 98% satisfaction rate." "Crime is up 15% this year." As someone learning data science, you have both the opportunity and the responsibility to evaluate these claims critically.

Here's a framework — call it the SAUCE test — that you can apply to any data claim you encounter:

S — Source: Who is making this claim? What's their motivation? A study funded by a coffee company that says coffee is healthy deserves more scrutiny than an independent study with the same finding.

A — Amount: How much data is this based on? A survey of 12 people is very different from a survey of 12,000 people. Bigger isn't always better, but tiny samples are rarely reliable.

U — Uncertainty: Does the claim acknowledge uncertainty? Honest data analysis almost always involves uncertainty — margins of error, confidence intervals, ranges rather than single numbers. If someone presents a precise number with no qualifiers ("exactly 73.4% of people prefer X"), be suspicious.

C — Comparison: Compared to what? "Sales are up 15%" sounds great, but compared to what? Last month? Last year? The industry average? Without a comparison, a number is just a number.

E — Explanation: Is the explanation offered justified by the data? If someone says "coffee drinkers live longer because coffee is healthy," ask yourself: is there another explanation? Maybe coffee drinkers tend to be wealthier, and wealth predicts health. This is the correlation-vs.-causation problem again.

Scenario Walkthrough: Reading a News Article Critically

Let's practice this. Imagine you read this headline:

"Study: Students Who Use Laptops in Class Get Lower Grades"

The article describes a study where researchers compared the grades of students who used laptops in lectures to students who took notes by hand. Laptop users had, on average, a GPA that was 0.3 points lower.

Let's apply the SAUCE test:

Source: Who conducted the study? If it was a university research group publishing in a peer-reviewed journal, that's reasonably trustworthy. If it was a company that sells paper notebooks, be cautious.

Amount: How many students were in the study? 50? 500? 5,000? Were they from one school or many? A study at a single Ivy League university might not generalize to community colleges.

Uncertainty: Was the 0.3-point difference statistically significant? Could it have arisen by chance? Does the article mention a margin of error or p-value?

Comparison: What were the groups being compared? Were the laptop users and hand-writers similar in other ways — same courses, same backgrounds, same motivation levels? Or were the laptop users disproportionately in harder courses?

Explanation: The headline implies that laptop use causes lower grades. But is that what the study actually found? Maybe students who struggle academically are more likely to use laptops as a crutch. Maybe students who use laptops are more likely to browse social media during class, and it's the social media, not the laptop itself, that's the problem. The data might show an association, but the headline presents it as causation.

This is exactly the kind of critical thinking that data science develops. You don't need to run your own study to evaluate a claim. You just need to ask the right questions about the data behind it.

⚖️ Ethical Analysis

Data claims in the news can have real consequences. If the laptop study becomes widely cited, schools might ban laptops from classrooms. That policy would affect students with disabilities who need laptops for accessibility, students who type faster than they write, and students in online-hybrid courses.

Before a data claim becomes a policy, it's worth asking: Who benefits from this interpretation of the data? Who is harmed? What alternative interpretations exist? This ethical dimension is something we'll return to throughout the book — because data science doesn't happen in a vacuum, and the conclusions we draw affect real people.


1.9 Your Turn: Finding Your Own Question

We've been talking about Elena, Marcus, Priya, and Jordan's questions. Now it's time for yours.

One of the most powerful things about data science is that it's personal. You don't have to care about vaccination rates or basketball statistics or grading bias (though you might). You just need to care about something — and have a hunch that data might help you understand it better.

Here are some prompts to get you thinking:

In your daily life:

  • Is the "express" checkout lane at the grocery store actually faster?
  • Does the weather affect your mood?
  • How has the cost of your grocery basket changed over the past year?
  • Do you actually sleep better on weekends, or does it just feel that way?

In your community:

  • Are potholes more common in some neighborhoods than others? Why?
  • How does air quality in your city compare to similar cities?
  • Are there food deserts (areas without nearby grocery stores) near you?

In your interests:

  • Does home-field advantage matter more in some sports than others?
  • Are sequels generally rated lower than original movies?
  • Has the length of popular songs changed over the past 50 years?
  • Do books that win literary prizes actually sell more copies?

In your work or school:

  • Does class size affect student performance in your department?
  • Does the day of the week affect productivity at your workplace?
  • Do some types of marketing emails get more engagement than others?

Pick one question that genuinely interests you. Write it down. Now ask yourself:

  1. What type of question is it — descriptive, predictive, or causal?
  2. What data might help answer it? Does that data exist?
  3. Who would care about the answer?

That question is yours for the rest of this book. As you learn new skills — loading data, cleaning it, visualizing it, testing hypotheses — you'll have opportunities to apply each skill to your own question. By the end, you'll have answered it.

🪞 Self-Assessment

Take a moment to reflect honestly:

  • How comfortable do you feel right now with the idea of "doing data science"? Rate yourself 1-5, where 1 is "completely intimidated" and 5 is "let's go."
  • Which of the three skills (statistics, programming, domain knowledge) do you feel strongest in? Which feels most daunting?
  • Did anything in this chapter surprise you? What was different from your previous impression of data science?

There are no wrong answers. The purpose of this check-in is to know where you're starting from. Come back to these answers when you finish the book and see how far you've traveled.


📐 Project Checkpoint: Defining Your Research Questions

Throughout this book, you'll build a complete data analysis of a real public health dataset — exploring global vaccination rate data from the World Health Organization and the U.S. Centers for Disease Control and Prevention. Each chapter adds one layer to this project. By the end, you'll have a polished Jupyter notebook report that you could include in a portfolio.

For this first checkpoint, you don't need any code. You don't need any data. You just need the skill we've been building all chapter: asking good questions.

Your Task

Write 3-5 research questions about global vaccination rate disparities. These questions will guide your analysis for the rest of the book. At least one should be descriptive, at least one should be predictive, and at least one should address causation or explanation.

Here are some examples to spark your thinking (don't just copy these — adapt them or create your own):

  • Descriptive: "Which regions of the world have the lowest childhood vaccination rates, and how have those rates changed over the past decade?"
  • Descriptive: "How much variation exists in vaccination rates among countries with similar GDP levels?"
  • Predictive: "Based on a country's economic indicators (GDP per capita, healthcare spending), how well can we predict its vaccination rate?"
  • Causal/Explanatory: "Did international aid programs targeted at vaccination infrastructure lead to measurable increases in vaccination rates in recipient countries?"
  • Causal/Explanatory: "To what extent do healthcare access factors (doctors per capita, distance to clinics) explain vaccination rate differences, compared to economic factors?"

Quality Checklist for Your Questions

For each question, verify:

  • [ ] It's specific enough to guide an analysis (not just "what about vaccinations?")
  • [ ] You can imagine what data you'd need to answer it
  • [ ] The answer would be interesting or useful to someone
  • [ ] You've labeled it as descriptive, predictive, or causal
  • [ ] You've noted at least one challenge or limitation in answering it

Write your questions down now. We'll return to them in Chapter 2, when you set up your Jupyter notebook, and they'll evolve and sharpen as your skills grow.


Practical Considerations

As you begin your data science journey, keep these practical realities in mind:

You don't need to know everything before you start. Data science is learned iteratively. You'll learn a concept, apply it, realize you don't fully understand it, learn it again more deeply, and repeat. This is normal. It's not a sign that you're doing it wrong — it's how the learning process works.

Imposter syndrome is extremely common. If you look at data science communities online and feel like everyone knows more than you, remember that people tend to share their successes, not their struggles. Every experienced data scientist once spent an hour trying to figure out why their code wasn't working, only to discover they had a typo. You're in good company.

The tools change. The thinking doesn't. Python might not be the dominant language ten years from now. Pandas might be replaced by something faster. But the ability to ask a good question, think critically about evidence, understand uncertainty, and communicate insights clearly — those skills are permanent. That's why this book emphasizes thinking alongside tools.

Start small. You don't need a million-row dataset to practice data science. Marcus's bakery data — maybe a few hundred rows — is perfect. Jordan's grade distributions might fit on a single screen. Small datasets are easier to understand, faster to work with, and more forgiving when you make mistakes. Start small, get comfortable, and scale up later.

Make mistakes on purpose. Or rather, don't be afraid of mistakes. Every error message you encounter is teaching you something. Every wrong turn in an analysis is showing you what doesn't work. The fastest way to learn is to try things, break things, and figure out why they broke.


Chapter Summary

Key Concepts

  • Data science is the practice of using data to answer questions, combining statistics, programming, and domain knowledge with critical thinking and ethical awareness.
  • The data science lifecycle has six stages: Ask a Question, Get Data, Clean the Data, Explore the Data, Model or Analyze, and Communicate Results.
  • Data science questions come in three types: descriptive (what happened?), predictive (what will happen?), and causal (did X cause Y?).
  • Structured data lives in rows and columns; unstructured data includes text, images, and other formats that don't fit neatly into tables.
  • Domain knowledge — expertise in the specific field you're analyzing — is essential for asking the right questions and interpreting results correctly.
  • Data literacy is the ability to read, interpret, and reason with data — a skill as fundamental as reading text.
  • Data science is not the same as statistics, machine learning, software engineering, or "big data," though it draws from all of them.
  • Data-driven decision making is powerful but has limits: data reflects the past, represents only those who were counted, and informs decisions rather than making them.

Key Questions to Ask Yourself

Before starting any data science project:

  1. What question am I trying to answer?
  2. What type of question is it — descriptive, predictive, or causal?
  3. What data would I need to answer it?
  4. Does that data exist? What's missing from it?
  5. Who will use my findings, and what decisions might they inform?
  6. Who could be affected by my analysis, and how?

Decision Framework: "Is This a Data Science Problem?"

When someone brings you a question and asks "can data science help?", use this simple framework:

  1. Is there a specific question? If the request is "look at the data and find something interesting," push back gently. What are you trying to decide? What would change if you had the answer?
  2. Is there relevant data? Not perfect data — just relevant data. If no data exists or could reasonably be collected, data science can't help (yet).
  3. Is the question answerable from data? Some questions are important but not empirical. "Should we prioritize equity or efficiency?" is a values question, not a data question. Data can inform the debate, but can't resolve it.
  4. Would the answer change someone's behavior? If the answer wouldn't affect any decision, the analysis might not be worth doing — no matter how interesting the question is.

If you answered "yes" to all four, data science can probably help. If not, the question might need to be refined before data science becomes useful.
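As a tiny preview of the Python you'll write starting in Chapter 2, the four checks above can be encoded as a simple checklist function. This is an illustrative sketch: the function name, parameters, and example scenarios are ours, not part of any library.

```python
# The four-question framework from this chapter, encoded as a checklist.
# Each parameter is a yes/no answer to one framework question.

def is_data_science_problem(has_specific_question,
                            has_relevant_data,
                            answerable_from_data,
                            would_change_behavior):
    """Return True only if all four framework questions are answered 'yes'."""
    return all([has_specific_question,
                has_relevant_data,
                answerable_from_data,
                would_change_behavior])

# A bakery owner asking "Should I stock up for a seasonal rush?"
# with six months of receipts in hand: all four checks pass.
print(is_data_science_problem(True, True, True, True))   # True

# "Should we prioritize equity or efficiency?" is a values question.
# Data can inform it but can't answer it, so check 3 fails.
print(is_data_science_problem(True, True, False, True))  # False
```

The `all()` built-in mirrors the framework's logic exactly: a single "no" anywhere means the question needs refining before data science becomes useful.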


What's Next

In this chapter, you've learned what data science is, why it matters, and how to start thinking like a data scientist. You've met four people whose stories will anchor your learning for the rest of this book. You've learned the six stages of the data science lifecycle. You've started classifying questions by type. And you've written your own research questions about a real-world problem.

But you've done all of this with words and ideas. In Chapter 2: Setting Up Your Toolkit, we get practical. You'll install Python and Jupyter on your computer, create your first notebook, and write your first lines of code. It's the moment where data science stops being something you read about and starts being something you do.

The gap between "I understand the concept" and "I can actually do it" is the gap that this book is built to help you cross. Chapter 1 gave you the concepts. Chapter 2 gives you the tools. Let's go.

🔗 Connection: The question-first thinking from this chapter will return in every single chapter of this book. In Chapter 6, when you do your first real data analysis, you'll start by writing your question down in a Jupyter cell. In Chapter 14, when you learn about visualization, you'll choose chart types based on what question you're trying to answer. In Chapter 23, when you learn hypothesis testing, you'll translate your question into a formal statistical test. The question always comes first. Always.