Learning Objectives
- Distinguish between observational studies and experiments
- Identify and evaluate sampling methods (random, stratified, cluster, convenience, systematic)
- Explain why randomization matters for both sampling and experimentation
- Recognize common sources of bias in data collection
- Evaluate whether a study design supports causal conclusions
In This Chapter
- Chapter Overview
- 4.1 Two Ways to Collect Data: Observing vs. Experimenting
- 4.2 Sampling: How You Choose Matters More Than You Think
- 4.3 Bias: When Your Data Lies to You
- 4.4 The 1936 Literary Digest Disaster: A Cautionary Tale
- 4.5 Why Randomization Matters
- 4.6 Confounding Variables: The Hidden Threat
- 4.7 Designing Experiments: Treatment, Control, and Blinding
- 4.8 Evaluating Causal Claims: A Checklist
- 4.9 Connection to AI: Training Data as a Sample
- 4.10 Project Checkpoint: Evaluating Your Dataset's Collection Method
- 4.11 Spaced Review: Strengthening Previous Learning
- Chapter Summary
- What's Next
Chapter 4: Designing Studies: Sampling and Experiments
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey, statistician
Chapter Overview
Here's a question that might surprise you: the most important part of a statistical analysis happens before anyone touches the data.
Think about that for a second. In Chapters 1 through 3, you learned what statistics is, how to classify variables, and how to load and explore data in Python. You've got tools. You've got vocabulary. You're ready to start crunching numbers.
But here's the uncomfortable truth: if the data was collected badly, no amount of clever analysis can save it. You can run the fanciest machine learning algorithm in the world, and if the data feeding it is biased, the results will be biased too. Garbage in, garbage out — it's one of the oldest sayings in computing, and it's never been more relevant than right now.
In this chapter, you're going to learn how data should be collected — and all the ways it can go wrong. You'll discover why a famous magazine predicted the wrong winner of a presidential election despite surveying 2.4 million people. You'll understand why ice cream sales and drowning deaths rise together (and why ice cream doesn't cause drowning). And you'll see how modern tech companies run thousands of experiments every year using a technique called A/B testing.
Most importantly, you'll develop a critical eye for evaluating any claim that says "studies show..." Because the design of a study determines what it can prove — and anyone trying to mislead you is counting on the fact that you don't know that.
In this chapter, you will learn to:
- Distinguish between observational studies and experiments
- Identify and evaluate five sampling methods: random, stratified, cluster, convenience, and systematic
- Explain why randomization matters for both sampling and experimentation
- Recognize common sources of bias in data collection
- Evaluate whether a study design supports causal conclusions
Fast Track: If you've taken AP Statistics or a prior course covering sampling and experimental design, skim Sections 4.1-4.3 and jump to Section 4.6 ("Confounding Variables: The Hidden Threat"). Complete quiz questions 1, 8, and 15 to verify your foundation.
Deep Dive: After this chapter, read Case Study 1 (the 1936 Literary Digest poll) for a dramatic real-world example of sampling bias, then Case Study 2 (A/B testing at tech companies) to see how modern experiments are designed at scale.
4.1 Two Ways to Collect Data: Observing vs. Experimenting
Let's start with Dr. Maya Chen.
Maya is an epidemiologist studying asthma in low-income communities. She's noticed that children in certain zip codes have asthma rates three times higher than the national average. She wants to understand why — and more importantly, she wants to know if anything can be done about it.
She has two fundamentally different approaches available to her:
Approach 1: She could survey families in high-asthma and low-asthma neighborhoods, record their living conditions (mold, air quality, proximity to highways, access to healthcare), and look for patterns in the existing data. She doesn't change anything — she just observes what's already happening.
Approach 2: She could select a group of families, randomly assign half of them to receive free air purifiers and home mold remediation, and compare their children's asthma outcomes to the other half who didn't receive the intervention. She actively changes something and measures the result.
These two approaches represent the most fundamental distinction in study design:
An observational study observes individuals and measures variables of interest without attempting to influence the responses. The researcher collects data on what already exists.
An experiment deliberately imposes a treatment on individuals in order to observe their responses. The researcher actively changes something.
This distinction isn't just academic — it determines what conclusions you can draw. And getting this wrong is one of the most common mistakes in interpreting research.
The Comparison Table
| Feature | Observational Study | Experiment |
|---|---|---|
| Does the researcher intervene? | No — observes what already exists | Yes — imposes a treatment |
| Can establish causation? | Generally no | Yes (if well-designed) |
| Ethical flexibility | Can study harmful exposures | Cannot assign harmful treatments |
| Cost and complexity | Usually lower | Usually higher |
| Real-world relevance | High — studies natural conditions | Variable — lab vs. field |
| Confounding risk | High | Low (if randomized) |
| Example | Surveying asthma rates across neighborhoods | Providing air purifiers to randomly selected homes |
Professor Washington's Dilemma
Professor James Washington is studying whether a predictive policing algorithm shows racial bias. He's found that neighborhoods flagged as "high crime" by the algorithm are disproportionately communities of color.
But here's his challenge: is the algorithm biased, or is it reflecting real crime patterns that are themselves shaped by decades of biased policing? He can't randomly assign police officers to patrol different neighborhoods equally (that's a logistical and political impossibility). He can't randomly assign which neighborhoods get flagged by the algorithm (that would compromise public safety policies). He's stuck with observational data — data shaped by the very system he's trying to evaluate.
This is the central limitation of observational studies: you can identify associations, but you can't isolate causes. Washington can show that the algorithm's predictions correlate with race. He can even control for other variables to strengthen his case. But he can't definitively prove the algorithm causes biased outcomes, because he can't run an experiment.
And yet — some of the most important questions in social science can only be studied observationally. You can't randomly assign people to grow up in poverty. You can't randomly assign neighborhoods to have polluted air. You can't randomly assign countries to have different forms of government. Observational studies aren't inferior to experiments — they're essential tools for studying questions where experiments would be impossible or unethical.
Key insight: The type of study determines the type of conclusion. Observational studies can show that A and B are associated. Only experiments can show that A causes B.
4.2 Sampling: How You Choose Matters More Than You Think
Before you can study a population, you need a sample. And how you choose that sample can make or break your entire analysis.
Remember from Chapter 1: a population is the entire group you want to study. A sample is the subset you actually observe. From Chapter 2: a parameter describes the population (the "truth" you're trying to discover), and a statistic describes the sample (what you actually measure).
Here's the core problem: you almost never have access to the entire population. The U.S. Census tries to count every person in America, and even that misses people. So you need a sample — and you need it to be representative of the population you care about. If your sample is systematically different from the population, your conclusions will be wrong, no matter how large your sample is.
That last part is crucial, so let me say it again: a bad sample doesn't get better just because it's bigger. A biased sample of 2.4 million people can give you worse results than a well-designed sample of 50,000. We know this because it actually happened — but I'll save that story for Section 4.4.
Let's look at five major sampling methods.
Simple Random Sampling
A random sample (more precisely, a simple random sample) is one in which every member of the population has an equal chance of being selected. It's the gold standard of sampling — the method against which all others are compared.
Imagine Dr. Chen wants to survey 500 families in a county with 50,000 families. In a simple random sample, she would assign every family a number from 1 to 50,000, then use a random number generator to pick 500 of those numbers. Every family has a 500/50,000 = 1% chance of being selected.
Why it works: Random selection means that, on average, your sample will look like the population. It won't be perfect — any given sample might have slightly more young families or slightly fewer renters than the population — but these differences will be due to chance, not systematic bias. And as we'll see in Chapter 11, we can actually quantify how much a random sample might differ from the population.
The catch: You need a complete list of the population (a "sampling frame") to draw from. For some populations, this is easy — Dr. Chen could get a list of all registered addresses in the county. For others, it's impossible — there's no master list of "all Americans who suffer from migraines."
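If the sampling frame exists, a simple random sample is a few lines of code. Here's a minimal sketch of Dr. Chen's selection process (the family IDs and the seed are illustrative, not real data):

```python
import random

# Hypothetical sampling frame: every family in the county gets an ID
population = list(range(1, 50_001))  # family IDs 1..50,000

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=500)  # each family has a 500/50,000 = 1% chance

print(len(sample))       # 500 families selected
print(len(set(sample)))  # 500 -- sampling is without replacement, no repeats
```

Note that `random.sample` draws without replacement, which matches how surveys actually work: no family gets surveyed twice.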
Stratified Sampling
Stratified sampling divides the population into subgroups (called strata) and then takes a random sample from each subgroup.
Suppose Dr. Chen's county has three income levels: low, middle, and high income neighborhoods, making up 20%, 50%, and 30% of the population. If she takes a simple random sample of 500, she might get unlucky and end up with very few low-income families — exactly the group she's most interested in.
With stratified sampling, she would first divide the county into the three income strata, then randomly sample from each:
- Low income (20%): randomly select 100 families
- Middle income (50%): randomly select 250 families
- High income (30%): randomly select 150 families
Why it works: Stratified sampling guarantees representation of important subgroups. Every stratum is included proportionally (or even oversampled if a subgroup is particularly important to study). This often produces more precise estimates than simple random sampling, especially when the variable of interest (like asthma rates) varies a lot between strata.
The catch: You need to know which strata are relevant before you sample. And you need to know which stratum each population member belongs to.
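Proportional allocation is just the simple-random-sample idea applied once per stratum. A sketch with hypothetical strata (the ID ranges are made up for illustration):

```python
import random

random.seed(0)
# Hypothetical strata: family IDs grouped by neighborhood income level
strata = {
    "low":    list(range(0, 10_000)),       # 20% of 50,000 families
    "middle": list(range(10_000, 35_000)),  # 50%
    "high":   list(range(35_000, 50_000)),  # 30%
}

total_n = 500
pop_size = sum(len(ids) for ids in strata.values())

# Proportional allocation: sample each stratum at the same 1% rate
sample = {
    name: random.sample(ids, k=round(total_n * len(ids) / pop_size))
    for name, ids in strata.items()
}

for name, ids in sample.items():
    print(name, len(ids))  # low 100, middle 250, high 150
```

Oversampling a stratum of special interest just means replacing the proportional `k` with a larger number for that stratum.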
Cluster Sampling
Cluster sampling divides the population into naturally occurring groups (clusters) and then randomly selects entire clusters to include. Everyone within the selected clusters is studied.
Dr. Chen might use city blocks as clusters. She could list all 500 city blocks in her county, randomly select 25 blocks, and survey every family on those 25 blocks.
Why it works: Cluster sampling is practical and cost-effective. Sending researchers door-to-door on 25 contiguous blocks is much cheaper than sending them to 500 randomly scattered addresses across the county. It's especially useful for large, geographically spread populations.
The catch: Cluster sampling is generally less precise than simple random or stratified sampling, because people within the same cluster tend to be similar (neighbors often have similar income levels, housing types, and exposure to pollutants). You need more total observations to achieve the same precision.
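The key mechanical difference from the other methods: randomness happens at the cluster level, and then everyone inside a chosen cluster is surveyed. A sketch with invented block and family identifiers:

```python
import random

random.seed(1)
# Hypothetical clusters: 500 city blocks, each holding 100 families
blocks = {block_id: [f"family_{block_id}_{i}" for i in range(100)]
          for block_id in range(500)}

# Stage 1: randomly select 25 whole blocks
chosen_blocks = random.sample(list(blocks), k=25)

# Stage 2: survey *every* family on the chosen blocks
surveyed = [family for b in chosen_blocks for family in blocks[b]]

print(len(surveyed))  # 2,500 families, but drawn from just 25 locations
```

Notice the trade-off the code makes visible: 2,500 observations, but only 25 distinct locations, which is why precision suffers when neighbors resemble each other.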
Convenience Sampling
A convenience sample includes whoever is easiest to reach. It's the most common type of sampling in the real world — and the most dangerous.
If Dr. Chen surveys families who visit a particular clinic, or posts a survey on a community Facebook group, or asks for volunteers at a school PTA meeting, she's collecting a convenience sample. The people who show up are not randomly selected — they're self-selected, and they're likely systematically different from the broader population.
Why it's tempting: It's fast, cheap, and easy. Many student research projects, online polls, and viral surveys use convenience samples.
Why it's dangerous: Convenience samples almost always suffer from systematic bias. People who visit a clinic may be sicker than average. People on Facebook may be younger and more connected. PTA members may be wealthier and more engaged. Any conclusions drawn from these samples may not apply to the broader population.
The uncomfortable truth: Most online polls, social media surveys, and "person on the street" interviews are convenience samples. When a news website asks "Do you think the economy is improving?" and reports that "73% of respondents said yes," that number means almost nothing — because only people who felt strongly enough to click responded, and the website's audience isn't representative of the general public.
Systematic Sampling
Systematic sampling selects members at regular intervals from a list. If Dr. Chen has a list of 50,000 families and wants 500, she could pick every 100th family on the list (50,000 / 500 = 100). She'd randomly choose a starting point between 1 and 100, then select every 100th family from there.
Why it works: It's simpler than true random sampling (no random number generator needed) and usually produces results similar to a random sample, as long as there's no hidden pattern in the list that aligns with the sampling interval.
The catch: If the list has a periodic pattern that matches your sampling interval, you could get a badly biased sample. For example, if apartments in a building are numbered such that every 10th unit is a corner unit (larger, more expensive), sampling every 10th unit would give you only corner units.
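The random start is what keeps systematic sampling honest: without it, the researcher's choice of starting point could itself introduce bias. A sketch of the every-100th procedure:

```python
import random

random.seed(7)
families = list(range(1, 50_001))  # ordered list of 50,000 family IDs
k = len(families) // 500           # sampling interval: 50,000 / 500 = 100

start = random.randrange(k)  # random starting point among the first 100
sample = families[start::k]  # then every 100th family from there

print(k, len(sample))  # interval 100, sample size 500
```

This is why the periodic-pattern danger is real: `families[start::100]` hits exactly one position in every group of 100, so any trait that repeats with period 100 is either always included or never included.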
The Sampling Methods Comparison
| Method | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Simple Random | Every member has equal chance | Unbiased; simple to understand | Need full list; may miss subgroups | General-purpose when a list exists |
| Stratified | Random sample within defined subgroups | Guarantees subgroup representation; often more precise | Must know strata in advance | Studies comparing subgroups |
| Cluster | Randomly select entire groups | Cost-effective; practical for large areas | Less precise; clusters may be homogeneous | Large, geographically spread populations |
| Convenience | Whoever is easiest to reach | Cheap; fast | High bias risk; not generalizable | Pilot studies; exploratory research only |
| Systematic | Every kth member from a list | Simple to implement | Risk of periodic bias | When random sampling is impractical |
Check Your Understanding (try to answer before looking)
- A professor surveys every student in three randomly selected sections of Intro to Psychology (out of 20 sections). What type of sampling is this?
- A market researcher stands outside a shopping mall and asks passing shoppers about their purchasing habits. What type of sampling is this?
- Why might a stratified sample of 500 be more useful than a simple random sample of 500 for studying asthma rates across income levels?
Verify
- Cluster sampling — the sections are the clusters, and everyone within selected clusters is surveyed.
- Convenience sampling — the researcher is surveying whoever happens to be at that location at that time. The sample is not random and likely overrepresents frequent mall shoppers.
- A stratified sample guarantees representation of each income level. A simple random sample of 500 might, by chance, include very few low-income families — the group with the highest asthma rates and the most important to study. Stratifying ensures each income level has enough observations for meaningful analysis.
4.3 Bias: When Your Data Lies to You
Bias is a systematic tendency for the data collection process to produce results that are consistently wrong in a particular direction. Notice the word "systematic" — this isn't about random fluctuations. Bias is a structural problem, and it doesn't go away with larger samples. If anything, larger samples just give you a more precise wrong answer.
Let me show you what I mean.
Selection Bias
Selection bias occurs when the way you choose your sample systematically excludes certain types of people. The sampling method itself introduces a distortion.
Dr. Chen wants to study asthma in low-income communities, so she partners with a free clinic that serves those neighborhoods. But here's the problem: the families who visit the clinic are the ones who already have health issues or are proactive about healthcare. Families who are working multiple jobs, who lack transportation, or who distrust the medical system may never show up. By recruiting through the clinic, Maya's sample overrepresents health-engaged families and underrepresents the most vulnerable ones.
Response Bias
Response bias occurs when the way you ask questions influences the answers you get. This can happen through:
- Leading questions: "Don't you agree that air pollution is a serious problem?" vs. "How concerned are you about air pollution?" The first question pushes respondents toward a particular answer.
- Social desirability: People tend to overreport "good" behaviors (exercise, voting, recycling) and underreport "bad" ones (smoking, drug use, prejudice). If you ask parents, "Do you smoke around your children?" some will say no even when the answer is yes.
- Recall bias: People don't remember accurately. If Maya asks, "How many times did your child have an asthma attack in the past year?" parents might undercount or overcount depending on how severe the attacks were.
Nonresponse Bias
Nonresponse bias occurs when the people who don't respond are systematically different from those who do. If Dr. Chen mails a survey to 1,000 families and only 200 respond, the 800 non-respondents might be fundamentally different — busier, less interested in health research, or more distrustful of institutions.
This isn't just a theoretical problem. In political polling, nonresponse bias has led to major prediction failures. If people who support one candidate are less likely to answer phone polls, the poll will systematically undercount that candidate's support.
Survivorship Bias
Survivorship bias occurs when you only see the data from "survivors" — the successes, the remaining, the visible — and miss the failures, the departed, the invisible.
A classic example: during World War II, military engineers studied bullet holes in planes returning from combat. They found clusters of bullet holes in the fuselage and wings, so they proposed adding armor to those areas. Mathematician Abraham Wald pointed out the flaw: they were only looking at planes that survived. The planes that were hit in the engines and cockpit never made it back. The armor should go where the bullet holes aren't — because those hits were fatal.
In modern contexts: - Looking only at successful companies to identify "what makes companies successful" ignores all the companies that did the same things and failed. - Studying only currently enrolled students to understand "why students succeed" misses everyone who dropped out. - Analyzing only published studies ignores the studies that found no results and were never published.
Connection to AI: Survivorship bias is a major problem in AI training data. If you train a hiring algorithm on data from successful employees, the algorithm learns what made those specific people succeed — but it never sees the qualified candidates who were never hired in the first place. If the historical hiring process was biased against women or minorities, the algorithm learns to replicate that bias. Amazon discovered this the hard way when its resume-screening AI was found to penalize resumes that contained the word "women's" (as in "women's chess club"). The algorithm had been trained on 10 years of successful hires — who were predominantly male.
4.4 The 1936 Literary Digest Disaster: A Cautionary Tale
This is one of the most famous failures in the history of statistics, and it perfectly illustrates why sample quality matters more than sample size.
In 1936, The Literary Digest magazine conducted a massive poll to predict the U.S. presidential election between Franklin Roosevelt and Alf Landon. They mailed questionnaires to approximately 10 million people and received 2.4 million responses — a staggeringly large sample. Based on those responses, they predicted Landon would win in a landslide, 57% to 43%.
Roosevelt won 62% to 38%. The Literary Digest's prediction wasn't just wrong — it was catastrophically wrong, off by nearly 20 percentage points. The magazine folded within two years.
What went wrong? Two devastating sources of bias:
Selection bias: The magazine drew its mailing list from telephone directories and automobile registration records. In 1936, during the Great Depression, telephones and cars were luxuries. The poll systematically excluded lower-income Americans — who overwhelmingly supported Roosevelt and his New Deal policies. The 10-million-person mailing list was not representative of the voting population.
Nonresponse bias: Of the 10 million surveys mailed, only 2.4 million were returned (a 24% response rate). People who felt strongly about the election — particularly those who wanted change (Landon supporters wanting to unseat Roosevelt) — were more likely to return the survey. The non-respondents were systematically different from the respondents.
Meanwhile, a young pollster named George Gallup surveyed just 50,000 people using more scientific methods and correctly predicted Roosevelt's victory. His sample was 48 times smaller but infinitely more useful, because it was better designed.
The lesson: A biased sample of 2.4 million is worse than a well-designed sample of 50,000. Sample quality beats sample quantity, every time.
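You can watch this lesson play out in a toy simulation. The numbers below are invented for illustration (a 62% true Roosevelt share matching the actual result, and made-up inclusion probabilities for the phone/car sampling frame); only the structure of the failure is the point:

```python
import random

random.seed(2024)
# Toy electorate: 1 = Roosevelt voter. True support: 62%.
electorate = [1] * 620_000 + [0] * 380_000
random.shuffle(electorate)

# Hypothetical biased frame: phone/car owners, where Roosevelt voters
# are underrepresented (30% inclusion) vs. Landon voters (70%)
biased_frame = [v for v in electorate
                if random.random() < (0.3 if v else 0.7)]

big_biased = random.sample(biased_frame, 100_000)  # huge but biased
small_fair = random.sample(electorate, 1_000)      # small but random

print(f"biased sample of 100,000: "
      f"{sum(big_biased) / len(big_biased):.0%} for Roosevelt")
print(f"random sample of 1,000:   "
      f"{sum(small_fair) / len(small_fair):.0%} for Roosevelt")
```

The biased sample confidently reports a number far below 62% no matter how large it grows; the small random sample lands within a couple of points of the truth. Sample quality beats sample quantity.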
Spaced Review (Chapter 1): Remember the four pillars of statistical investigation from Chapter 1? They are: (1) ask a question, (2) collect data, (3) analyze data, and (4) interpret results. The Literary Digest failure was a catastrophic problem at which pillar?
Verify
Pillar 2: collect data. The question was fine (who will win the election?), and the analysis was straightforward (count the votes). But the data collection was fatally flawed — the sample was biased. This shows that even with a clear question and competent analysis, bad data collection can ruin everything. No amount of clever analysis can fix a biased sample.
4.5 Why Randomization Matters
You've probably noticed that the word "random" keeps coming up. Random sampling. Random assignment. Randomization. It's not a coincidence — randomization is the single most powerful tool in statistics.
But why? What's so magical about randomness?
Here's the intuition: randomization protects you from biases you don't even know about.
When Dr. Chen randomly selects families for her survey, she doesn't need to worry about whether she's accidentally overrepresenting one ethnic group, income level, housing type, or family size. Random selection, on average, produces a sample that mirrors the population across all characteristics — including ones she hasn't thought of. It's not perfect for any single sample, but it's unbiased: if she repeated the process many times, the averages would converge to the population truth.
And when she randomly assigns families to receive air purifiers (the treatment) or not (the control), she creates two groups that are roughly equivalent in every way — age, income, pre-existing health conditions, housing quality, stress levels, everything. The only systematic difference between the groups is whether they got the air purifier. So if the treatment group shows better asthma outcomes, she can reasonably attribute that improvement to the air purifier rather than to some other difference between the groups.
Randomization serves two distinct purposes:
- In sampling (selecting who to study): Random sampling ensures the sample represents the population, so you can generalize your findings.
- In experiments (deciding who gets what): Random assignment ensures the treatment and control groups are comparable, so you can make causal claims.
These are different roles, and both matter. A study can have random sampling without random assignment (an observational study with a representative sample), random assignment without random sampling (an experiment on a convenience sample), both, or neither. The best studies have both.
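You can see the "protection from biases you don't know about" claim in a simulation. Here each hypothetical family has a hidden risk score that the researcher never looks at; random assignment balances it anyway:

```python
import random

random.seed(3)
# Hypothetical families: each has a pre-existing risk score
# that the researcher can't see and never uses
families = [{"id": i, "risk": random.gauss(50, 10)} for i in range(1_000)]

# Random assignment: shuffle, then split into treatment and control
random.shuffle(families)
treatment, control = families[:500], families[500:]

def avg_risk(group):
    return sum(f["risk"] for f in group) / len(group)

print(f"treatment avg risk: {avg_risk(treatment):.1f}")
print(f"control   avg risk: {avg_risk(control):.1f}")
# The two averages land close together even though 'risk' was never
# used in the assignment -- randomization balances covariates on average.
```

The same logic applies to every other characteristic, observed or unobserved, which is exactly why random assignment supports causal claims.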
4.6 Confounding Variables: The Hidden Threat
Here's the concept that changes how you see the world. Ready?
Ice cream sales and drowning deaths are strongly correlated. In months when ice cream sales go up, drowning deaths also go up. In months when ice cream sales fall, drowning deaths fall too.
Does ice cream cause drowning? Should we ban ice cream to save lives?
Obviously not. There's a third variable lurking behind both: hot weather. When it's hot, people buy more ice cream. When it's hot, people also swim more, and more swimming means more drownings. Temperature drives both variables. Ice cream and drowning are correlated, but neither causes the other.
A confounding variable (or confounder) is a variable that is associated with both the explanatory variable and the response variable, creating a misleading association between them.
This is the concept that makes "correlation does not imply causation" more than just a bumper sticker. Confounding is the mechanism by which correlation misleads us. And it's everywhere:
- People who eat breakfast tend to be healthier. Confounders: people who eat breakfast are also more likely to exercise, sleep well, and have stable routines. Breakfast might not be the cause of better health — it might just be a marker of a healthier lifestyle.
- Children with bigger shoe sizes score higher on reading tests. Confounder: age. Older children have bigger feet and better reading skills. Shoes don't cause reading ability.
- Countries with more Nobel Prize winners also consume more chocolate per capita. Confounder: wealth. Richer countries can afford both chocolate and world-class research institutions.
- Students who sit in the front of the class get better grades. Confounder: motivation. More motivated students choose to sit in the front and study harder. The seat itself doesn't cause better grades.
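You can manufacture a confounded correlation in a few lines. In this toy model (all coefficients invented for illustration), temperature drives both ice cream sales and drownings, and neither variable influences the other:

```python
import random

random.seed(4)
# Toy model: daily temperature drives BOTH variables
days = [random.uniform(40, 95) for _ in range(365)]  # temperature (F)
ice_cream = [2.0 * t + random.gauss(0, 10) for t in days]   # sales
drownings = [0.05 * t + random.gauss(0, 1) for t in days]   # incidents

def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"corr(ice cream, drownings) = {corr(ice_cream, drownings):.2f}")
# Clearly positive -- yet ice cream never appears in the drownings
# formula. Temperature (the confounder) creates the association.
```

The correlation is real and reproducible; the causation is simply not there. That gap is what confounding looks like in data.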
Why Experiments Solve the Confounding Problem
This is where experiments have their decisive advantage over observational studies.
When you randomly assign people to a treatment and a control group, the confounding variables get distributed roughly equally between the two groups. Some motivated students end up in the treatment group, and some end up in the control group. Some wealthy families end up in each group. Some health-conscious people end up in each group. Random assignment doesn't eliminate confounders — it balances them across groups, so they can't create a misleading association.
In an observational study, the groups form naturally, and confounders can run wild. If you compare ice cream buyers to non-buyers, the two groups differ in many ways — not just their ice cream consumption. But in an experiment, if you randomly assigned people to eat ice cream or not (bizarre, but bear with me), the ice cream eaters and non-eaters would be comparable in every other way. Any difference in drowning rates would actually be attributable to the ice cream — which, of course, would show no effect.
Sam Okafor's "Natural Experiment"
Sam Okafor has been tracking Daria Thompson's shooting percentage all season. In the first half, she shot 38% from three-point range. In the second half, after working with a new shooting coach, she's shooting 45%. Did the coaching work?
Sam wants to say yes, but he's aware of confounding variables:
- The team's schedule was easier in the second half (weaker opponents)
- Daria recovered from a nagging wrist injury in January
- The team's starting point guard returned from injury, creating better passing and more open shots for Daria
Daria didn't randomly decide when to start working with the new coach. The coaching change happened at a specific point in the season — a point that coincided with other changes. This makes it a "natural experiment" — an observational study where a treatment-like change occurred due to circumstances rather than random assignment. Natural experiments can be informative, but they don't have the confounding protection that true experiments provide.
To be more confident, Sam would want to see whether other players on the team (who didn't get the new coaching) also improved in the second half. If they did, the improvement might be about the schedule or the returning point guard, not the coaching. If only Daria improved, the coaching explanation becomes more compelling — but not proven.
Check Your Understanding (try to answer before looking)
- A study finds that people who drink moderate amounts of wine live longer than non-drinkers. Name one possible confounding variable.
- Explain why random assignment helps solve the confounding problem.
- Sam's colleague suggests comparing Daria's second-half stats to the team average as a way to "control for" the easier schedule. Is this a good idea? Why or why not?
Verify
- Many confounders are possible: income (wealthier people may both drink wine and have better healthcare), social activity (moderate drinkers may have more active social lives, which is associated with longevity), or overall health consciousness (moderate drinkers may also exercise and eat well). The key is that the confounder must be related to both wine drinking and longevity.
- Random assignment distributes confounders roughly equally across the treatment and control groups. Since every individual has an equal chance of being in either group, the groups will be comparable in all characteristics — observed and unobserved. Any difference in outcomes can then be attributed to the treatment, not to confounders.
- It's a reasonable idea but not perfect. Comparing Daria's improvement to the team average helps account for confounders that affected everyone (easier schedule, returning point guard). If the whole team improved by 5 percentage points but Daria improved by 7, the extra 2 points might be attributable to coaching. But this approach can't account for confounders specific to Daria (her wrist recovery, for example). It's better than ignoring confounders entirely, but worse than a true experiment.
4.7 Designing Experiments: Treatment, Control, and Blinding
Let's move from identifying problems to designing solutions. Alex Rivera at StreamVibe is tasked with answering a specific question: does a new homepage layout increase user engagement?
This is a perfect setup for an experiment. Unlike Maya's asthma study (where ethical constraints limit what you can do) or Washington's policing research (where experiments are logistically impossible), Alex can actually run a controlled experiment. Here's how.
The Anatomy of an Experiment
Every well-designed experiment has these components:
Treatment group: The group that receives the intervention — in Alex's case, users who see the new homepage layout.
Control group: The group that doesn't receive the intervention — users who see the existing homepage. The control group provides a baseline for comparison.
Randomization: Users are randomly assigned to the treatment or control group. This ensures the groups are comparable in every way except the homepage layout they see.
Response variable: What you're measuring — in Alex's case, engagement metrics like watch time, number of sessions, or click-through rate.
Explanatory variable: What you're changing — the homepage layout (new vs. old).
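Why does random assignment make the groups comparable? You can see it in a quick simulation. The sketch below uses made-up users and ages (not StreamVibe's actual system): each simulated user is assigned by coin flip, and a variable the assignment never looks at — age — ends up nearly identical in both groups anyway.

```python
# Toy simulation: random assignment balances a variable it never uses.
import random

random.seed(7)  # make the simulation reproducible
ages = [random.randint(13, 80) for _ in range(20_000)]  # hypothetical user ages

treatment, control = [], []
for age in ages:
    # Coin-flip assignment: age plays no role in which group a user joins
    (treatment if random.random() < 0.5 else control).append(age)

mean_t = sum(treatment) / len(treatment)
mean_c = sum(control) / len(control)
print(f"Mean age - treatment: {mean_t:.1f}, control: {mean_c:.1f}")
```

The two means come out nearly equal, and the same balancing happens for every other characteristic, observed or not. That's the entire power of randomization.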
A/B Testing: Experiments in the Digital Age
What Alex is designing has a name in the tech industry: an A/B test. Group A sees one version, Group B sees another, and you compare the results.
A/B testing is experimentation at scale. Companies like Google, Netflix, Amazon, and Spotify run thousands of A/B tests per year. Google famously tested 41 shades of blue for link colors to see which one got the most clicks. Netflix tests different thumbnail images for the same show to see which one makes you more likely to watch.
Here's Alex's experimental design:
- Define the question: Does the new homepage layout increase average watch time per session?
- Identify the population: All active StreamVibe users
- Random assignment: When a user logs in, StreamVibe's system randomly assigns them to Group A (old layout) or Group B (new layout). Each user sees the same layout every time they log in.
- Measure the outcome: After two weeks, compare the average watch time per session between the two groups.
- Analyze: Use statistical tests (which we'll learn in Chapters 13-16) to determine whether any difference is large enough to be meaningful — or whether it could be due to chance.
Why A/B testing is so powerful: In the digital world, you can run experiments with millions of users, randomly assigned, with outcomes measured automatically. There's no survey to fill out, no researcher to influence the results, no way for users to opt into their preferred group. It's about as clean as experimental design gets.
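The logic of Alex's design can be sketched in a few lines of Python. The numbers here are invented for illustration (a baseline of about 45 minutes of watch time per session, and an assumed 3-minute lift from the new layout); this is a simulation of the pipeline, not StreamVibe's real system.

```python
# Minimal A/B test simulation with invented numbers.
import random

random.seed(42)  # reproducible simulation

def assign_group():
    """Random assignment: each user has a 50/50 chance of either layout."""
    return random.choice(["A", "B"])

watch_time = {"A": [], "B": []}          # minutes per session, by group
for _ in range(10_000):                  # 10,000 simulated users
    group = assign_group()
    minutes = random.gauss(45, 12)       # assumed baseline watch time
    if group == "B":
        minutes += 3                     # assumed effect of the new layout
    watch_time[group].append(minutes)

mean_a = sum(watch_time["A"]) / len(watch_time["A"])
mean_b = sum(watch_time["B"]) / len(watch_time["B"])
print(f"Old layout (A): {mean_a:.1f} min   New layout (B): {mean_b:.1f} min")
```

Because assignment is random, a stable difference between the group means reflects the layout change rather than a confounder. Judging whether a difference this size could be due to chance alone is exactly what the statistical tests in Chapters 13-16 are for.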
The Placebo Effect and Why Blinding Matters
Here's a fascinating quirk of human psychology: sometimes, just believing you're receiving a treatment makes you feel better — even if the treatment is fake.
A placebo is an inactive treatment that looks identical to the real treatment. In drug trials, the placebo might be a sugar pill that looks exactly like the real medication. In Alex's A/B test, there's no placebo issue — users don't know they're in an experiment (which is itself a form of blinding).
The placebo effect is real and powerful. In some pain studies, placebos reduce pain by 30-40%. If you give a group of patients a real painkiller and compare them to a group that gets nothing, you might see a big improvement — but you can't tell how much of that improvement is the drug and how much is the placebo effect. That's why experiments need a placebo group: the comparison should be "drug vs. placebo," not "drug vs. nothing."
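The arithmetic behind that comparison is worth making explicit. With illustrative numbers (not from any real trial), suppose pain scores improve by 40 points on the drug, 25 points on a placebo, and 10 points with no treatment at all (natural recovery):

```python
# Illustrative numbers only: how the choice of comparison changes the story.
drug_improvement = 40          # drug effect + placebo effect + natural recovery
placebo_improvement = 25       # placebo effect + natural recovery
no_treatment_improvement = 10  # natural recovery alone

# Naive comparison (drug vs. nothing) overstates the drug's effect
naive_effect = drug_improvement - no_treatment_improvement

# Proper comparison (drug vs. placebo) isolates the drug itself
true_drug_effect = drug_improvement - placebo_improvement

print(f"Drug vs. nothing: {naive_effect} points")
print(f"Drug vs. placebo: {true_drug_effect} points")
```

The naive comparison credits the drug with 30 points of improvement, but half of that is the placebo effect and natural recovery. Only the drug-vs.-placebo comparison isolates what the drug itself does.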
Blinding is the practice of keeping participants (and sometimes researchers) unaware of which group they're in.
- Single-blind: The participants don't know whether they're getting the treatment or the placebo, but the researchers do.
- Double-blind: Neither the participants nor the researchers who interact with them know which group is which. This prevents researchers from unconsciously treating the groups differently — for example, being more encouraging to patients they know are getting the real drug.
A double-blind study is one in which neither the participants nor the researchers who interact with them know which group each participant belongs to. It's the gold standard for minimizing bias in experiments.
Why Can't We Always Run Experiments?
If experiments are so great, why don't we always use them?
Because sometimes it would be unethical, impossible, or impractical. Here are the boundaries:
- Some experiments would be unethical. You can't randomly assign people to smoke to study the effects of smoking on lung cancer. You can't randomly assign children to drink lead-contaminated water. You can't randomly assign communities to live near toxic waste dumps. Ethics prohibit experiments that would deliberately harm participants.
- Some variables can't be randomly assigned. You can't randomly assign someone's gender, race, age, or socioeconomic status. These are characteristics, not treatments. Studies involving these variables are necessarily observational.
- Some experiments are impractical. You can't randomly assign people to different political systems to study the effects of democracy vs. authoritarianism. You can't randomly assign the weather to study climate effects on agriculture.
This is why both observational studies and experiments are essential in the scientific toolkit. They're complementary, not competing.
Ethical Analysis: Informed Consent
Here's an ethical question worth sitting with: Alex's A/B test at StreamVibe runs without users knowing they're in an experiment. Is that ethical?
Most tech companies argue that A/B testing falls under normal product optimization — users agree to potential interface changes when they accept the terms of service. But critics point out that some A/B tests go further. In 2014, Facebook ran an experiment on 689,003 users, manipulating their news feeds to contain more positive or more negative content, then measuring whether this affected the emotional tone of users' own posts. Users were not informed. The study was published in a scientific journal and provoked enormous backlash.
The core principle in research ethics is informed consent — the idea that participants should know they're in a study and agree to participate. This principle emerged from horrific historical abuses: the Tuskegee syphilis study (where Black men with syphilis were deliberately left untreated for decades), Nazi medical experiments, and others.
Today, any study involving human subjects at a university must be approved by an Institutional Review Board (IRB). But tech companies' internal experiments often fall outside this oversight. As data collection and experimentation become more pervasive, the question of where to draw the line becomes increasingly important.
Discussion questions:
1. Is there a meaningful difference between testing two shades of blue for a button and testing whether emotional content changes user behavior?
2. Should online A/B tests require informed consent? What would be the practical consequences?
3. What safeguards should exist for experiments conducted by companies rather than researchers?
4.8 Evaluating Causal Claims: A Checklist
You've now seen the building blocks: observational studies vs. experiments, sampling methods, bias, confounding, randomization, and blinding. Let's put it all together into a practical tool you can use whenever someone says "X causes Y."
The Causal Claims Checklist
When you encounter a claim of causation — in a news article, a research paper, an advertisement, or a social media post — run through these questions:
1. Was this an observational study or an experiment?
   - If observational: the study can show association but not causation. Be skeptical of causal language.
   - If experimental: causation is possible, but check the next questions.
2. If it was an experiment, was there random assignment?
   - Random assignment is what allows causal conclusions. Without it, confounders may explain the results.
3. Was there a control group?
   - Without a baseline comparison, you can't tell whether the treatment had an effect or whether the outcome would have happened anyway.
4. Was blinding used?
   - Single-blind? Double-blind? Unblinded? The less blinding, the more room for bias.
5. Can you think of a confounding variable?
   - If you can think of a plausible third variable that explains the association, be cautious. If the study controlled for known confounders, that strengthens the claim.
6. How was the sample selected?
   - Random sample → results may generalize to the population.
   - Convenience sample → results may only apply to the specific group studied.
7. How large was the sample?
   - Larger samples give more precise estimates (we'll formalize this in Chapter 11).
   - But sample quality matters more than sample size.
8. Has the finding been replicated?
   - A single study is a single data point. Replication by independent researchers strengthens confidence enormously.
Research Study Breakdown
Let's apply this checklist to a real study.
Study: Researchers at Harvard tracked 120,000 nurses over 30 years, recording their diet, exercise, and health outcomes. They found that women who ate more processed meat had a higher risk of heart disease.
1. Observational or experiment? Observational. The researchers didn't tell nurses what to eat — they observed existing behavior. (The Nurses' Health Study is one of the most famous observational studies in medical history.)
2. Random assignment? No — this is observational. Nurses chose their own diets.
3. Control group? No formal control, but the researchers compared nurses with different eating habits.
4. Blinding? Not applicable (observational study).
5. Confounders? Many possible: nurses who eat a lot of processed meat might also exercise less, smoke more, have lower incomes, or eat fewer vegetables. The researchers tried to statistically adjust for these confounders, which strengthens the finding — but can never fully eliminate confounding.
6. Sample? 120,000 is a large sample, but all nurses, all women, mostly white — not necessarily representative of the general population. Results might not apply to men or to non-healthcare-workers.
7. Sample size? Very large (120,000), and followed over 30 years, which is impressive.
8. Replication? Yes — many other studies (including some with men) have found similar associations, strengthening the conclusion.
Bottom line: This study provides strong evidence that processed meat is associated with heart disease, but as an observational study, it cannot definitively prove causation. The large sample, long follow-up, and extensive replication make the finding highly credible, but some uncertainty remains due to potential unmeasured confounders.
4.9 Connection to AI: Training Data as a Sample
Here's something that ties this entire chapter to the world you live in: every AI model is built on a sample.
When OpenAI trained GPT, it used a massive dataset of text from the internet. When Spotify trains its recommendation algorithm, it uses listening data from its users. When a hospital trains a diagnostic AI, it uses medical images from its patient records. In every case, the training data is a sample of the broader universe of possible data — and all the problems we've discussed in this chapter apply.
Biased sampling → Biased AI. If the training data overrepresents certain groups and underrepresents others, the AI inherits those biases. Facial recognition systems trained primarily on white faces perform worse on faces with darker skin tones. Language models trained primarily on English text perform worse in other languages. Medical AI trained on data from wealthy hospitals may miss diagnoses common in underserved communities.
Confounding in AI. If a criminal justice algorithm is trained on data from a system where Black defendants have historically received harsher sentences, the algorithm may learn that race predicts recidivism — not because it does, but because the training data confounds race with systemic bias. This is exactly the problem Professor Washington is studying.
Convenience sampling in AI. Most AI training data is a convenience sample — whatever data was available. Internet text overrepresents English speakers, men, younger demographics, and people with internet access. Medical data overrepresents patients at large teaching hospitals. Social media data overrepresents the loudest voices. These are not random samples of humanity.
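A small simulation makes the danger concrete. Suppose (with invented numbers) that half a population is "online" and half "offline," and the two halves differ on the trait we want to estimate. A convenience sample that only reaches online people — which is roughly how much training data is gathered — misses badly, while a random sample of the same size does not.

```python
# Toy demonstration: convenience sampling vs. random sampling.
import random

random.seed(0)
population = []
for _ in range(50_000):
    online = random.random() < 0.5                     # half the population is online
    approves = random.random() < (0.30 if online else 0.60)  # trait differs by group
    population.append((online, approves))

true_rate = sum(a for _, a in population) / len(population)

# Convenience sample: an internet poll only reaches online people
convenience = [a for online, a in population if online][:2_000]
# Simple random sample: every member has an equal chance of selection
srs = [a for _, a in random.sample(population, 2_000)]

print(f"True rate:             {true_rate:.0%}")
print(f"Convenience estimate:  {sum(convenience) / len(convenience):.0%}")
print(f"Random-sample estimate: {sum(srs) / len(srs):.0%}")
```

The convenience estimate lands near 30% when the truth is about 45% — and no amount of extra online respondents would fix it, because the error is systematic, not random.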
The key takeaway: The principles of sampling and study design don't just apply to surveys and clinical trials. They apply to every AI system that makes decisions about your life. Understanding bias, confounding, and the limitations of observational data makes you a better consumer of AI — and a better citizen in a world increasingly shaped by algorithms.
4.10 Project Checkpoint: Evaluating Your Dataset's Collection Method
Time to apply what you've learned to your Data Detective Portfolio. Open your Jupyter notebook from Chapter 3 and add a new section.
Your Tasks
1. Create a new Markdown cell with the heading: "Data Collection Evaluation"
2. Research how your dataset was collected. Answer these questions in text cells:
- Who collected the data? (Government agency? Researchers? A company? Volunteers?)
- How were the participants or observations selected? (Random sample? Convenience sample? Census? Voluntary response?)
- What is the target population? (Who is the dataset trying to represent?)
- What is the actual sample? (Who is actually included in the dataset?)
- Is this an observational study or an experiment?
3. Identify at least two potential sources of bias:
- Could there be selection bias? (Who might be systematically excluded?)
- Could there be response bias? (Are there questions where people might not answer honestly?)
- Could there be nonresponse bias? (Are non-respondents likely different from respondents?)
- Could there be survivorship bias? (Are you only seeing the "survivors"?)
4. Write a paragraph assessing what conclusions your dataset can and cannot support.
For example: "This dataset is a stratified random sample of U.S. adults, which supports generalizing to the U.S. adult population. However, it's an observational study, so associations between variables (like the association between smoking and BMI) cannot be interpreted as causal."
5. (Optional) If your dataset involved human subjects, identify any ethical considerations:
- Were participants likely aware their data was being collected?
- Was the data anonymized?
- Could the data be used in ways participants didn't consent to?
Suggested datasets and what to look for:
- CDC BRFSS: Phone survey — who doesn't have phones? Who doesn't answer? Who drops out?
- Gapminder: Country-level data — what about within-country variation? How reliable are statistics from all countries?
- World Happiness Report: Self-reported happiness — cultural differences in reporting? Translation issues?
- College Scorecard: Administrative data — only includes Title IV institutions. What's missing?
- NOAA Climate Data: Sensor-based — where are sensors located? Are some regions underrepresented?
4.11 Spaced Review: Strengthening Previous Learning
These questions revisit concepts from earlier chapters at expanding intervals, helping you build long-term retention.
SR.1 (From Chapter 1 — Four Pillars): The Literary Digest poll of 1936 failed spectacularly at Pillar 2 (data collection). Describe a scenario where someone might have an excellent sample (Pillar 2) but still reach wrong conclusions because of a failure at Pillar 4 (interpreting results). Hint: think about confounding.
Check your thinking
A researcher conducts a perfectly randomized survey and finds a strong correlation between coffee consumption and lower rates of depression. The sample is representative (Pillar 2 is solid), but the researcher interprets this as "coffee prevents depression" — a causal claim from observational data. This is a Pillar 4 failure: misinterpreting an association as causation. Possible confounders include social activity (people who drink coffee may also spend more time socializing in coffee shops) or personality differences. Good data collection doesn't guarantee good interpretation.

SR.2 (From Chapter 2 — Parameter vs. Statistic): In the 1936 Literary Digest poll, 57% of respondents said they would vote for Landon. Is 57% a parameter or a statistic? What was the actual parameter the poll was trying to estimate? Why was the statistic so far from the parameter?
Check your thinking
The 57% is a **statistic** — it describes the sample (the 2.4 million respondents). The **parameter** the poll was trying to estimate was the true proportion of all voters who would vote for Landon — which turned out to be about 38%. The statistic was far from the parameter because of **bias** in both the sampling method (selection bias from using phone and car registration lists) and in who responded (nonresponse bias). The sample was not representative of the population, so the statistic was a poor estimate of the parameter.

SR.3 (From Chapter 2 — Cross-Sectional vs. Longitudinal): Dr. Chen could study asthma in two ways: (a) survey families across 50 neighborhoods at one point in time, or (b) follow 500 families over five years, measuring asthma outcomes annually. Which approach is cross-sectional and which is longitudinal? What advantage does the longitudinal approach offer for studying the effect of an intervention like air purifiers?
Check your thinking
Approach (a) is **cross-sectional** — a snapshot of many groups at one point in time. Approach (b) is **longitudinal** — following the same individuals over time. The longitudinal approach is better for studying interventions because you can compare each family's asthma outcomes *before and after* receiving the air purifier, using each family as its own control. This reduces confounding because pre-existing differences between families (income, housing quality, genetics) are held constant. In a cross-sectional study, you'd compare different families with and without purifiers, and all those pre-existing differences become potential confounders.

Chapter Summary
Let's step back and see the big picture of what you've learned.
The Big Ideas
- The way data is collected determines what it can tell you. No analysis technique — no matter how sophisticated — can overcome a fundamentally flawed data collection process.
- Observational studies show association; experiments show causation. This is the single most important distinction in study design. When someone claims "X causes Y," your first question should be: was this an experiment with random assignment?
- Confounding variables create misleading associations. A confounder is related to both variables you're studying, making it look like one causes the other. Ice cream doesn't cause drowning. Shoes don't cause reading ability. Randomization in experiments is the best defense against confounding.
- Sample quality beats sample quantity. A biased sample of millions is worse than a well-designed sample of thousands. The 1936 Literary Digest poll proved this dramatically.
- Bias is systematic, not random. Selection bias, response bias, nonresponse bias, and survivorship bias all push your results in a particular direction — and more data doesn't fix the problem.
Key Terms
| Term | Definition |
|---|---|
| Observational study | A study that observes and measures without intervening |
| Experiment | A study that deliberately imposes a treatment to observe responses |
| Random sample | A sample in which every member of the population has an equal chance of being selected |
| Stratified sampling | Dividing the population into subgroups and randomly sampling within each |
| Cluster sampling | Randomly selecting entire groups (clusters) to include |
| Convenience sample | Sampling whoever is easiest to reach (high bias risk) |
| Systematic sampling | Selecting every kth member from a list |
| Bias | A systematic tendency to produce results that are wrong in a particular direction |
| Confounding variable | A variable associated with both the explanatory and response variables, distorting their apparent relationship |
| Randomization | Using chance to select samples or assign treatments, protecting against known and unknown biases |
| Control group | The group in an experiment that does not receive the treatment |
| Treatment group | The group in an experiment that receives the treatment |
| Placebo | An inactive treatment that looks identical to the real treatment |
| Blinding | Keeping participants (and/or researchers) unaware of group assignments |
| Double-blind | A study in which neither participants nor researchers know who is in which group |
What's Next
You now have a critical eye for data collection. You can look at any study, any headline, any AI system, and ask: How was this data collected? Is the sample representative? Could there be confounders? Can this support causal conclusions?
In Chapter 5: Exploring Data: Graphs and Descriptive Statistics, you'll learn to see data — to create histograms, bar charts, and scatterplots that reveal patterns, outliers, and relationships that summary statistics alone can miss. You'll combine your pandas skills from Chapter 3 with new visualization libraries (matplotlib and seaborn) to bring your data to life.
In Chapter 6, you'll dive deep into numerical summaries — mean, median, standard deviation, and percentiles — the numbers that quantify the patterns your graphs reveal.
And every time you create a graph or calculate a statistic, the concepts from this chapter will be running in the background: How was this data collected? Who is in the sample? What can I actually conclude? Those questions never go away. They're the foundation of honest statistical practice.
The tools are in your hands. The vocabulary is in place. The critical eye is sharpening. Now let's see what the data looks like.