Learning Objectives

  • Identify common ways statistics are misused or misrepresented
  • Apply ethical frameworks to data collection, analysis, and reporting
  • Evaluate the ethical implications of data-driven decision-making
  • Recognize p-hacking, HARKing, and other questionable research practices
  • Develop a personal code of statistical ethics

Chapter 27: Lies, Damn Lies, and Statistics: Ethical Data Practice

"There are three kinds of lies: lies, damned lies, and statistics." — Often attributed to Mark Twain (who attributed it to Benjamin Disraeli, who probably didn't say it either — which is itself a lesson about checking your sources)

Chapter Overview

Let me tell you about a study that changed the world.

In 1932, the United States Public Health Service began a research project in Macon County, Alabama. The researchers told 399 Black men with syphilis that they were being treated for "bad blood." They were not being treated. For forty years — forty years — these men were deliberately denied treatment, even after penicillin became the standard cure in 1947. The researchers wanted to observe the natural progression of untreated syphilis. Participants were never told they had syphilis. They were never given the option to leave. They were never offered treatment that could have saved their lives.

The Tuskegee Syphilis Study wasn't bad science in the technical sense. The data collection was meticulous. The sample was well-defined. The longitudinal design was rigorous. By every metric we've discussed in this textbook — clear research question, consistent methodology, systematic data collection — it was a well-designed study.

And it was an atrocity.

This is the chapter where we confront the hardest truth in statistics: knowing how to do something doesn't tell you whether you should. Every technique you've learned — from calculating means to building regression models to testing hypotheses — can be used ethically or unethically. The math doesn't care. The formulas work the same whether you're curing diseases or deceiving people. The ethics live in the choices you make: what questions you ask, whose data you collect, how you analyze it, what you report, and what you leave out.

We've been building toward this chapter for twenty-six chapters. In Chapter 4, we introduced informed consent and the IRB. In Chapter 6, we saw how choosing the mean versus the median can tell different stories about the same population. In Chapter 7, we learned that every data cleaning decision affects whose stories get told. In Chapter 9, the prosecutor's fallacy showed how misused statistics can send innocent people to prison. In Chapter 13, we confronted p-hacking and the garden of forking paths. In Chapter 17, we examined the replication crisis and publication bias.

Now we tie all of those threads together.

This chapter is different from the others. There's less math and more judgment. Fewer formulas and more arguments. The Bloom's ceiling here is evaluate — which means I'm going to ask you not just to understand ethical principles but to apply them to messy, real-world situations where reasonable people disagree.

Here's why this matters right now: Maya is wrestling with a public health dataset that could save lives — but publishing it could also identify specific communities and expose them to stigma. Alex is analyzing the ethics of A/B tests that manipulate user emotions without consent. James is confronting the full weight of algorithmic decision-making in criminal justice, where statistical models literally determine who goes free and who goes to jail. And Sam is watching cherry-picked statistics being used to inflate a player's value in contract negotiations.

Each of them is facing a question that no formula can answer.


Fast Track: If you're already familiar with research ethics basics (IRB, informed consent) from Chapter 4, skim Sections 27.5–27.6 and focus on Sections 27.2 (Simpson's paradox), 27.3 (questionable research practices), and 27.8 (ethical frameworks). Complete quiz questions 1, 7, and 15 to verify.

Deep Dive: After this chapter, work through Case Study 1 (Maya's public health data dilemma) for a nuanced analysis of privacy versus public good, then Case Study 2 (James's algorithmic decision-making) for a full ethical evaluation of data-driven criminal justice. Both include structured debate frameworks.


27.1 A Puzzle Before We Start (Productive Struggle)

Before we define any terms, look at this data.

The Berkeley Admissions Puzzle

In 1973, the University of California, Berkeley was sued for gender discrimination in graduate admissions. The numbers seemed damning:

        Applicants   Admitted   Admission Rate
Men     8,442        3,738      44%
Women   4,321        1,494      35%

That's a 9-percentage-point gap. Men were admitted at a significantly higher rate than women. Case closed?

Not so fast. When a statistician named Peter Bickel and his colleagues broke the data down by department, they found something astonishing. In four of the six largest departments, women were admitted at a higher rate than men. In most of the remaining departments, the rates were roughly equal.

Here's a simplified version of two departments:

Department             Men Applied   Men Admitted   Men Rate   Women Applied   Women Admitted   Women Rate
A (easy to get into)   825           512            62%        108             89               82%
B (hard to get into)   560           201            36%        593             202              34%

Women had a higher admission rate in Department A. Women had a nearly identical rate in Department B. And yet, overall, women had a lower admission rate.

(a) How is this mathematically possible? (Hint: look at where women applied.)

(b) Does the aggregate data or the department-level data tell the "true" story?

(c) If you were the university's lawyer, which data would you present? If you were the plaintiff's lawyer? What does that choice say about ethics?

Take 5 minutes. This puzzle will change how you think about data forever.

Here's what happened: women disproportionately applied to Department B, which was highly competitive and rejected most applicants regardless of gender. Men disproportionately applied to Department A, which accepted most applicants. When you aggregate the data, it looks like discrimination against women. But the department-level data shows the opposite.

The aggregate data told a lie. Not because anyone fabricated numbers — every number in the table was true. But because the aggregate data hid a confounding variable: the department of application. Women's lower overall admission rate was driven by their choice to apply to more selective departments, not by discrimination within departments.

This is called Simpson's paradox, and it is the threshold concept for this chapter.


27.2 Simpson's Paradox: When Aggregation Lies

Threshold Concept Alert

Simpson's paradox occurs when a trend that appears in several different groups of data reverses or disappears when the groups are combined. It is one of the most counterintuitive results in all of statistics, and it has profound ethical implications: the same data can tell opposite stories depending on whether you look at it in aggregate or broken down by subgroups.

The Formal Definition

Simpson's paradox is the phenomenon in which a statistical trend that exists in multiple subgroups reverses or vanishes when the subgroups are combined. It occurs when a lurking variable (the confounding factor) is unevenly distributed across the groups being compared.

In the Berkeley case:

  • The lurking variable was department selectivity
  • Women applied more heavily to selective departments (low admission rates for everyone)
  • Men applied more heavily to less selective departments (high admission rates for everyone)
  • Aggregating across departments mixed the effect of who applied where with the effect of how each department treated applicants

Why It Matters Ethically

Here's the part that should keep you up at night. Both stories — "Berkeley discriminated against women" and "Berkeley actually favored women" — are supported by real data. Neither involves fabrication. The ethical question is: which level of analysis is the right one to report?

The answer depends on the question you're asking:

  • "Did the university as a system disadvantage women?" → The aggregate data is relevant, because women as a group were admitted at a lower rate, regardless of the mechanism.
  • "Did individual departments discriminate against women?" → The department-level data is relevant, because it isolates the treatment within comparable groups.

Neither answer is wrong. But presenting only one without acknowledging the other is ethically problematic. If you're a university administrator presenting the department-level data to dismiss a lawsuit, you're hiding the systemic outcome. If you're a plaintiff's lawyer presenting only the aggregate data, you're hiding the actual mechanism.

Key Insight: Simpson's paradox teaches us that ethical data practice requires transparency about the level of aggregation. Whenever you present aggregate statistics, ask yourself: could this trend reverse at a finer level of analysis? And if you present subgroup data, ask: am I losing sight of the bigger picture?

Simpson's Paradox in Medicine

Simpson's paradox isn't just an academic curiosity. It shows up in life-or-death decisions.

In the 1980s, a study compared two treatments for kidney stones: open surgery (Treatment A) and a new, less invasive procedure (Treatment B). Here's the aggregate data:

Treatment Success Rate
A (open surgery) 78% (273/350)
B (new procedure) 83% (289/350)

Treatment B looks better, right? But when researchers broke the data down by stone size:

               Treatment A     Treatment B
Small stones   93% (81/87)     87% (234/270)
Large stones   73% (192/263)   69% (55/80)

Treatment A was better for both small stones and large stones. The aggregate data was misleading because doctors tended to use the more aggressive Treatment A for large stones (harder cases) and Treatment B for small stones (easier cases). The confounding variable — severity of illness — was unevenly distributed across treatments.

If you were a patient relying on the aggregate data, you'd choose the wrong treatment.
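A quick way to see the reversal is to recompute the rates directly from the stratum counts in the tables above. A minimal check in plain Python:

```python
# Kidney stone counts from the tables above: (successes, cases)
small = {'A': (81, 87),   'B': (234, 270)}
large = {'A': (192, 263), 'B': (55, 80)}

for t in ('A', 'B'):
    s_succ, s_n = small[t]
    l_succ, l_n = large[t]
    overall = (s_succ + l_succ) / (s_n + l_n)
    print(f"Treatment {t}: small {s_succ / s_n:.0%}, "
          f"large {l_succ / l_n:.0%}, overall {overall:.0%}")

# Treatment A wins in BOTH strata yet loses overall, because A was
# assigned the harder (large-stone) cases far more often.
```

The weighted averages do all the work: A's overall rate is dragged down by its many large-stone cases, while B's is propped up by its many small-stone cases.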

Connection to Chapter 4 (Spaced Review SR.2): This is confounding in action — the same concept we introduced in Chapter 4 when we discussed how confounding variables can make observational studies misleading. Simpson's paradox is the most dramatic consequence of uncontrolled confounding. In Chapter 4, we said that randomized experiments protect against confounding by distributing lurking variables evenly across groups. Simpson's paradox shows exactly what can go wrong when randomization isn't possible.

Recognizing Simpson's Paradox

Simpson's paradox can occur whenever:

  1. You're comparing groups (men vs. women, treatments, time periods)
  2. There's a third variable (department, severity, demographics) that is unevenly distributed across the groups
  3. The third variable is correlated with the outcome

The solution isn't always to disaggregate. Sometimes the aggregate story is the one that matters. The solution is to check both levels and be transparent about what each one shows.

import pandas as pd
import numpy as np

# ============================================================
# SIMPSON'S PARADOX: UC BERKELEY ADMISSIONS
# Demonstrating how aggregation can reverse a trend
# ============================================================

# Simplified data from Bickel, Hammel, and O'Connell (1975)
data = {
    'Department': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Gender': ['Male', 'Female', 'Male', 'Female',
               'Male', 'Female', 'Male', 'Female'],
    'Applicants': [825, 108, 560, 593, 325, 417, 417, 375],
    'Admitted': [512, 89, 201, 202, 120, 132, 138, 131]
}

df = pd.DataFrame(data)
df['Rate'] = (df['Admitted'] / df['Applicants'] * 100).round(1)

# Aggregate view: the "discriminatory" story
print("=" * 50)
print("AGGREGATE VIEW")
print("=" * 50)
agg = df.groupby('Gender')[['Applicants', 'Admitted']].sum()
agg['Rate'] = (agg['Admitted'] / agg['Applicants'] * 100).round(1)
print(agg)
print()

# Department-level view: the "no discrimination" story
print("=" * 50)
print("DEPARTMENT-LEVEL VIEW")
print("=" * 50)
dept = df.pivot_table(
    values=['Applicants', 'Admitted', 'Rate'],
    index='Department',
    columns='Gender'
)
print(dept.round(1))
print()

# The key insight: where did each gender apply?
print("=" * 50)
print("APPLICATION PATTERNS (the lurking variable)")
print("=" * 50)
for gender in ['Male', 'Female']:
    subset = df[df['Gender'] == gender]
    total = subset['Applicants'].sum()
    print(f"\n{gender} applicants:")
    for _, row in subset.iterrows():
        pct = row['Applicants'] / total * 100
        print(f"  Dept {row['Department']}: {pct:.1f}% "
              f"(admission rate: {row['Rate']}%)")

27.3 The Ecological Fallacy: The Danger of Group-Level Reasoning

Simpson's paradox has a close cousin that's equally dangerous: the ecological fallacy.

The ecological fallacy occurs when you draw conclusions about individuals based on data about groups. It sounds subtle, but it's one of the most common — and most harmful — errors in applied statistics.

A Concrete Example

Imagine you're analyzing health data and you find that states with higher ice cream consumption have higher rates of heart disease. Would it be reasonable to conclude that eating ice cream causes heart disease?

That would be the ecological fallacy at work. The data is about states (groups), not about individuals. It's entirely possible — even likely — that within each state, the people eating the most ice cream are not the same people experiencing heart disease. The state-level correlation could be driven by a confounding variable like average income, temperature, or urbanization.

Here's a more consequential example. In the 1950s, sociologist William S. Robinson compared literacy rates and immigrant populations across U.S. states. States with more immigrants had higher literacy rates. Does that mean immigrants were more literate? No — Robinson showed that immigrants were less literate than native-born Americans on average. The state-level pattern existed because immigrants clustered in states (like New York and California) that also had highly educated native-born populations.

If a policymaker had used the state-level data to argue "immigration improves literacy," they would have been dead wrong — not because the data was fake, but because the data described states, not people.
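To see how the two levels of analysis can disagree, here is a synthetic sketch — invented numbers, not Robinson's actual data — with five "states" whose averages line up perfectly even though the relationship inside every state runs the other way:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical states: the state means of x and y rise together...
state_x = np.array([0., 2., 4., 6., 8.])
state_y = np.array([0., 2., 4., 6., 8.])

xs, ys = [], []
for mx, my in zip(state_x, state_y):
    x = mx + rng.normal(0, 1, 200)
    # ...but WITHIN each state, higher x goes with lower y
    y = my - 0.8 * (x - mx) + rng.normal(0, 0.2, 200)
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)

# Group-level correlation (what an ecological analysis sees)
print("State means:  ", round(np.corrcoef(state_x, state_y)[0, 1], 2))

# Individual-level correlation after removing each state's mean
x_c = x_all - np.repeat(state_x, 200)
y_c = y_all - np.repeat(state_y, 200)
print("Within states:", round(np.corrcoef(x_c, y_c)[0, 1], 2))
```

The group-level correlation is +1.0 while the within-state correlation is strongly negative — the same sign flip Robinson documented with literacy and immigration.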

Why the Ecological Fallacy Is an Ethical Issue

The ecological fallacy becomes an ethical problem when group-level statistics are used to make decisions about individuals:

  • Profiling: "This zip code has high crime rates, so people from this zip code are probably criminals." The crime rate describes the area, not the person.
  • Hiring: "Graduates from this university have lower average GPAs, so this applicant is probably less qualified." The average describes the institution, not the student.
  • Healthcare: "This demographic group has higher rates of substance abuse, so this patient is probably seeking drugs." The statistic describes the group, not the patient.

Connection to Chapter 2 (Spaced Review): In Chapter 2, we discussed the difference between population-level patterns and individual observations. The ecological fallacy is the logical error of confusing these two levels. Every time you apply a group statistic to an individual, you're committing a version of this fallacy.

The Lesson

The ecological fallacy doesn't mean group-level data is useless. It means that group-level data answers group-level questions, and individual-level data answers individual-level questions. Mixing the two up isn't just a statistical error — it's an ethical one when it harms people.


27.4 Lying with True Statistics: How to Deceive Without Fabricating

You don't need to make up numbers to lie with statistics. You just need to be selective.

This section catalogs the most common ways true statistics are used to tell false stories. We've touched on some of these in earlier chapters — misleading graphs in Chapter 5, truncated axes in Chapter 25 — but here we treat them as ethical violations, not just technical mistakes.

Cherry-Picking

Cherry-picking is the practice of selecting data, time ranges, subgroups, or analyses that support your preferred conclusion while ignoring those that don't.

Sam Okafor has seen cherry-picking firsthand. During contract negotiations for the Riverside Raptors, a player's agent presented these statistics:

"In the last 15 games, Daria has shot 42% from three-point range — well above the league average of 36%."

Impressive! But Sam pulled up the full season data:

"Over the full 82-game season, Daria shot 33% from three — below the league average."

The agent selected the 15-game window where Daria happened to run hot. Over 82 games, the numbers tell a different story. Neither number is a lie. But choosing which number to present — and only presenting one — is a form of deception.
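You can watch a cherry-pickable hot streak appear from nothing but chance. This simulation uses invented numbers (a true 33% shooter, roughly six attempts per game, a fixed seed) and then scans for her best 15-game window:

```python
import numpy as np

rng = np.random.default_rng(42)

games = 82
attempts = np.maximum(rng.poisson(6, games), 1)   # ~6 threes attempted per game
makes = rng.binomial(attempts, 0.33)              # true talent: 33%

season = makes.sum() / attempts.sum()
windows = [makes[i:i + 15].sum() / attempts[i:i + 15].sum()
           for i in range(games - 14)]

print(f"Full season:          {season:.1%}")
print(f"Best 15-game window:  {max(windows):.1%}")
# An agent quotes the second number; the first is the honest one.
```

With 68 overlapping windows to choose from, some window almost always beats the season rate by a wide margin — no hot hand required.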

Cherry-picking shows up everywhere:

  • Pharmaceutical companies reporting the one clinical trial that showed a positive result out of five trials conducted
  • Politicians choosing the start date of an economic chart to make their administration look better
  • News outlets quoting the most dramatic statistic from a study while ignoring the caveats
  • Companies using survey questions with cherry-picked wording that guides respondents toward desired answers

The Ethical Test: If you need to select specific data to make your point, ask yourself: Would my conclusion change if I used all of the available data? If yes, you're cherry-picking. If you have a legitimate reason for restricting your data (e.g., the older data was collected under different conditions), state that reason explicitly and let your audience decide.

Misleading with Denominators

One of the most powerful tools for deception is choosing what goes in the denominator.

Consider two ways to describe the same crime data:

Version A: "Violent crime increased 50% this year" (from 100 incidents to 150)

Version B: "The violent crime rate increased from 0.1% to 0.15% of the population"

Version A sounds alarming. Version B sounds trivial. Both are accurate. The denominator — total population versus previous year's count — determines the emotional impact.

This technique is used constantly in health reporting. "The risk doubles!" might mean going from 1 in a million to 2 in a million. In absolute terms, the increase is negligible. In relative terms, it sounds catastrophic.

The Ethical Rule: Always report both absolute and relative numbers. If someone only gives you one, demand the other.
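A small helper makes the rule mechanical. It takes the counts from the crime example above (the population is assumed here to be 100,000, which is what makes the rates match Version B) and prints both framings side by side:

```python
def both_framings(old_count, new_count, population):
    """Print the relative change AND the absolute rates for the same numbers."""
    relative = (new_count - old_count) / old_count
    print(f"Relative: {relative:+.0%} change "
          f"({old_count} -> {new_count} incidents)")
    print(f"Absolute: {old_count / population:.2%} -> "
          f"{new_count / population:.2%} of the population")

both_framings(100, 150, 100_000)   # assumed population of 100,000
# Relative: +50% change (100 -> 150 incidents)
# Absolute: 0.10% -> 0.15% of the population
```

Same change, two emotional registers. Reporting both leaves the interpretation to the reader, where it belongs.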

Survivorship Bias Revisited

Remember survivorship bias from Chapter 4? It's worth revisiting as an ethical issue. When you study only the "survivors" — successful companies, published studies, recovered patients — you systematically overestimate the probability of success.

This becomes unethical when it influences decisions:

  • "The average startup founder dropped out of college" — but you're only looking at the ones who succeeded. Most college dropouts don't become billionaires.
  • "This treatment cured 90% of patients" — but the 10% who died weren't tracked, or the patients who were too sick to participate were excluded from the study.
  • "These schools have a 100% college acceptance rate" — because students who wouldn't get in were counseled out before applying.
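The selection effect is easy to simulate. In this toy model (every number invented), success is rare and mostly luck, yet conditioning on the survivors makes them look far more exceptional than the typical founder:

```python
import numpy as np

rng = np.random.default_rng(7)

# 10,000 hypothetical founders with a latent "edge" score ~ N(0, 1).
# Success is rare and noisy: edge nudges the odds, luck dominates.
edge = rng.normal(0, 1, 10_000)
p_success = 1 / (1 + np.exp(-(edge - 3)))        # ~5% chance at average edge
succeeded = rng.random(10_000) < p_success

print(f"Survivors:            {succeeded.sum()}")
print(f"Mean edge, everyone:  {edge.mean():+.2f}")
print(f"Mean edge, survivors: {edge[succeeded].mean():+.2f}")
# Studying only survivors badly overstates the typical founder's edge.
```

Any study that samples on the outcome — successful companies, published papers, recovered patients — inherits exactly this bias.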

The Anatomy of a Misleading (But True) Claim

Here's a masterclass in how to lie without lying, using a hypothetical example about a diet pill:

  1. Start with a real study. The pill was tested on 200 people. After 12 weeks, participants lost an average of 2.1 pounds compared to the placebo group.

  2. Cherry-pick the subgroup. In the subgroup of women over 40 (n = 31), the average weight loss was 6.8 pounds.

  3. Choose the right denominator. "Lost 6.8 pounds — that's over half a pound per week!"

  4. Suppress the confidence interval. The 95% CI for the subgroup was (-1.2, 14.8) pounds — not statistically significant.

  5. Use relative language. "3.2× more effective than diet alone!" (because the diet-only group lost 2.1 pounds, and 6.8 / 2.1 = 3.24)

  6. Add a testimonial. Feature the one participant who lost 15 pounds. Don't mention the participants who gained weight.

Every claim in that advertisement is technically true. And the overall impression is completely misleading.

Theme 6 Deep Dive: Ethical Data Practice

This is the ethical heart of statistics. The techniques in this chapter are not exotic — they're mundane. Cherry-picking, denominator games, and selective reporting are so common that we barely notice them anymore. They appear in newspaper articles, corporate earnings calls, political speeches, and yes, in published academic research.

The ethical data practitioner isn't someone who never makes mistakes. It's someone who asks, before every analysis: Am I presenting this data in a way that my audience would find complete and fair? If someone who disagreed with my conclusion saw my full analysis, would they say I was honest?

That's the bar. Not "did I fabricate data" — almost nobody does that. But "did I present it in a way that would mislead a reasonable person?"


27.5 Questionable Research Practices: P-Hacking, HARKing, and the Gray Zone

Spaced Review SR.1: p-values and their misuse (from Ch.13)

Quick check: What does a p-value of 0.03 actually mean?

If you said "there's a 3% chance the null hypothesis is true," go back and re-read Chapter 13, Section 13.6. A p-value is the probability of observing data as extreme as or more extreme than what was observed, assuming the null hypothesis is true. It's P(data | H₀), not P(H₀ | data). Confusing these two is the prosecutor's fallacy from Chapter 9.

The p-value is a useful tool when used correctly. The problems begin when we treat it as something it's not: a measure of truth, a guarantee of importance, or a binary stamp of approval.

In Chapter 13, we introduced p-hacking and the garden of forking paths. In Chapter 17, we explored how p-hacking contributes to the replication crisis. Now let's go deeper into the ecosystem of questionable research practices (QRPs) — techniques that fall short of outright fraud but systematically distort the scientific record.

P-Hacking: A Deeper Look

P-hacking is the practice of trying multiple analyses, data transformations, subgroup selections, or variable definitions until you find a statistically significant result (p < 0.05).

In Chapter 13, we showed that running 20 independent tests at α = 0.05 gives you a 64% chance of finding at least one "significant" result by pure chance. But real-world p-hacking is even more insidious than that, because the researcher often doesn't realize they're doing it.
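The 64% figure is easy to verify, both analytically and by simulation. The simulation leans on one fact: under a true null hypothesis, a p-value is Uniform(0, 1), so we can draw p-values directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Analytic: P(at least one p < 0.05 among 20 independent null tests)
print(f"Analytic:  {1 - 0.95 ** 20:.1%}")   # 64.2%

# Simulation: 10,000 "studies", each running 20 tests of a true null
pvals = rng.random((10_000, 20))
any_hit = (pvals.min(axis=1) < 0.05).mean()
print(f"Simulated: {any_hit:.1%}")
```

Both routes agree: run enough tests and "significance" is nearly guaranteed, even when nothing is there.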

Here's what p-hacking looks like in practice:

The Garden of Forking Paths (Gelman & Loken, 2013)

A researcher collects data on whether a new teaching method improves test scores. Before analyzing, there are dozens of defensible choices:

  • Should I compare mean scores or median scores?
  • Should I include all students or only those who attended every session?
  • Should I control for prior GPA? For gender? For age?
  • Should I use a t-test or a Mann-Whitney U test?
  • Should I remove outliers? How should I define outliers?
  • Should I analyze the full test or just the hardest questions?
  • Is the effect stronger for math questions or reading questions?

Each choice is individually defensible. But if you make each choice after seeing the data, you're effectively running a different analysis for each combination — and the more combinations you try, the more likely you are to find something "significant" by chance.

The fix is conceptually simple: pre-register your analysis plan. Before you collect data — or before you look at it — commit to your hypotheses, your analysis methods, your inclusion criteria, and your primary outcome variable. Then do exactly that analysis, and report the result whether it's significant or not.

Connection to Chapter 17 (Spaced Review SR.3): In Chapter 17, we discussed publication bias — the tendency for journals to publish "significant" results and reject null results. P-hacking and publication bias form a vicious cycle: researchers p-hack to get significant results because journals only publish significant results. The result is a scientific literature in which many published findings are false positives. Chapter 17's solutions — pre-registration, registered reports, and open data — are the antidote.

HARKing: Hypothesizing After Results Are Known

HARKing (Hypothesizing After Results are Known) is the practice of presenting a post-hoc finding — something you discovered by exploring the data — as if it were a hypothesis you had from the beginning.

Here's how it works:

  1. Researcher predicts: "The new drug will lower blood pressure."
  2. The drug doesn't lower blood pressure (p = 0.47). Disappointment.
  3. While exploring the data, the researcher notices that among patients over 65, the drug did lower blood pressure (p = 0.03).
  4. The researcher writes the paper as if they had predicted all along: "We hypothesized that the drug would be effective in elderly patients."

The published paper looks clean and hypothesis-driven. But the hypothesis was generated by the data, not by prior theory. The "significant" finding might easily be a false positive from multiple implicit comparisons.

HARKing is disturbingly common. In one survey, 58% of researchers in psychology admitted to deciding whether to exclude data after looking at the effect on results, and 35% admitted to reporting unexpected findings as if they'd been predicted (John, Loewenstein, & Prelec, 2012).

Why HARKing Is Ethically Problematic:

  • It misrepresents the discovery process
  • It inflates the apparent evidence for a finding (a pre-planned test has more evidential value than a post-hoc one)
  • It hides the multiple comparisons that generated the finding
  • It makes the result look more certain than it is

The ethical alternative: distinguish confirmatory from exploratory analysis. It's perfectly fine to explore your data and discover unexpected patterns. That's part of science. But when you report those patterns, label them as exploratory. Say: "We did not predict this finding; it emerged during exploratory analysis and should be treated as hypothesis-generating rather than hypothesis-confirming."

Other Questionable Research Practices

  • Optional stopping: checking for significance repeatedly as data comes in and stopping when p < 0.05. Problem: inflates the Type I error rate far beyond the nominal α.
  • Selective reporting: running many analyses but reporting only the ones that "worked." Problem: the same dataset can yield many contradictory stories.
  • Rounding p-values: reporting p = 0.054 as "p < 0.05". Problem: misrepresents the evidence.
  • Dropping outliers post-hoc: removing data points only because they hurt your results. Problem: distorts the data to match your hypothesis.
  • Flexible sample sizes: collecting more data only when results aren't significant. Problem: inflates false positive rates.
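Optional stopping deserves special attention because it feels so innocent. This sketch uses a hypothetical setup — a known-variance z-test on null data, peeking after every 10 observations up to 200 — to show how far the real error rate drifts from the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(3)

def peeking_study(max_n=200, peek_every=10):
    """One null study: data ~ N(0, 1); test H0: mean = 0 at every peek."""
    data = rng.normal(0, 1, max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        z = data[:n].mean() * np.sqrt(n)     # z-statistic, known sigma = 1
        if abs(z) > 1.96:                    # nominal two-sided alpha = 0.05
            return True                      # stop early, declare "significant"
    return False

trials = 2000
fp_rate = sum(peeking_study() for _ in range(trials)) / trials
print(f"Nominal alpha: 5%   Actual false positive rate: {fp_rate:.0%}")
```

Every individual test is valid; it's the stopping rule that corrupts the error rate. Sequential designs that allow peeking exist, but they require adjusted thresholds chosen in advance.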

The Replication Crisis: A Wake-Up Call

Spaced Review SR.3: Publication bias and the replication crisis (from Ch.17)

The replication crisis isn't just an abstract concern. In 2015, the Open Science Collaboration attempted to replicate 100 published psychology experiments. Only 36% of the replications achieved statistical significance. The median effect size in replications was half of the original. In cancer biology, the Reproducibility Project found that only 46% of findings replicated.

As we discussed in Chapter 17, the crisis was driven by four interacting factors: underpowered studies, publication bias, p-hacking, and binary threshold thinking. The solutions are systemic: pre-registration, registered reports, open data, open code, and effect size reporting.

The replication crisis has led to a fundamental rethinking of scientific practice. Many journals now require or encourage:

  • Pre-registration on platforms like OSF (Open Science Framework) or AsPredicted.org
  • Registered reports, where peer review happens before data collection — the paper is accepted or rejected based on the question and methods, not the results
  • Open data and open code, so other researchers can verify and extend findings
  • Effect size reporting alongside (or instead of) p-values

These reforms are working. But they require a cultural shift — from rewarding flashy significant results to rewarding rigorous, transparent science.


27.6 Research Ethics: Protecting Human Subjects

Spaced Review SR.2: Experimental ethics — informed consent and IRB (from Ch.4)

In Chapter 4, we introduced the ethical foundations of experimental research: informed consent (participants know what they're agreeing to), voluntary participation (they can leave at any time), and Institutional Review Board (IRB) oversight (an independent body evaluates the ethics of the study before it begins). We also discussed the Tuskegee Syphilis Study and the Facebook emotional contagion experiment.

Now let's go deeper.

The Tuskegee Legacy

The Tuskegee Syphilis Study (1932–1972) remains the defining case in American research ethics. Its legacy reshaped the entire system:

  • 1974: National Research Act — Created the National Commission for the Protection of Human Subjects
  • 1979: Belmont Report — Established three core principles:
      • Respect for persons — Individuals must be treated as autonomous agents; those with diminished autonomy deserve additional protection
      • Beneficence — Research must maximize benefits and minimize harms
      • Justice — The burdens and benefits of research must be distributed fairly
  • 1981: Common Rule (45 CFR 46) — Federal regulations requiring IRB review for all federally funded human subjects research

These principles aren't just history. They're the foundation of every IRB application filed today.

The IRB Process

An Institutional Review Board (IRB) is a committee that reviews research proposals involving human subjects to ensure they meet ethical standards. Every university, hospital, and research institution that receives federal funding must have one.

IRBs evaluate:

  • Risk-benefit ratio: Do the potential benefits of the research justify the risks to participants?
  • Informed consent: Do participants understand what they're agreeing to?
  • Vulnerable populations: Are children, prisoners, pregnant women, or cognitively impaired individuals adequately protected?
  • Confidentiality: Is participant data properly protected?
  • Right to withdraw: Can participants leave the study without penalty?

Most statistical research falls into one of three categories:

Category            Description                              Example
Exempt              Minimal risk, often uses existing data   Analyzing anonymized census data
Expedited           Slightly more than minimal risk          Surveys with non-sensitive questions
Full board review   Greater than minimal risk                Clinical trials, research with vulnerable populations

The Facebook Emotional Contagion Study: A Modern Ethical Crisis

In 2014, researchers at Facebook published a study in Proceedings of the National Academy of Sciences showing that they could manipulate users' emotions by changing their News Feeds. For one week in January 2012, Facebook altered the News Feed of 689,003 users — showing some users more positive posts and others more negative posts — and then measured whether the users' own posts became more positive or negative.

They did. The effect was small (Cohen's d ≈ 0.02) but statistically significant.
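
Why does such a tiny effect clear the significance bar? Sample size. A back-of-the-envelope z-calculation makes this concrete (the even split of users between conditions is our simplifying assumption, not a detail reported by the study):

```python
import math

# Assume, for illustration only, that the 689,003 users were split evenly
n_per_arm = 689_003 // 2

d = 0.02  # reported standardized effect size (Cohen's d)

# Standard error of a standardized mean difference between two groups
se = math.sqrt(1 / n_per_arm + 1 / n_per_arm)
z = d / se

# Two-sided p-value via the normal approximation
p = math.erfc(z / math.sqrt(2))

print(f"z = {z:.1f}, p = {p:.1e}")
```

With roughly 345,000 users per arm, d = 0.02 produces a z-statistic above 8 and a p-value far below any conventional threshold. Statistical significance here says nothing about practical importance, and nothing at all about consent.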

The backlash was immediate and fierce. The core ethical problems:

  1. No informed consent. Users didn't know they were in an experiment. Facebook argued that its Terms of Service covered "research," but a Terms of Service agreement that no one reads is not meaningful consent.

  2. Potential for harm. Deliberately increasing negative emotional content for hundreds of thousands of people — some of whom might be struggling with depression — is not trivial. Even a small effect, applied to 689,003 people, could push vulnerable individuals toward harm.

  3. No IRB review. Facebook's internal review was perfunctory. The academic co-authors obtained IRB approval from Cornell, but only after the data was collected — a clear violation of the spirit (if not the letter) of the regulations.

  4. Asymmetric power. Facebook users had no way to know they were in an experiment, no way to opt out, and no way to evaluate the risks.

Alex's Perspective: Alex Rivera knows this case well. At StreamVibe, the data science team runs hundreds of A/B tests per year. Some are clearly benign — testing button colors, headline wording, thumbnail images. But others manipulate the recommendation algorithm in ways that could affect what content users see for weeks. "Where's the line?" Alex asks. "If we test whether showing more educational content reduces churn, is that ethical? What if showing more outrage content reduces churn? The statistical methodology is identical. The ethics are completely different."

Data Privacy in the Age of Re-identification

Data privacy encompasses the rights of individuals to control how their personal information is collected, stored, used, and shared.

You might think anonymizing data is enough to protect privacy. It's not.

In 2006, Netflix released an "anonymized" dataset of 100 million movie ratings for a data science competition. The data had usernames replaced with random IDs. But researchers Narayanan and Shmatikov showed that they could re-identify individual Netflix users by matching their movie ratings with public IMDb reviews. With just a handful of rating matches, they could link anonymous Netflix profiles to real names.

Re-identification attacks have succeeded on:

  • Hospital discharge data — matched to voter registration records
  • Web browsing histories — matched to social media profiles
  • Genome data — relatives' DNA can identify "anonymous" donors
  • Location data — just four spatiotemporal points can uniquely identify 95% of people

Key Insight: "Anonymous" data is much harder to create than most people think. Any dataset with enough variables can potentially be linked to external information to identify individuals.
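
The linkage attacks above all exploit the same weakness: combinations of innocuous variables are often unique. A minimal uniqueness check on invented toy records (the column names and values below are ours, not from any real dataset) shows the idea:

```python
import pandas as pd

# Toy "anonymized" records: names removed, quasi-identifiers kept.
# (Invented data for illustration only.)
records = pd.DataFrame({
    'zip_code':   ['02138', '02138', '02139', '02139', '02139', '02140'],
    'birth_year': [1985, 1985, 1990, 1990, 1971, 1964],
    'sex':        ['F', 'F', 'M', 'M', 'F', 'M'],
})

quasi_ids = ['zip_code', 'birth_year', 'sex']

# Size of each quasi-identifier group: a record whose combination is
# unique (k = 1) can potentially be re-identified by linkage
k_table = records.groupby(quasi_ids).size().reset_index(name='k')
with_k = records.merge(k_table, on=quasi_ids)

unique_combos = (k_table['k'] == 1).sum()
share_at_risk = (with_k['k'] == 1).mean()

print(f"{unique_combos} of {len(k_table)} combinations are unique; "
      f"{share_at_risk:.0%} of records are singletons")
# → 2 of 4 combinations are unique; 33% of records are singletons
```

In real datasets the fraction of unique combinations is usually far higher; Sweeney's classic result found that ZIP code, full birth date, and sex alone uniquely identify a large majority of Americans.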

Two major laws now govern data privacy:

| Feature | GDPR (EU, 2018) | CCPA (California, 2020) |
| --- | --- | --- |
| Scope | Any organization handling EU residents' data | Businesses handling California residents' data |
| Right to access | Yes — individuals can request all data held about them | Yes |
| Right to deletion | Yes — "right to be forgotten" | Yes |
| Consent requirement | Opt-in (explicit consent required) | Opt-out (can request that their data not be sold) |
| Penalties | Up to 4% of global revenue | Up to $7,500 per intentional violation |

These laws represent a fundamental shift: data about you belongs to you, not to the company that collected it.


27.7 The Ethics of Data-Driven Decision-Making

We've talked about how statistics can be misused. Now let's talk about the harder question: even when the statistics are correct, the decisions based on them can be ethically problematic.

Algorithmic Decision-Making in Criminal Justice

Professor James Washington has been studying this problem for years. The predictive policing algorithm used in Riverside County assigns risk scores from 1 to 10, predicting the likelihood of reoffending. These scores influence bail decisions, pretrial detention, and sentencing.

James's findings, which we've followed since Chapter 1:

  • The algorithm's risk scores predict actual recidivism with $R^2 = 0.85$ — a strong model
  • But the model works much better for white defendants ($R^2 = 0.91$) than for Black defendants ($R^2 = 0.73$)
  • At a risk score of 7 or higher, the false positive rate is 23% for white defendants but 44% for Black defendants
  • In the last two years, an estimated 340 Black defendants were detained pretrial based on over-predicted risk scores
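
Here is a minimal sketch of how a per-group false positive rate is computed. The counts below are invented to reproduce the rates quoted above; they are not James's actual data:

```python
import pandas as pd

# Hypothetical confusion-table counts for defendants who did NOT reoffend,
# chosen to echo the rates in the text (not James's actual data)
counts = pd.DataFrame({
    'group': ['White', 'Black'],
    'flagged_no_reoffense': [161, 132],      # false positives (score >= 7)
    'not_flagged_no_reoffense': [539, 168],  # true negatives
})

# FPR = false positives / all non-reoffenders in that group
counts['fpr'] = counts['flagged_no_reoffense'] / (
    counts['flagged_no_reoffense'] + counts['not_flagged_no_reoffense'])

for _, row in counts.iterrows():
    print(f"{row['group']}: false positive rate = {row['fpr']:.0%}")
# → White: false positive rate = 23%
# → Black: false positive rate = 44%
```

The point of the sketch: a false positive rate is computed only among the people who did not reoffend, so a model's overall accuracy can look strong while this group-specific error rate diverges sharply.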

The statistics are correct. The model is well-built by conventional standards. And yet the consequences of using it fall disproportionately on one racial group.

The Fundamental Question: Is it ethical to use a statistical model to make decisions about individual people's freedom when that model performs differently for different demographic groups?

This question has no easy answer. Here are the competing considerations:

Arguments for using the algorithm:

  • It's more consistent than individual judges, who also have biases
  • It provides a structured framework for decisions that were previously subjective
  • The overall accuracy is high
  • Not using data means falling back on intuition, which has its own biases

Arguments against:

  • The model systematically over-predicts risk for Black defendants
  • No individual should be punished because of statistical patterns in their demographic group
  • The training data reflects historical policing patterns, which themselves reflect structural racism
  • A tool that is 85% accurate overall can still be devastating for the 15% it gets wrong

Maya's Dilemma: Privacy vs. Public Good

Maya Chen is facing a different kind of ethical tension. She's analyzing a dataset that links neighborhood-level environmental exposures (air quality, water contamination, proximity to industrial facilities) to health outcomes (asthma rates, cancer incidence, birth defects). The data could reveal critical public health patterns that would justify regulatory action.

But the dataset is granular enough that some communities — particularly small rural communities with only a few hundred residents — could be identified. Publishing the data could:

  • Benefit: Provide evidence for environmental remediation, identify communities that need resources, hold polluters accountable
  • Harm: Stigmatize communities, reduce property values, expose individuals' health conditions, create anxiety without immediate solutions

Maya is caught between two goods: the public's right to know and the community's right to privacy.

Maya's Reflection: "The data shows that three neighborhoods near the Henderson Chemical plant have childhood asthma rates four times the county average. If I publish this, those families might get help. But I also know that if I identify those neighborhoods by name, every family there will see their property values drop. Is it ethical for me to make that decision for them?"

The Perspective-Taking Framework

When facing an ethical dilemma in data practice, consider all stakeholders:

Perspective-Taking Block: Stakeholders in Data-Driven Decisions

For any data-driven decision, identify:

  1. The analyst — What are my incentives? Am I under pressure to find certain results?
  2. The decision-maker — Who is using this analysis? What decisions will they make?
  3. The subjects — Whose data is being analyzed? Did they consent? Can they be harmed?
  4. The affected community — Who will be impacted by the decision? Were they consulted?
  5. The absent voices — Who is not in the data? Whose perspective is missing?
  6. Future users — How might this data or model be used in ways I didn't intend?

Work through this list for Maya's environmental health dilemma:

  • Maya (the analyst) wants to help but is aware of potential harm
  • The city council (the decision-maker) will use the data for regulatory decisions
  • The residents (the subjects) didn't consent to this specific analysis
  • The Henderson Chemical plant workers (the affected community) might lose their jobs if the plant is shut down
  • The children with asthma (the absent voices) didn't choose to live near a chemical plant
  • Future insurers (future users) might use the data to deny coverage to those communities


27.8 Ethical Frameworks for Data Practice

How do we make these decisions? Philosophy offers three major frameworks, each leading to different conclusions about the same dilemma.

Utilitarian Ethics: The Greatest Good

Utilitarianism asks: Which action produces the greatest good for the greatest number?

Applied to James's algorithm:

  • The algorithm reduces overall pretrial risk assessment errors compared to judicial discretion alone
  • But it produces more errors for one group than another
  • A utilitarian might accept the algorithm if the total number of correct decisions increases, even if some groups bear more of the cost

Applied to Maya's data:

  • Publishing the data could lead to remediation that helps thousands of people
  • The harm to specific communities (stigma, property values) affects hundreds
  • A utilitarian might publish, because the expected benefits outweigh the expected harms

The utilitarian problem: Who counts? Whose "good" matters? And who decides how to weigh one person's freedom against another person's safety?

Rights-Based Ethics: Individual Dignity

Rights-based (deontological) ethics asks: Does this action respect the fundamental rights of every individual?

Applied to James's algorithm:

  • Every defendant has a right to be judged as an individual, not as a member of a demographic group
  • Using a model that assigns higher risk scores based on factors correlated with race violates the principle of individual treatment
  • A rights-based ethicist might reject the algorithm even if it improves overall accuracy, because it violates individual dignity

Applied to Maya's data:

  • Community members have a right to privacy
  • They have a right to make informed decisions about their own health (which requires access to the data)
  • These rights conflict — and resolving the conflict requires balancing competing claims

The rights-based problem: When rights conflict, which right takes priority? The right to privacy or the right to health information?

Care Ethics: Relationships and Responsibility

Care ethics asks: What response best maintains trust, relationships, and responsibility to the most vulnerable?

Applied to James's algorithm:

  • The most vulnerable people in this system are defendants who are wrongly detained — especially those from communities historically harmed by the justice system
  • Care ethics would prioritize those specific voices rather than aggregate outcomes
  • It might lead to rejecting the algorithm, modifying it, or adding human oversight

Applied to Maya's data:

  • Care ethics would center the communities themselves: have they been consulted? Do they want the data published? What support structures are in place?
  • It would prioritize the children with asthma as the most vulnerable stakeholders
  • It might lead to publishing the data but only after community consultation and with support resources attached

No Framework Is Complete: Real ethical decisions in data practice rarely fit neatly into one framework. The best practitioners draw from multiple frameworks: use utilitarian reasoning to assess consequences, rights-based reasoning to protect individuals, and care ethics to center the most vulnerable.

Applying the Frameworks

| Dilemma | Utilitarian | Rights-Based | Care Ethics |
| --- | --- | --- | --- |
| Predictive policing algorithm | Use if overall accuracy improves | Reject if it violates individual treatment | Modify to protect most vulnerable |
| Publishing health data | Publish if benefits > harms | Publish only with consent | Publish after community consultation |
| A/B testing emotions | Accept if user experience improves | Reject — users didn't consent | Reject — targets vulnerable users |
| Cherry-picking statistics | Wrong (misleads decision-makers, reducing social welfare) | Wrong (violates audience's right to accurate information) | Wrong (betrays trust) |

27.9 Theme 5: Correlation vs. Causation as Ethical Imperative

Theme 5 Connection: Correlation ≠ Causation Is Not Just a Statistical Principle. It's an Ethical One.

Throughout this textbook, we've stressed that correlation does not imply causation. In Chapter 4, we introduced the concept. In Chapter 22, we gave it the full treatment with ice cream and drowning, shoe size and reading level, cheese consumption and engineering PhDs.

But here's what we haven't said explicitly enough: treating correlations as causal claims can hurt people.

When James finds that a neighborhood's crime rate correlates with its racial composition, and a policymaker says "this proves that race causes crime," that's not just bad statistics. It's a moral failure that reinforces structural racism.

When Maya finds that poverty correlates with poor health outcomes, and a politician says "poor people make bad health choices," that's not just a failure to consider confounders. It's an ethical violation that blames victims instead of addressing systemic causes.

When Sam reads an article claiming "violent video games cause aggression" based on a cross-sectional correlation of r = 0.15, and a school board bans video games, that's not just a misinterpretation of a weak association. It's a policy decision built on a logical error.
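
A one-line calculation shows how weak that association is: squaring the correlation gives the proportion of variance the two variables share.

```python
r = 0.15  # the reported cross-sectional correlation

# The coefficient of determination: the share of variance in one
# variable that is statistically accounted for by the other
r_squared = r ** 2

print(f"r = {r}  ->  r^2 = {r_squared:.4f}")
```

An r of 0.15 corresponds to roughly 2% shared variance; about 98% of the variation in aggression is statistically unrelated to gaming, before the causation question is even raised.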

Correlation vs. causation is the ethical imperative of data literacy. Every time you present a correlation, you have a responsibility to be clear about what it does and doesn't prove.


27.10 Debate Framework: Is It Ethical to Use Data to Make Decisions About Individuals?

This is the central ethical question of our data-driven age. Let's debate it properly.

Structured Debate: Algorithmic Decision-Making About Individuals

Resolution: "It is ethical to use statistical models to make consequential decisions about individuals (e.g., bail, hiring, lending, insurance) when those models are more accurate than human judgment alone."

Side A (Pro):

  • Human decision-makers are subject to biases that algorithms can reduce
  • Models are transparent and auditable; human intuition is a black box
  • Statistical accuracy means fewer errors overall, even if some errors persist
  • Not using available data when it could improve outcomes is itself an ethical failure

Side B (Con):

  • Individuals have a right to be judged as individuals, not as members of statistical groups
  • "More accurate overall" can mask systematic unfairness to specific groups (see: James's algorithm)
  • Historical data encodes historical injustice; models trained on biased data perpetuate bias
  • Algorithmic decisions create an illusion of objectivity that makes unfairness harder to challenge

Side C (Nuanced):

  • It depends on the stakes. A movie recommendation is different from a bail decision.
  • It depends on the oversight. Models with human review differ from fully automated systems.
  • It depends on the alternatives. If the choice is between a biased algorithm and a biased human, the question is which bias can be measured, monitored, and corrected.
  • It depends on who's affected. Communities should have a voice in whether algorithms are used to make decisions about them.

Your Task: Choose a side. Write a 300-word argument. Then write a 200-word response to the strongest argument on the other side. The goal is not to "win" but to think carefully.


27.11 A Personal Code of Statistical Ethics

Here's the most important exercise in this chapter. Possibly the most important exercise in this textbook.

I want you to write a personal code of statistical ethics. Not a corporate policy. Not an abstract manifesto. A set of principles that you will follow when you work with data.

To get you started, here are some principles drawn from the American Statistical Association's Ethical Guidelines (2022), combined with the lessons of this chapter:

The ASA Ethical Principles (Adapted)

  1. Professional integrity. Present statistical findings honestly. Don't cherry-pick, p-hack, or HARK.

  2. Responsibility to subjects. Protect the privacy and well-being of people whose data you analyze. Obtain informed consent when possible. Consider re-identification risks.

  3. Responsibility to science. Support reproducibility. Share data and code when possible. Report null results alongside significant ones.

  4. Responsibility to funders and clients. Be clear about what the data can and cannot show. Don't oversell conclusions. Report limitations.

  5. Responsibility to the public. Present data in ways that inform rather than mislead. Consider how your analysis might be misused by others.
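
Principle 1 has a quantitative basis worth internalizing. The short simulation below (our own illustration, not part of the ASA guidelines) runs many two-sample tests on pure noise and reports how often at least one comes out "significant"; this is the arithmetic behind p-hacking.

```python
import math
import numpy as np

rng = np.random.default_rng(27)

def min_p_of_many_tests(n_tests=20, n_per_group=50):
    """Smallest p-value across n_tests two-sample comparisons of pure noise."""
    a = rng.standard_normal((n_tests, n_per_group))
    b = rng.standard_normal((n_tests, n_per_group))
    se = np.sqrt(a.var(axis=1, ddof=1) / n_per_group +
                 b.var(axis=1, ddof=1) / n_per_group)
    z = np.abs(a.mean(axis=1) - b.mean(axis=1)) / se
    # Two-sided p-values via the normal approximation
    p = np.array([math.erfc(zi / math.sqrt(2)) for zi in z])
    return p.min()

n_experiments = 2_000
hits = sum(min_p_of_many_tests() < 0.05 for _ in range(n_experiments))
rate = hits / n_experiments

print(f"P(at least one 'significant' result in 20 tests of noise): {rate:.0%}")
```

With 20 looks at noise, a "significant" finding appears about two-thirds of the time (theory: 1 − 0.95²⁰ ≈ 0.64). Reporting only that one test is precisely what Principle 1 forbids.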

Drafting Your Code

Use the following framework to draft your own code (aim for 8–12 principles):

My Statistical Ethics Code

When collecting data, I will...

  • [e.g., Obtain informed consent whenever possible]
  • [e.g., Be transparent about what data I'm collecting and why]

When analyzing data, I will...

  • [e.g., Pre-register my analysis plan for confirmatory research]
  • [e.g., Report all analyses I conducted, not just the significant ones]

When reporting results, I will...

  • [e.g., Include effect sizes and confidence intervals, not just p-values]
  • [e.g., Acknowledge limitations and alternative explanations]

When making decisions based on data, I will...

  • [e.g., Consider who might be harmed by my analysis]
  • [e.g., Seek perspectives from the communities affected by my work]

I will not...

  • [e.g., Cherry-pick data, time ranges, or subgroups to support a preferred conclusion]
  • [e.g., Present correlations as causal claims without evidence of causation]

This isn't just a class exercise. If you pursue any career that involves data — which is virtually every career in the 21st century — you will face situations where doing the right thing is harder than doing the expedient thing. Having thought about your principles before you're under pressure makes it more likely you'll follow them when it matters.


27.12 Python: Detecting Simpson's Paradox

Let's build a tool that detects potential Simpson's paradox in your data. This is the one technique for this chapter: a systematic check for aggregation effects.

import pandas as pd
import numpy as np

# ============================================================
# SIMPSON'S PARADOX DETECTOR
# Checks if a relationship reverses when stratified by a third variable
# ============================================================

def check_simpsons_paradox(df, outcome, group, stratify_by):
    """
    Check for Simpson's paradox by comparing aggregate and
    stratified relationships.

    Parameters:
    -----------
    df : DataFrame
        Data with columns for outcome, group, and stratifying variable
    outcome : str
        Name of the outcome column (numeric or binary 0/1)
    group : str
        Name of the group column (categorical with 2 levels)
    stratify_by : str
        Name of the stratifying variable (categorical)

    Returns:
    --------
    Dictionary with aggregate and stratified results, plus
    a flag indicating potential Simpson's paradox
    """
    results = {}

    # Aggregate comparison
    agg = df.groupby(group)[outcome].mean()
    groups = sorted(agg.index)
    agg_diff = agg[groups[1]] - agg[groups[0]]
    results['aggregate'] = {
        'means': agg.to_dict(),
        'difference': agg_diff,
        'favors': groups[1] if agg_diff > 0 else groups[0]
    }

    # Stratified comparison
    results['strata'] = {}
    reversal_count = 0
    for stratum in df[stratify_by].unique():
        subset = df[df[stratify_by] == stratum]
        strat_means = subset.groupby(group)[outcome].mean()
        strat_diff = strat_means[groups[1]] - strat_means[groups[0]]
        favors = groups[1] if strat_diff > 0 else groups[0]

        # Check if direction reverses
        if (agg_diff > 0 and strat_diff < 0) or \
           (agg_diff < 0 and strat_diff > 0):
            reversal_count += 1

        results['strata'][stratum] = {
            'means': strat_means.to_dict(),
            'difference': strat_diff,
            'favors': favors,
            'n': len(subset)
        }

    # Verdict
    total_strata = len(results['strata'])
    results['paradox_detected'] = reversal_count > total_strata / 2
    results['reversal_count'] = reversal_count
    results['total_strata'] = total_strata

    return results


def report_simpsons_check(results, outcome, group, stratify_by):
    """Print a human-readable report of the Simpson's paradox check."""
    print("=" * 60)
    print("SIMPSON'S PARADOX CHECK")
    print("=" * 60)

    agg = results['aggregate']
    print(f"\nAggregate: {group} comparison on '{outcome}'")
    for g, m in agg['means'].items():
        print(f"  {g}: {m:.4f}")
    print(f"  Difference: {agg['difference']:+.4f}")
    print(f"  Aggregate favors: {agg['favors']}")

    print(f"\nStratified by '{stratify_by}':")
    for stratum, info in results['strata'].items():
        direction = "REVERSED" if (
            (agg['difference'] > 0 and info['difference'] < 0) or
            (agg['difference'] < 0 and info['difference'] > 0)
        ) else "consistent"
        print(f"  {stratum} (n={info['n']}): "
              f"diff = {info['difference']:+.4f} "
              f"[{direction}]")

    # Computed outside the f-string: backslash escapes inside f-string
    # expressions are a syntax error before Python 3.12
    verdict = ("⚠ SIMPSON'S PARADOX DETECTED" if results['paradox_detected']
               else "✓ No paradox detected")
    print(f"\n{verdict}")
    print(f"  Reversals: {results['reversal_count']} / "
          f"{results['total_strata']} strata")
    if results['paradox_detected']:
        print("\n  WARNING: The aggregate trend reverses in most strata.")
        print("  The aggregate statistic may be misleading.")
        print("  Report BOTH aggregate and stratified results.")


# ============================================================
# EXAMPLE: UC BERKELEY ADMISSIONS
# ============================================================

# Create the dataset

admissions = pd.DataFrame({
    'Gender': (['Male'] * 825 + ['Female'] * 108 +
               ['Male'] * 560 + ['Female'] * 593 +
               ['Male'] * 325 + ['Female'] * 417 +
               ['Male'] * 417 + ['Female'] * 375),
    'Department': (['A'] * 933 + ['B'] * 1153 +
                   ['C'] * 742 + ['D'] * 792),
    'Admitted': (
        [1]*512 + [0]*313 +   # Male Dept A
        [1]*89  + [0]*19  +   # Female Dept A
        [1]*201 + [0]*359 +   # Male Dept B
        [1]*202 + [0]*391 +   # Female Dept B
        [1]*120 + [0]*205 +   # Male Dept C
        [1]*132 + [0]*285 +   # Female Dept C
        [1]*138 + [0]*279 +   # Male Dept D
        [1]*131 + [0]*244     # Female Dept D
    )
})

results = check_simpsons_paradox(
    admissions, 'Admitted', 'Gender', 'Department'
)
report_simpsons_check(results, 'Admitted', 'Gender', 'Department')
Output:

============================================================
SIMPSON'S PARADOX CHECK
============================================================

Aggregate: Gender comparison on 'Admitted'
  Female: 0.3711
  Male: 0.4565
  Difference: +0.0854
  Aggregate favors: Male

Stratified by 'Department':
  A (n=933): diff = -0.2035 [REVERSED]
  B (n=1153): diff = +0.0183 [consistent]
  C (n=742): diff = +0.0527 [consistent]
  D (n=792): diff = -0.0184 [REVERSED]

✓ No paradox detected
  Reversals: 2 / 4 strata

Note

With these four departments, only two show a reversal, which is not a majority, so the detector doesn't flag a full paradox even though Department A's reversal is dramatic. The original Bickel et al. study used six departments, four of which showed women admitted at equal or higher rates. The detector is a starting point — always examine the stratified results even when no paradox is flagged.

Applying the Detector to Your Portfolio Project

# ============================================================
# PORTFOLIO APPLICATION
# Check for Simpson's paradox in YOUR dataset
# ============================================================

# Replace with your actual data and variable names
# results = check_simpsons_paradox(
#     your_data,
#     outcome='your_outcome_variable',
#     group='your_comparison_variable',
#     stratify_by='potential_confounder'
# )
# report_simpsons_check(
#     results,
#     'your_outcome_variable',
#     'your_comparison_variable',
#     'potential_confounder'
# )

# Questions to answer in your ethics section:
# 1. Could the aggregate trend in your data reverse at the subgroup level?
# 2. Who might be harmed if the aggregate data were used for decisions?
# 3. What confounders should a reader be aware of?

27.13 Theme 2: Human Stories Behind the Data

Theme 2 Connection: Every Data Point Is a Person

This theme has run through the entire textbook: from the missing race data in COVID-19 datasets (Chapter 7) to the 340 Black defendants wrongly detained by James's algorithm (Chapter 25). But it deserves a final, explicit statement here.

Statistics abstracts people into numbers. That's its power — it lets us see patterns that are invisible at the individual level. But that abstraction comes with a responsibility: to remember that the numbers represent real people with real lives.

When you calculate a false positive rate of 44%, remember that each false positive is a person who was detained when they shouldn't have been.

When you aggregate health data into a state-level statistic, remember that the "data points" are children with asthma, parents with cancer, communities with contaminated water.

When you drop outliers from your dataset, ask: who are those outliers? Are they the very people your analysis should be serving?

The best statisticians hold two truths simultaneously: data must be analyzed objectively, and data represents human lives that deserve dignity and care.


27.14 Progressive Project: Add an Ethics Section to Your Portfolio

It's time to add the most important section of your Data Detective Portfolio: the ethics section.

Project Checkpoint: Chapter 27

Add a new section to your portfolio titled "Ethics, Limitations, and Responsible Use." This section should include:

Part 1: Data Collection Ethics (400–500 words)

  • How was your dataset collected? Was informed consent obtained?
  • Who is represented in the data? Who is missing?
  • Could individuals be identified from the data, even if names are removed?
  • What biases might exist in the data collection process?

Part 2: Analysis Ethics (400–500 words)

  • Did you pre-specify your analysis plan, or did you explore the data first?
  • If you conducted exploratory analyses, label them as such
  • Report any analyses you tried that didn't work (null results)
  • Check for Simpson's paradox: could your main finding reverse at the subgroup level?
  • Run the check_simpsons_paradox() function on your primary comparison

Part 3: Reporting Ethics (300–400 words)

  • What are the limitations of your analysis?
  • How could your results be misinterpreted or misused?
  • Who could be harmed by your findings?
  • What would you want a reader to know before acting on your conclusions?

Part 4: Your Personal Code (8–12 principles)

  • Write a personal code of statistical ethics based on Section 27.11
  • Apply at least three of your principles to specific decisions you made in your portfolio

Example: "My principle: 'Report all analyses, not just the significant ones.' In my portfolio, I tested three hypotheses. Only one was significant (p = 0.018). I reported all three, including the two null results."


27.15 What's Next?

This chapter asked you to think about statistics in a way that no formula can capture. You've learned about Simpson's paradox and the ecological fallacy — two ways that data can tell true stories that are deeply misleading. You've revisited p-hacking and HARKing with fresh eyes, seeing them not just as methodological problems but as ethical violations. You've applied ethical frameworks — utilitarian, rights-based, and care ethics — to real dilemmas in public health, criminal justice, and technology. And you've drafted a personal code that will guide your practice long after you've forgotten the formula for a z-test.

In Chapter 28, the final chapter, we'll step back and look at the full arc of your statistical journey. We'll revisit each major concept from Chapters 1–27, show how they connect, and point you toward what comes next — whether that's more statistics, data science, machine learning, or simply being a more informed citizen in a data-driven world.

You've come an extraordinary distance. One chapter remains.

Chapter 27 Summary:

  • Simpson's paradox shows that aggregated data can tell the opposite story from disaggregated data — always check both levels
  • The ecological fallacy warns against applying group-level statistics to individuals
  • P-hacking, HARKing, and cherry-picking are questionable research practices that distort the scientific record
  • Research ethics require informed consent, IRB oversight, and protection of privacy
  • Data privacy is harder to achieve than most people think — re-identification is a real risk
  • Ethical frameworks (utilitarian, rights-based, care ethics) provide different lenses for evaluating data-driven decisions
  • Correlation vs. causation is not just a statistical principle — it's an ethical imperative
  • A personal code of statistical ethics prepares you to make good decisions under pressure


Key Terms

| Term | Definition |
| --- | --- |
| p-hacking | Trying multiple analyses until finding a statistically significant result; inflates false positive rates |
| HARKing | Hypothesizing After Results are Known — presenting post-hoc findings as if they were predicted |
| cherry-picking | Selecting data, time ranges, or subgroups that support a preferred conclusion while ignoring contradictory evidence |
| Simpson's paradox | A trend in aggregated data that reverses or disappears when the data is broken into subgroups |
| ecological fallacy | Drawing conclusions about individuals based on group-level data |
| data privacy | The rights of individuals to control how their personal information is collected, stored, used, and shared |
| informed consent | Participants' knowing agreement to participate in research after being told about risks, benefits, and procedures |
| IRB (Institutional Review Board) | A committee that reviews research proposals involving human subjects to ensure ethical standards are met |
| reproducibility crisis | The finding that many published scientific results cannot be replicated, driven by underpowered studies, p-hacking, and publication bias |
| open data | Making research data publicly available so others can verify, replicate, and extend findings |