Case Study 1: Maya's Public Health Data Dilemma — Privacy vs. Public Good
The Setup
Dr. Maya Chen is staring at her laptop in the county health department's conference room, and she can't decide what to do.
Over the past year, Maya has been building a comprehensive environmental health dataset for Riverside County. She's merged air quality sensor data, water testing results, industrial discharge permits, census demographics, and hospital records into a single dataset covering 47 communities across the county. The dataset is powerful — it links specific environmental exposures to specific health outcomes at the community level.
And it tells a deeply troubling story.
Three communities near the Henderson Chemical plant (Millbrook Heights, River Bend, and Oakdale Flats) have childhood asthma rates 4.1 times the county average. The correlation between distance from the Henderson plant and respiratory illness is striking: $r = -0.89$ (the sign is negative because illness rates fall as distance from the plant increases, so closer means more illness). After controlling for income, smoking rates, and access to healthcare, distance from the plant remains a significant predictor of childhood asthma ($b = -2.7$, $p < 0.001$): each mile closer to the plant is associated with 2.7 additional childhood asthma diagnoses per 1,000 children per year.
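To make the regression coefficient concrete, the short sketch below converts the reported slope into an expected difference between two communities. Only the slope of -2.7 comes from the analysis; the function name and the example distances (1 and 3 miles) are illustrative assumptions, not values from Maya's dataset.

```python
# Slope reported in the case study: change in annual childhood asthma
# diagnoses per 1,000 children for each additional mile from the plant.
SLOPE_PER_MILE = -2.7


def expected_difference(distance_a_miles, distance_b_miles, slope=SLOPE_PER_MILE):
    """Expected difference (A minus B) in diagnoses per 1,000 children per year,
    holding the control variables (income, smoking, healthcare access) fixed."""
    return slope * (distance_a_miles - distance_b_miles)


# Hypothetical example: community A is 1 mile from the plant, community B is 3 miles.
diff = expected_difference(1.0, 3.0)
print(f"Community A is expected to have {diff:.1f} more diagnoses "
      f"per 1,000 children per year than community B.")
```

Because the model is observational, this figure describes an expected difference between otherwise similar communities, not a proven causal effect of the plant.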
Maya's analysis also reveals an environmental justice dimension: the three most affected communities are disproportionately low-income and communities of color. Millbrook Heights is 72% Black and Hispanic, with a median household income of $34,200. The county average is $61,800.
The county's environmental health board has asked Maya to present her findings at next month's meeting. They want to use her data to apply for a federal environmental remediation grant — a $4.2 million opportunity that could fund air filtration systems in schools, community health screenings, and potentially force regulatory action against the Henderson plant.
There's just one problem.
The Ethical Tension
Maya's dataset is granular enough that publishing it — even without individual names — could effectively identify specific families.
Millbrook Heights has only 847 residents. The dataset includes age ranges, gender, race, and diagnosis codes. In a community that small, a record showing "Hispanic female, age 8-12, asthma diagnosis, 2024" could potentially be linked to a specific child. Maya knows from Latanya Sweeney's research that 87% of Americans can be uniquely identified by date of birth, zip code, and gender alone. Her data doesn't include exact birth dates, but it includes enough that re-identification is possible in small communities.
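One way to gauge this re-identification risk is to count how many records share each combination of quasi-identifiers; combinations held by very few people are the risky ones. The sketch below does this with pandas on a tiny made-up table. The column names, the records, and the threshold `K = 5` are all assumptions for illustration and are not drawn from Maya's dataset.

```python
import pandas as pd

# Hypothetical records: every value here is invented for illustration.
records = pd.DataFrame({
    "community": ["Millbrook Heights"] * 6,
    "age_range": ["8-12", "8-12", "8-12", "13-17", "13-17", "8-12"],
    "gender":    ["F", "F", "M", "M", "M", "F"],
    "race":      ["Hispanic", "Black", "Black", "Black", "Black", "Black"],
    "diagnosis": ["asthma"] * 6,
})

QUASI_IDENTIFIERS = ["community", "age_range", "gender", "race"]
K = 5  # assumed minimum group size; real thresholds are a policy choice

# Number of records sharing each quasi-identifier combination.
group_sizes = (
    records.groupby(QUASI_IDENTIFIERS)
    .size()
    .reset_index(name="n_records")
)

# Combinations below the threshold could point to a specific person.
risky = group_sizes[group_sizes["n_records"] < K]
print(risky.to_string(index=False))
```

In this toy table every combination falls below the threshold, which is the situation Maya fears in a community of 847 people: a record can be stripped of names and still describe only one or two children.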
And re-identification could cause real harm:
- Stigma. If the communities are publicly identified as "contaminated zones," residents could face discrimination. Employers, landlords, and insurers might treat them differently.
- Property values. Real estate in these communities could decline, trapping residents who want to leave. Homeowners — many of whom are people of color who struggled to build equity — could lose their primary source of wealth.
- Anxiety. Parents in these communities would learn that their children have been breathing contaminated air for years. Some might panic. Some might take children to the ER unnecessarily, straining already limited healthcare resources.
- Political backlash. The Henderson Chemical plant employs 340 people in the county. Publishing data that could lead to regulatory action or plant closure could cost jobs and provoke community opposition.
On the other hand, not publishing creates its own harms:
- Continued exposure. Without public data, there's no political pressure for remediation. Children continue breathing contaminated air.
- Missed funding. The $4.2 million federal grant requires evidence of environmental health disparities — evidence that Maya's data provides.
- Justice delayed. The affected communities have a right to know about the environmental hazards in their neighborhoods. Withholding that information, even to protect them from stigma, is a form of paternalism.
Analyzing the Dilemma
Stakeholder Map
import pandas as pd
# ============================================================
# STAKEHOLDER ANALYSIS: Maya's Environmental Health Dilemma
# Mapping interests, potential benefits, and potential harms
# ============================================================
stakeholders = pd.DataFrame({
    'Stakeholder': [
        'Children with asthma (Millbrook, River Bend, Oakdale)',
        'Parents in affected communities',
        'Henderson Chemical plant workers (340 employees)',
        'County environmental health board',
        'Federal grant administrators',
        'Maya (the analyst)',
        'Property owners in affected communities',
        'Insurance companies',
        'Environmental advocacy groups',
        'Henderson Chemical management'
    ],
    'Primary Interest': [
        'Health and safety',
        "Children's health; property values; community identity",
        'Job security and stable employment',
        'Evidence-based policy; accountability',
        'Data-driven funding decisions',
        'Scientific integrity; protecting subjects',
        'Property value; right to know',
        'Accurate risk assessment',
        'Environmental justice; regulatory action',
        'Avoiding liability and regulation'
    ],
    'If Published': [
        'May get health interventions',
        'Informed but anxious; property values may drop',
        'Risk of plant closure / regulation',
        'Strong grant application; regulatory basis',
        'Can evaluate and fund remediation',
        'Fulfills scientific duty; may cause harm',
        'Values drop; but informed about risks',
        'Can adjust premiums (problematic)',
        'Strong evidence for advocacy',
        'Faces regulatory pressure'
    ],
    'If Not Published': [
        'Continued exposure without intervention',
        'Uninformed about risk to children',
        'Jobs preserved in short term',
        'Weak grant application; no data for policy',
        'Cannot evaluate need',
        'Avoids causing harm; fails duty to inform',
        'Values stable; risk unknown',
        'No change',
        'No evidence for advocacy',
        'No pressure to change'
    ]
})
print("=" * 70)
print("STAKEHOLDER ANALYSIS")
print("=" * 70)
# Print one block per stakeholder: interest, then both outcomes
for _, row in stakeholders.iterrows():
    print(f"\n{row['Stakeholder']}")
    print(f"  Interest:     {row['Primary Interest']}")
    print(f"  If published: {row['If Published']}")
    print(f"  If not:       {row['If Not Published']}")
The Utilitarian Analysis
A utilitarian asks: which choice produces more total good?
Publishing (benefits):
- Potential health improvements for ~2,500 residents in three communities
- $4.2 million in federal remediation funding
- Environmental justice for communities historically ignored
- Long-term reduction in healthcare costs
- Regulatory accountability

Publishing (harms):
- Property value decline for ~400 homeowners
- Anxiety and stigma for ~2,500 residents
- Potential job losses for ~340 plant workers
- Community conflict

Not publishing (benefits):
- No property value decline
- No anxiety or stigma
- No job losses

Not publishing (harms):
- Continued environmental exposure for ~2,500 residents, including ~600 children
- No remediation funding
- No accountability for the polluter
- Perpetuates environmental injustice

A utilitarian calculus would likely favor publishing — the health benefits to thousands of people, including children, outweigh the financial harms to property owners and workers. But the utilitarian framework raises uncomfortable questions: Are we authorized to impose short-term financial harm on people for their long-term health benefit? Who bears the cost of this decision?
The Rights-Based Analysis
A rights-based ethicist asks: are each person's fundamental rights respected?
Right to health information: Residents have a right to know about environmental hazards that affect their families. Withholding this information violates their autonomy.
Right to privacy: Residents have a right to keep their health conditions private. Publishing data that could identify specific individuals violates this right.
Right to informed consent: The health data was collected for clinical purposes, not for environmental research. Using it for a different purpose without consent is ethically problematic.
The rights-based analysis reveals a genuine conflict: the right to health information and the right to privacy pull in opposite directions. Neither right automatically trumps the other.
The Care Ethics Analysis
A care ethicist asks: what response best serves the most vulnerable?
The most vulnerable stakeholders are the children with asthma — they didn't choose where to live, they can't move on their own, and they're bearing the physical consequences of environmental contamination. A care ethics approach would center these children and ask: what do they need?
They need clean air. They need health screenings. They need treatment. And they need someone to advocate for them.
A care ethics approach would likely favor publishing — but with significant protections for the communities involved.
Maya's Decision
# ============================================================
# MAYA'S APPROACH: A MIDDLE PATH
# Publishing with protections
# ============================================================
print("=" * 70)
print("MAYA'S DECISION FRAMEWORK")
print("=" * 70)
# Each entry pairs a step title with its explanation
decisions = [
    ("1. Community consultation FIRST",
     "Before publishing anything, Maya meets with community leaders\n"
     "   in all three neighborhoods. She presents her findings privately,\n"
     "   explains both the potential benefits and risks, and asks:\n"
     "   'What do you want the world to know about your community?'"),
    ("2. Data aggregation for privacy",
     "Maya reports community-level statistics (asthma rates per 1,000)\n"
     "   rather than individual-level data. She combines age groups\n"
     "   and uses suppression rules: any cell with fewer than 10\n"
     "   individuals is replaced with '<10' to prevent identification."),
    ("3. Contextual publication",
     "Maya publishes the data alongside context: the environmental\n"
     "   exposures, the correlation with the Henderson plant, and\n"
     "   the available interventions. She frames the communities as\n"
     "   'harmed by environmental injustice', not as 'contaminated.'"),
    ("4. Actionable recommendations",
     "Every data point is paired with a specific recommendation:\n"
     "   air filtration for schools, free asthma screenings,\n"
     "   remediation timelines. The data serves a purpose, not\n"
     "   just a revelation."),
    ("5. Ongoing consent and transparency",
     "Maya establishes a community advisory board with residents\n"
     "   from all three communities. Any future analyses using\n"
     "   their data require the board's input.")
]
for title, detail in decisions:
    print(f"\n{title}")
    print(f"   {detail}")
print("\n" + "=" * 70)
print("KEY PRINCIPLE: Data about communities should serve those")
print("communities, not just describe them.")
print("=" * 70)
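The suppression rule in step 2 can be sketched in a few lines of pandas. The table below is invented for illustration (the counts and age bands are not Maya's data), and `suppress_small_cells` is a hypothetical helper name; only the threshold of 10 comes from the rule Maya describes.

```python
import pandas as pd

SUPPRESSION_THRESHOLD = 10  # from Maya's rule: cells under 10 are masked


def suppress_small_cells(counts, threshold=SUPPRESSION_THRESHOLD):
    """Return a copy where counts below the threshold are shown as '<threshold'."""
    return counts.apply(lambda n: f"<{threshold}" if n < threshold else str(n))


# Hypothetical community-by-age-group case counts (invented values).
table = pd.DataFrame({
    "community": ["Millbrook Heights", "Millbrook Heights", "River Bend", "River Bend"],
    "age_group": ["0-9", "10-17", "0-9", "10-17"],
    "asthma_cases": [34, 7, 12, 3],
})

table["published_cases"] = suppress_small_cells(table["asthma_cases"])
print(table[["community", "age_group", "published_cases"]].to_string(index=False))
```

In practice, a suppressed cell can sometimes be recovered by subtracting the visible cells from a published total, so row and column totals usually need complementary suppression as well.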
The Simpson's Paradox Connection
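Maya's data also reveals an aggregation effect closely related to Simpson's paradox, and it deepens the ethical complexity. (In the strict sense, Simpson's paradox means a trend reverses direction when groups are combined. Nothing reverses here, but the aggregate conceals what is happening inside the subgroups in the same way.)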
When she looks at Riverside County's overall childhood asthma rate, it's 8.2% — slightly above the national average of 7.5% but not alarming. A county commissioner who saw only this aggregate number might say, "We're basically average. No crisis here."
But when Maya stratifies by neighborhood proximity to industrial facilities:
| Proximity | Communities | Children | Asthma Rate |
|---|---|---|---|
| Within 2 miles | 3 | ~600 | 33.7% |
| 2-5 miles | 8 | ~2,100 | 9.4% |
| 5+ miles | 36 | ~8,300 | 5.1% |
The aggregate 8.2% hides a crisis. The three communities nearest the Henderson plant have an asthma rate roughly four times the county average, but their small population is swamped in the aggregate by the much larger number of children in distant communities.
# ============================================================
# SIMPSON'S PARADOX IN MAYA'S DATA
# Aggregate hides the environmental health crisis
# ============================================================
proximity = ['Within 2 miles', '2-5 miles', '5+ miles']
children = [600, 2100, 8300]      # approximate counts from the table
asthma_cases = [202, 197, 423]
rates = [c / n * 100 for c, n in zip(asthma_cases, children)]

# Aggregate rate across all strata.
# Note: these rounded counts give about 7.5%, a little below the 8.2%
# quoted in the text.
total_children = sum(children)
total_cases = sum(asthma_cases)
agg_rate = total_cases / total_children * 100

print("=" * 55)
print("AGGREGATE VIEW: 'Everything looks fine'")
print("=" * 55)
print(f"County childhood asthma rate: {agg_rate:.1f}%")
print("National average: 7.5%")
print("Status: Close to the national average; not alarming")

print(f"\n{'=' * 55}")
print("STRATIFIED VIEW: 'There's a crisis hiding in the average'")
print("=" * 55)
for p, n, c, r in zip(proximity, children, asthma_cases, rates):
    print(f"  {p:20s}: {r:5.1f}% "
          f"({c:3d} cases / {n:,d} children)")

print(f"\n  The aggregate {agg_rate:.1f}% masks a {rates[0]:.1f}% rate in")
print("  communities nearest to the Henderson plant.")
print("\n  This is the aggregation trap behind Simpson's paradox:")
print("  the aggregate rate looks 'average' because the small,")
print("  high-risk communities are overwhelmed by the larger,")
print("  low-risk population.")
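To see why the distant communities dominate the aggregate, it helps to write the county rate as a population-weighted average of the stratum rates. The sketch below reuses the rounded counts from the proximity table; the variable names are illustrative.

```python
# Rounded counts from the proximity table.
strata = {
    "Within 2 miles": {"children": 600, "cases": 202},
    "2-5 miles": {"children": 2100, "cases": 197},
    "5+ miles": {"children": 8300, "cases": 423},
}

total_children = sum(s["children"] for s in strata.values())

weighted_sum = 0.0
for name, s in strata.items():
    weight = s["children"] / total_children   # share of county children
    rate = s["cases"] / s["children"]         # stratum asthma rate
    contribution = weight * rate              # part of the aggregate rate
    weighted_sum += contribution
    print(f"{name:15s} weight={weight:6.1%}  rate={rate:6.1%}  "
          f"contribution={contribution:5.2%}")

# The weighted sum equals total cases divided by total children.
print(f"Aggregate rate: {weighted_sum:.1%}")
```

The nearest communities hold only about 5% of the county's children, so even a rate above 33% moves the aggregate by less than two percentage points. With these rounded counts the aggregate comes out near 7.5%, a little below the 8.2% quoted in the text.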
Discussion Questions
- The paternalism problem. Maya initially considers not publishing to protect the communities from stigma. Is it paternalistic to withhold information from people "for their own good"? When is paternalism justified?
- The consent gap. The health data was collected during routine clinical visits. Patients consented to treatment, not to having their data used in environmental research. Does Maya's use of the data violate their consent? Does the public health benefit justify the use?
- The Simpson's paradox angle. If a county commissioner sees only the aggregate 8.2% asthma rate and concludes "no crisis," who is responsible for the resulting inaction? The commissioner for not asking for disaggregated data? Maya for not providing it? The system for allowing aggregate statistics to mask local crises?
- Environmental justice. The affected communities are disproportionately low-income and communities of color. Does this change the ethical calculus? Should the history of environmental racism in the U.S. affect Maya's decision about what to publish?
- The property value dilemma. If publishing data reduces property values in affected communities, who bears the cost? The residents? The polluter? The government? Is there a way to publish the data while mitigating the financial harm?
- Your call. If you were Maya, what would you do? Write a 200-word justification that draws on at least two ethical frameworks and explicitly addresses the Simpson's paradox in the data.
Key Takeaways
- Public health data creates a genuine tension between privacy and the public good — there is no option that avoids all harm
- Simpson's paradox can hide environmental health crises inside reassuring aggregate statistics
- Community consultation transforms the ethics of publication from "decide for them" to "decide with them"
- Data about communities should serve those communities, not just describe them
- The ecological fallacy applies here too: the county-level asthma rate doesn't describe the experience of children in Millbrook Heights