Case Study 2: The Ethics of Scraping — When Public Data Isn't Really Public
Tier 2 — Attributed/Composite Example: This case study is based on real events and legal cases in the web scraping domain, including hiQ Labs v. LinkedIn (2022), the Clearview AI controversies (2020-present), and various academic research ethics cases. The main character and specific narrative details are fictional, but the ethical dilemmas, legal references, and industry practices described are based on documented real-world situations.
The Setting
Elena — our public health analyst — has been asked to lead a new project at her public health department: tracking community sentiment about vaccination programs using social media data. Her director wants to understand how public attitudes toward childhood vaccination vary across neighborhoods, and whether negative sentiment on social media correlates with lower vaccination rates.
"Social media posts are public," her director says. "Anyone can see them. Just collect the posts that mention vaccines and analyze the sentiment."
Elena has the technical skills — Chapter 13 taught her how to use APIs and BeautifulSoup. But something about this project makes her uncomfortable, and she can't quite articulate why. So she does what a good data scientist does: she investigates before she acts.
Scenario 1: The "Public Data" Assumption
Elena starts by examining the assumption behind the project: that publicly posted social media data is free to collect and analyze.
She opens her notebook and writes a Markdown cell:
Question: Is publicly visible data the same as freely usable data?
She researches and finds that the answer is definitively "no." Here's what she discovers:
What "public" means technically: A social media post that anyone can view without logging in is technically public — there's no access control preventing you from reading it.
What "public" means legally: The legal status depends on jurisdiction, the platform's Terms of Service, and the nature of the data. The Terms of Service for most major social media platforms explicitly prohibit automated data collection (scraping) without permission. Violating ToS may not always be illegal, but it creates legal risk.
What "public" means ethically: When someone posts on social media, they're communicating with their perceived audience — friends, followers, their community. They're generally not consenting to have their posts collected, analyzed, and used in research by a government agency. The context of the communication matters, even if the access is unrestricted.
Elena writes this insight in her notebook:
Just because I can see the data doesn't mean the person who posted it intended for me to collect and analyze it at scale. There's a difference between a human reading a post and a script collecting thousands of posts.
Scenario 2: The Platform API Approach
Elena investigates whether social media platforms offer APIs for research. She finds that most major platforms do have APIs, but with important constraints:
- Access tiers: Most platforms offer limited free access and expanded access for verified academic researchers
- Rate limits: Strict limits on how much data you can collect per time period
- Terms of use: Restrictions on how the data can be stored, shared, and used
- Content policies: Prohibitions on re-publishing individual posts or identifying users
Elena applies for academic research access through the platform's official program. Her application requires:

- A description of the research purpose
- A data management plan (how she'll store and protect the data)
- An ethics board review (even though her department isn't an academic institution, she seeks equivalent review)
- A commitment not to attempt to identify individual users
This process takes three weeks. It would have been faster to just start scraping — but faster isn't the same as right.
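Once access is granted, the rate limits Elena agreed to are easy to respect in code. The sketch below is a minimal illustration, not any real platform's API: the `fetch_page` callable and its page-numbered interface are assumptions. It spaces out requests by a minimum interval and stops when a page comes back empty.

```python
import time

def collect_posts(fetch_page, max_pages, min_interval,
                  sleep=time.sleep, clock=time.monotonic):
    """Fetch up to max_pages batches, waiting at least min_interval
    seconds between calls to stay under the platform's rate limit.

    fetch_page(page_number) is assumed to return a list of posts,
    and an empty list when the results are exhausted.
    """
    posts = []
    last_call = None
    for page in range(max_pages):
        now = clock()
        if last_call is not None and now - last_call < min_interval:
            # Pause just long enough to honor the rate limit
            sleep(min_interval - (now - last_call))
        last_call = clock()
        batch = fetch_page(page)
        if not batch:  # no more results: stop politely
            break
        posts.extend(batch)
    return posts
```

Passing `sleep` and `clock` as parameters keeps the function testable without real delays; in production the defaults apply.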
Scenario 3: The Privacy Paradox
While waiting for API access, Elena thinks more deeply about the privacy implications. She draws a table in her notebook:
| Scenario | Ethical? | Why? |
|---|---|---|
| Reading a public tweet about vaccines | Yes | Same as reading a newspaper letter to the editor |
| Manually noting themes from 50 public posts | Probably yes | Small-scale observation, similar to fieldwork |
| Scraping 100,000 posts with location data | Concerning | Scale changes the nature; location data enables identification |
| Correlating posts with health records by neighborhood | Alarming | Even aggregated, this could stigmatize communities |
| Publishing examples of "anti-vaccine sentiment" with usernames | Wrong | Individual posts taken out of context; could lead to harassment |
Elena realizes that the ethical concern scales with volume, specificity, and potential for harm. What's acceptable for 50 posts becomes problematic for 100,000.
Scenario 4: Learning from the Clearview AI Case
Elena reads about Clearview AI, a company that scraped billions of photos from social media to build a facial recognition database sold to law enforcement. Key facts:
- The photos were publicly visible
- Clearview argued they had a First Amendment right to collect public information
- Facebook, Twitter, YouTube, and LinkedIn all sent cease-and-desist letters
- The Australian Information Commissioner found Clearview violated privacy law
- The UK's Information Commissioner's Office fined Clearview more than £7.5 million
- Clearview settled an Illinois lawsuit brought under the state's Biometric Information Privacy Act
- Canada's privacy commissioner found Clearview violated Canadian privacy law
The pattern was clear: "publicly accessible" was not a defense. The scale of collection, the sensitivity of the data (biometric), the commercial use, and the lack of consent all contributed to the legal and ethical violations.
Elena writes:
The Clearview case establishes a principle I should follow: the scale and purpose of data collection matter as much as the accessibility of the data. My project is for public health, not commercial profit — but the same principles about consent and proportionality still apply.
Scenario 5: The hiQ Labs Decision
Elena also studies the hiQ Labs v. LinkedIn case (2022), which reached the opposite conclusion from Clearview in some respects:
- hiQ Labs scraped public LinkedIn profiles to provide HR analytics
- LinkedIn tried to block the scraping with cease-and-desist letters
- The Ninth Circuit ruled that scraping publicly accessible data was not a violation of the Computer Fraud and Abuse Act (CFAA)
- The court noted that public data doesn't become private just because a company wants to control it
But Elena notices important nuances:

- The ruling was about the CFAA specifically — other laws (state privacy laws, copyright, ToS violations) may still apply
- The case involved publicly accessible business profiles, not personal health opinions
- The ruling didn't say scraping was ethical — only that it wasn't a federal computer crime
Legal ≠ Ethical. Just because something isn't a crime doesn't mean it's the right thing to do.
Scenario 6: Elena's Decision
After three weeks of research, Elena presents a revised proposal to her director:
What she recommends instead of scraping:
1. Use the platform's official Academic Research API with approved access. This respects the platform's terms and provides structured data within defined boundaries.

2. Aggregate, don't individualize. Analyze sentiment at the neighborhood or city level, never at the individual user level. Report trends, not specific posts.

3. Strip identifying information. Remove usernames, profile photos, and exact timestamps before analysis. Store only the text, a general location (city or ZIP code), and the date.

4. Don't correlate with health records directly. Instead of linking social media sentiment to individual-level health data, compare neighborhood-level sentiment averages with neighborhood-level vaccination rates. This preserves privacy while still answering the research question.

5. Get ethics review. Submit the project to an institutional review process, even if not formally required for non-academic research.

6. Document everything. Create a data ethics statement describing what data was collected, how it was obtained, what protections are in place, and what limitations apply.
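The "aggregate, don't individualize" and "strip identifying information" steps can be sketched together. Everything here is illustrative: the field names (`zip`, `timestamp`), the small-cell suppression threshold, and the `score` callable, which stands in for whatever sentiment model the department would actually use.

```python
from collections import defaultdict
from statistics import mean

def deidentify(post):
    """Keep only text, coarse location, and date.
    Usernames, profile photos, and exact times are dropped."""
    return {
        "text": post["text"],
        "zip": post["zip"],
        "date": post["timestamp"][:10],  # "YYYY-MM-DDTHH:MM:SS" -> "YYYY-MM-DD"
    }

def neighborhood_sentiment(posts, score, min_posts=5):
    """Average a sentiment score per ZIP code.

    ZIP codes with fewer than min_posts posts are suppressed,
    so a single distinctive post can't be traced to its area.
    """
    by_zip = defaultdict(list)
    for post in posts:
        clean = deidentify(post)
        by_zip[clean["zip"]].append(score(clean["text"]))
    return {z: mean(scores) for z, scores in by_zip.items()
            if len(scores) >= min_posts}
```

The suppression threshold mirrors Elena's insight from Scenario 3: what's acceptable in aggregate becomes risky when a cell contains only a handful of people.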
Her director is initially frustrated — "You're making this much harder than it needs to be." But after Elena walks through the Clearview case and the potential consequences of unethical data collection (public backlash, legal liability, damaged community trust), the director agrees.
The Ethics Checklist in Practice
Elena formalizes her decision process into a checklist that her department adopts for future projects:
DATA COLLECTION ETHICS REVIEW
Project: ____________________________________________
Date: _______________________________________________
Reviewer: ___________________________________________
DATA SOURCE
[ ] Is there an official API or data access program?
[ ] Have we read and will we comply with the ToS?
[ ] Have we checked robots.txt (if scraping)?
[ ] Are we using the least invasive collection method?
PRIVACY
[ ] Does the data contain personal information?
[ ] Could individuals be re-identified from the data?
[ ] Have we minimized the data collected (only what's needed)?
[ ] Is the data stored securely with access controls?
CONSENT AND PURPOSE
[ ] Would the data subjects reasonably expect this use?
[ ] Is the purpose proportional to the privacy impact?
[ ] Have we sought ethics review or approval?
[ ] Have we documented our ethical reasoning?
POTENTIAL HARM
[ ] Could this data collection stigmatize communities?
[ ] Could individuals be harmed if the data leaked?
[ ] Could the analysis be used to discriminate?
[ ] Have we considered the power dynamics involved?
If any answer above raises concern, PAUSE and seek review.
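The robots.txt item on the checklist can be automated with Python's standard library: `urllib.robotparser` parses the file and answers per-URL questions. The sketch below takes the robots.txt text directly, so it runs without a network call; in practice the file lives at the site root (e.g. `https://example.com/robots.txt`).

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits
    user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that robots.txt is advisory, not an access control: a "yes" here answers only one checklist line, not the privacy and consent questions that follow it.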
What Elena Learned
1. "Public" is not binary. Data can be publicly accessible while still raising privacy concerns. The context, scale, and purpose of collection all matter.

2. Legal and ethical are different questions. Something can be legal but unethical (scraping personal data for surveillance), or technically a ToS violation but ethically defensible (academic research on public health trends). Both dimensions matter.

3. Slow down when you feel uncomfortable. Elena's initial discomfort was her ethical intuition working correctly. Instead of ignoring it, she investigated — and that investigation led to a better, more defensible approach.

4. The easy path isn't always the right path. Scraping would have been faster than applying for API access. But the slower approach was more ethical, more sustainable (API access is reliable; scraping breaks when websites change), and more defensible if anyone questions their methods.

5. Ethics is a skill, not a constraint. Good ethical reasoning doesn't just prevent harm — it leads to better research. Elena's revised approach produced higher-quality data (from an official API), stronger conclusions (with documented methodology), and greater community trust (because the department could demonstrate responsible practices).
Discussion Questions
1. Elena decided that analyzing 100,000 posts with location data was "concerning" but analyzing 50 posts manually was "probably okay." Where exactly does the line fall? Is there a principled way to determine when scale transforms an acceptable practice into a problematic one?

2. Your university asks you to scrape public Rate My Professor reviews to identify professors with consistently low ratings. The reviews are publicly posted by students who chose to share them. Is this ethical? What if the purpose is to provide additional teaching support to struggling instructors?

3. A non-profit health organization wants to monitor Twitter for mentions of disease symptoms to detect outbreaks early (syndromic surveillance). The posts are public, the purpose is clearly beneficial, and no individual would be identified. Does this change the ethical calculus compared to Elena's project? Why or why not?

4. Some argue that if you post something publicly on the internet, you've forfeited any expectation of privacy for that content. Others argue that contextual expectations matter — posting for your followers is different from posting for a data scientist's algorithm. Which position do you find more persuasive, and why?

5. If Elena's department had gone ahead with unethical scraping and the public found out, what consequences might follow — for the department, for public trust in health programs, and for the communities whose data was collected?