Learning Objectives
- Define data, metadata, and datafication and explain how they relate to everyday life
- Identify at least five types of data generated by routine daily activities
- Distinguish between structured, unstructured, and semi-structured data
- Explain the data lifecycle from collection to deletion
- Analyze a real-world scenario to identify the data flows involved and their implications
- Articulate why data governance is a social and ethical concern, not merely a technical one
In This Chapter
- Chapter Overview
- 1.1 What Is Data?
- 1.2 The Datafication of Everything
- 1.3 Types and Structures of Data
- 1.4 The Data Lifecycle
- 1.5 Why Data Governance Matters
- 1.6 The Landscape Ahead
- 1.7 Chapter Summary
- What's Next
- Chapter 1 Exercises → exercises.md
- Chapter 1 Quiz → quiz.md
- Case Study: A Day in Data → case-study-01.md
- Case Study: The AOL Search Log Release → case-study-02.md
Chapter 1: The Data All Around Us
"Data is not just numbers. Data is people. Each data point represents a person, a story, a life." — Cathy O'Neil, Weapons of Math Destruction
Chapter Overview
It is 7:14 a.m., and you have already generated data hundreds of times.
Your phone's alarm went off, logging the time you woke up and how many times you hit snooze. Your smart thermostat registered that your bedroom reached 68 degrees and adjusted the HVAC system accordingly, noting the duration and energy cost. While brushing your teeth, your electric toothbrush recorded brush time and pressure. You checked your phone — three notifications, two swipes, one tap — each interaction catalogued by your operating system and the apps involved. Your weather app sent your precise GPS coordinates to a server to deliver a forecast. Your coffee maker, if it's a newer model, logged that you brewed 12 ounces of medium roast at 7:06 a.m.
None of this required your conscious participation. You did not sit down and fill out a form. You did not consent to each of these data transmissions individually. The data simply happened — generated as a byproduct of living in a networked world.
This chapter is about that phenomenon: the pervasive, largely invisible production of data that accompanies modern life. Before we can discuss who should govern data, what privacy means, or how algorithms shape society, we need to understand what data is, where it comes from, and why it matters that so much of it exists.
In this chapter, you will learn to:
- Recognize the data you generate in everyday life — and the data generated about you without your action
- Distinguish between different types and structures of data
- Trace a data point from creation through storage, use, sharing, and eventual deletion
- Understand why the sheer volume and variety of data creates social and ethical challenges that demand governance
1.1 What Is Data?
The word "data" comes from the Latin datum, meaning "something given." In its broadest sense, data is any representation of facts, concepts, or instructions in a form suitable for communication, interpretation, or processing. A temperature reading is data. A name is data. A photograph is data. The timestamp on this sentence — the moment you read it — could be data, if someone chose to record it.
But definitions only get us so far. To understand data as a social force, we need to think about it not as a static thing but as a dynamic process — something that is created, collected, stored, analyzed, shared, and acted upon. Each of those stages involves human choices, institutional interests, and power dynamics.
1.1.1 Data as Representation
At the most basic level, data represents something about the world. A hospital records a patient's blood pressure as "120/80 mmHg." A university records a student's grade as "B+." A social media platform records that you liked a post at 3:42 p.m. on a Tuesday.
Each of these is a representation — a translation of some aspect of reality into a storable, transmissible form. And every act of representation involves choices:
- What to measure. A hospital that records blood pressure but not housing status has made a decision about what matters. A university that records grades but not learning has made a similar one.
- How to categorize. When a form asks you to select a gender from "Male / Female / Other," it has imposed a classification system. When a police department records a stop as "suspicious behavior," it has applied a label with enormous consequences.
- What to leave out. Every dataset is a reduction. The patient's blood pressure reading doesn't capture their anxiety in the doctor's office. The student's grade doesn't capture what they actually learned. The "like" doesn't capture whether they were being ironic.
Intuition: Think of data as a photograph, not a window. A photograph shows you something real, but it's shaped by the photographer's choices — angle, framing, lighting, what's cropped out. Data works the same way. It shows you something real, but it's shaped by the choices of whoever designed the system that collected it.
1.1.2 Data vs. Information vs. Knowledge
These three terms are often used interchangeably in casual speech, but the distinctions matter:
| Concept | Definition | Example |
|---|---|---|
| Data | Raw facts without context | "37.2, 37.8, 38.1, 36.9" |
| Information | Data organized and contextualized | "Patient temperatures over 4 hours show a rising trend" |
| Knowledge | Information interpreted through experience and judgment | "This fever pattern, combined with other symptoms, suggests a bacterial infection requiring antibiotics" |
The transformation from data to information to knowledge requires human (or algorithmic) interpretation at each stage. And at each stage, errors, biases, and value judgments can enter.
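The first rung of that ladder can be sketched in code. The following is a minimal, hypothetical illustration using the table's temperature readings; the threshold logic is invented for this sketch:

```python
# Hypothetical illustration: the same four readings at two rungs of the ladder.

# Data: raw facts without context
readings = [37.2, 37.8, 38.1, 36.9]  # just numbers

# Information: data organized and contextualized.
# Pairing readings with an ordering and describing the pattern is itself
# an interpretive choice (here: compare the first and third readings).
trend_rising = readings[2] > readings[0]
information = ("Patient temperatures over 4 hours show a rising trend"
               if trend_rising
               else "Patient temperatures over 4 hours are stable or falling")

# Knowledge requires judgment the code cannot supply: whether this fever
# pattern, combined with other symptoms, suggests a bacterial infection
# is a clinical interpretation, not a computation.
```

Notice that even this toy example embeds a value judgment — which readings count as evidence of a "trend" — echoing the point that interpretation enters at every stage.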
Mira Chakravarti learned this distinction on her first day working in her university's Office of Institutional Research. Her supervisor handed her a spreadsheet with 40,000 rows of student enrollment data and said, "Tell me something useful." The data was there — course codes, timestamps, demographic fields, GPA columns — but without context, without questions, it was just rows and columns. It took Mira a week to understand the data well enough to ask the right questions of it.
"I thought data analysis was about finding answers," she told her friend Eli over coffee that week. "But it's really about figuring out what questions the data can actually answer — and what it can't."
1.1.3 Metadata: The Data About the Data
One of the most consequential categories of data is metadata — data that describes other data. When you send an email, the content of the email is data. The metadata includes: who sent it, who received it, when it was sent, from what IP address, using what device, how large the file was, and what the subject line said.
Metadata may sound mundane, but its power is extraordinary. In 2013, former NSA General Counsel Stewart Baker observed: "Metadata absolutely tells you everything about somebody's life. If you have enough metadata, you don't really need content."
Consider: knowing the content of a phone call tells you what two people discussed. Knowing the metadata — that a person called a suicide hotline at 2 a.m., that the call lasted 47 minutes, and that it was the third such call this month — tells you something arguably more intimate, without a single word of the conversation.
Real-World Application: In 2014, Stanford researchers conducted the "MetaPhone" study, collecting phone metadata from volunteers. From metadata alone — no call content — they could identify a participant who was growing marijuana (calls to a dispensary and a hydroponics store), another who had a heart condition and owned a firearm, and another who was likely pregnant. Metadata is not trivial. It is a detailed portrait of a life.
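The kind of inference the MetaPhone researchers drew can be sketched in a few lines. Everything below — the labels, timestamps, and the "three calls in a month" rule — is invented for illustration; the point is only that no call content appears anywhere:

```python
from datetime import datetime

# Hypothetical call-metadata records: (callee_label, timestamp, duration_minutes).
calls = [
    ("suicide_hotline", datetime(2024, 3, 1, 2, 0), 47),
    ("suicide_hotline", datetime(2024, 3, 9, 1, 30), 35),
    ("suicide_hotline", datetime(2024, 3, 20, 3, 15), 52),
    ("pizza_place",     datetime(2024, 3, 5, 18, 0), 2),
]

# Count late-night calls (before 5 a.m.) to a single sensitive number.
hotline_calls = [c for c in calls
                 if c[0] == "suicide_hotline" and c[1].hour < 5]

# Repeated late-night calls to one number support an intimate inference
# without a single word of conversation being recorded.
inference_supported = len(hotline_calls) >= 3
```

A few tuples of who-called-whom-and-when are enough; this is why "it's only metadata" is cold comfort.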
1.2 The Datafication of Everything
The concept of "datafication" — coined by Viktor Mayer-Schönberger and Kenneth Cukier — refers to the process of rendering into data aspects of life that were previously unquantified. Friendship becomes a social graph. Movement becomes GPS trails. Health becomes a stream of biometric readings. Attention becomes clickstream data.
This is not merely digitization (converting analog information to digital form). Datafication transforms qualitative human experiences into quantitative data points that can be tracked, aggregated, analyzed, and monetized.
1.2.1 A Day in Data
Let's trace a single day — a Wednesday in the life of a college student — and inventory the data generated:
Morning (6:00 a.m. - 9:00 a.m.)
- Sleep tracker records sleep stages, duration, heart rate, blood oxygen levels
- Phone alarm logs wake time; screen usage tracking begins
- Smart meter records electricity usage spike (shower, lights, coffee maker)
- Bathroom scale (if smart) records weight, BMI, body fat percentage
- Streaming music service logs the playlist chosen, skip patterns, volume levels
- GPS begins tracking movement from apartment to campus
- Campus WiFi access point logs device MAC address and connection time
- Dining hall card swipe records purchase time, items, cost
- Campus security cameras capture face and movement

Midday (9:00 a.m. - 3:00 p.m.)
- Learning management system (LMS) records login time, pages viewed, time on each page, quiz attempts
- Campus library system logs book checkouts and database searches
- Social media platforms record posts, likes, shares, scroll time, ad impressions
- Text messages transit through carrier servers with metadata
- Email system logs all send/receive metadata
- Building access card logs which buildings entered and when

Evening (3:00 p.m. - midnight)
- Ride-share app records pickup location, destination, route, driver rating
- Payment card records purchases — amount, vendor, category, time, location
- Streaming service logs what was watched, when viewing started, when it paused, what was skipped
- Gaming platform records play sessions, in-game purchases, chat logs
- Thermostat records temperature preferences and schedule
- Phone screen time report calculates total usage, app-by-app breakdown
- Sleep tracker begins new cycle
By a conservative estimate, this student has generated thousands of individual data points across dozens of systems controlled by different organizations — most without any deliberate action.
Reflection: Before reading on, take five minutes and list every device and service you've interacted with today. For each one, try to identify what data it likely collected. You may be surprised by the length of your list.
1.2.2 Data Exhaust
Much of the data described above is not the purpose of the interaction — it's a byproduct. You used your phone to check the weather; the GPS data was incidental. You swiped your dining card to buy lunch; the transaction metadata was secondary. This byproduct data is called data exhaust — information generated as a side effect of digital activities.
Data exhaust is consequential because:
- It's generated automatically. You can't avoid it without avoiding the technology entirely.
- It's often more revealing than primary data. Your browsing history may reveal more about your interests than anything you'd voluntarily disclose.
- It has economic value. Companies have built billion-dollar businesses on the collection and analysis of data exhaust. The entire digital advertising industry depends on it.
- It's rarely subject to meaningful consent. Most people don't know their data exhaust is being collected, let alone by whom or for what purpose.
Eli Okonkwo first encountered the concept of data exhaust when he learned that the Smart City sensors installed on lampposts in his Detroit neighborhood — officially deployed for "traffic optimization" — were also collecting ambient audio, WiFi probe requests from passing phones, and license plate numbers. The traffic data was the stated purpose. Everything else was exhaust. But the exhaust was being stored, analyzed, and shared with law enforcement.
"Nobody asked us," Eli said in Dr. Adeyemi's class. "Nobody asked if we wanted our phones tracked every time we walked down the street. They said it was about traffic. It's never just about traffic."
Dr. Adeyemi nodded. "And that," she replied, "is the first of many questions we'll spend this semester investigating. When someone says 'it's just data,' what are they not telling you?"
1.2.3 The Quantified Self
At the individual level, the most visible form of datafication is the quantified self movement — the voluntary use of technology to track personal metrics. Fitness trackers count steps and monitor heart rate. Apps track food intake, meditation minutes, mood fluctuations, menstrual cycles, and sleep quality.
The quantified self raises distinctive questions:
- Agency and coercion. Is tracking truly voluntary when your health insurance offers a discount for wearing a Fitbit? When your employer's wellness program requires it?
- Accuracy and interpretation. Consumer-grade trackers are often inaccurate. A step count that's off by 15% may seem minor, but a heart rate reading that's off by 15% could cause genuine harm if someone makes health decisions based on it.
- Data destination. When you track your sleep with an app, where does that data go? The company's servers? Third-party advertisers? Researchers? Insurers?
Common Pitfall: Many students assume that because they chose to install a fitness app, the data relationship is fair and transparent. But "choice" in this context is constrained by information asymmetry — you chose to install the app, but you almost certainly did not read the privacy policy that explains how your data will be used, shared, and retained. We'll examine this gap between formal consent and meaningful consent in Chapter 9.
1.3 Types and Structures of Data
Not all data is the same. Understanding the different types and structures of data is essential for understanding the governance challenges each presents.
1.3.1 Personal Data vs. Non-Personal Data
The most consequential distinction in data governance is between personal data and non-personal data:
- Personal data is any information relating to an identified or identifiable natural person. Your name, email address, Social Security number, IP address, and biometric data are all personal data. Under the EU's General Data Protection Regulation (GDPR), even a cookie ID that can be linked back to you counts as personal data.
- Non-personal data includes aggregated statistics, anonymized datasets (if truly anonymized), weather readings, and industrial sensor data that cannot be linked to a specific person.
The boundary between these categories is far less clear than it appears. Data that seems non-personal can often be re-identified. In 2006, AOL released "anonymized" search logs from 650,000 users, replacing names with numerical IDs. Within days, New York Times journalists identified User 4417749 as Thelma Arnold, a 62-year-old widow in Lilburn, Georgia, simply by analyzing her search patterns — searches for her own last name, her town, and medical conditions.
Connection: The challenge of re-identification is central to privacy engineering. In Chapter 10, we'll explore technical approaches like k-anonymity and differential privacy that attempt to make re-identification infeasible — and the reasons they sometimes fail.
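The AOL journalists worked from search patterns, but the most widely taught re-identification technique is linkage: joining an "anonymized" dataset to a public one on quasi-identifiers such as ZIP code, birth date, and sex. The sketch below uses entirely invented records and names; it shows only the mechanics of the join:

```python
# Hypothetical linkage re-identification. An "anonymized" health record
# (name removed) still carries quasi-identifiers that can be matched
# against a public record, such as a voter roll. All data here is invented.

anonymized = [
    {"id": 4417749, "zip": "30047", "birth": "1944-07-01", "sex": "F",
     "condition": "hypertension"},
    {"id": 9001234, "zip": "30047", "birth": "1988-02-14", "sex": "M",
     "condition": "asthma"},
]

public_roll = [  # voter rolls are public in many U.S. states
    {"name": "Jane Doe", "zip": "30047", "birth": "1944-07-01", "sex": "F"},
]

def reidentify(anon_rows, public_rows):
    """Join on quasi-identifiers; a unique match re-attaches a name."""
    matches = []
    for a in anon_rows:
        hits = [p for p in public_rows
                if (p["zip"], p["birth"], p["sex"])
                == (a["zip"], a["birth"], a["sex"])]
        if len(hits) == 1:  # unique combination -> re-identification
            matches.append((hits[0]["name"], a["condition"]))
    return matches

result = reidentify(anonymized, public_roll)
```

When a combination of quasi-identifiers is unique in both datasets, removing the name accomplishes almost nothing — which is precisely why k-anonymity (Chapter 10) focuses on making those combinations non-unique.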
1.3.2 Structured, Unstructured, and Semi-Structured Data
| Type | Description | Examples | Governance Challenge |
|---|---|---|---|
| Structured | Organized in predefined formats (rows, columns, fields) | Database records, spreadsheets, transaction logs | Easier to search, audit, and regulate — but rigid categories can distort reality |
| Unstructured | No predefined format | Emails, social media posts, images, audio, video | Harder to govern, audit, or apply rules to — but often the richest data |
| Semi-structured | Some organizational properties but not a rigid schema | JSON files, XML documents, email headers | Falls between governance frameworks designed for structured or unstructured data |
An estimated 80-90% of the world's data is unstructured. This matters for governance because most data protection regulations were designed with structured data in mind — databases with clear fields like "name," "address," "date of birth." Governing unstructured data — a conversation captured by a smart speaker, a facial image in a crowd, a pattern of mouse movements on a website — is a fundamentally harder problem.
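The three types in the table can be made concrete with a single event — a "like" on a post — rendered in each form. The field names and values below are invented for illustration:

```python
import json

# Structured: fixed columns, as in a database row
structured_row = ("user_8812", "post_451", "2024-04-02T15:42:00")

# Semi-structured: JSON carries structure, but fields can vary per record.
# The first event has a "reaction" field and a nested device object;
# the second has neither. No rigid schema enforces consistency.
semi_structured = json.loads("""
[
  {"user": "user_8812", "post": "post_451", "ts": "2024-04-02T15:42:00",
   "reaction": "like", "device": {"os": "iOS", "model": "iPhone 13"}},
  {"user": "user_3307", "post": "post_451", "ts": "2024-04-02T15:43:10"}
]
""")

# Unstructured: free text, with no schema at all
unstructured = "loved this post!! so true"

# Governing semi-structured data means handling fields that may be absent:
device_os = [event.get("device", {}).get("os") for event in semi_structured]
```

The `.get(..., {})` defensiveness in the last line is the governance problem in miniature: rules written for fixed fields ("always check the `device` column") break when the field may simply not exist.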
1.3.3 Sensitive Data Categories
Certain types of data receive heightened protection under most legal frameworks because of the harm their misuse can cause:
- Health data — Medical records, genetic information, mental health status
- Financial data — Bank accounts, credit scores, transaction histories
- Biometric data — Fingerprints, facial geometry, iris scans, voiceprints
- Location data — GPS coordinates, cell tower pings, IP geolocation
- Children's data — Any data collected from minors (typically under 13 or 16, depending on jurisdiction)
- Racial and ethnic data — Prohibited from collection in some jurisdictions, required in others for purposes such as anti-discrimination monitoring
- Political and religious data — Beliefs, affiliations, voting behavior
- Sexual orientation and gender identity data — Particularly sensitive given ongoing discrimination
The classification of data as "sensitive" is itself a governance decision with significant consequences. Mira noticed this when reviewing her university's data classification policy for the OIR: student GPA data was classified as "sensitive," but data about which students used campus mental health services was classified as merely "confidential" — a lower protection tier.
"Shouldn't the mental health data be more protected than GPA?" she asked her supervisor.
"You'd think so," her supervisor replied. "But the classification was written by IT, not by anyone who thought about it from the student's perspective."
1.4 The Data Lifecycle
Data doesn't just exist — it moves through stages, and governance challenges arise at every stage.
1.4.1 The Seven Stages
COLLECTION → STORAGE → PROCESSING → ANALYSIS → SHARING → RETENTION → DELETION
    ↑                                                                   ↓
    └─────────────── (or back to collection: feedback loops) ───────────┘
1. Collection: Data is gathered from individuals, devices, sensors, transactions, or public sources. Key questions: What is collected? With what consent? For what stated purpose?
2. Storage: Data is housed in databases, data lakes, cloud services, or physical media. Key questions: Where is it stored? Who has access? How is it protected?
3. Processing: Raw data is cleaned, transformed, and organized for use. Key questions: What is included or excluded? What categories are applied? What biases might processing introduce?
4. Analysis: Processed data is examined to extract patterns, generate predictions, or support decisions. Key questions: What methods are used? Who designed them? What assumptions are embedded?
5. Sharing: Data or analytical results are transmitted to other parties — business partners, government agencies, advertisers, researchers. Key questions: Who receives it? Under what terms? Can they re-share?
6. Retention: Data is kept for a specified or unspecified period. Key questions: How long? For what justification? Is the original purpose still valid?
7. Deletion: Data is (ostensibly) removed from systems. Key questions: Is it truly deleted or merely archived? Can it be recovered? Were all copies and backups addressed?
Common Pitfall: Many organizations treat deletion as the natural end of the lifecycle, but in practice, data is remarkably persistent. Backups, cached copies, data shared with third parties, and derived datasets can survive long after the "original" is deleted. True data deletion is an engineering challenge, not a simple button press.
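A retention check (stages 6 and 7) is easy to write — and its limits are just as easy to see. The sketch below assumes a seven-year rule counted from the last activity date; the records, field names, and cutoff are invented for illustration:

```python
from datetime import date, timedelta

# Assumed policy: delete records seven years after the last activity.
RETENTION = timedelta(days=7 * 365)
today = date(2024, 6, 1)  # fixed "today" so the example is reproducible

records = [
    {"patient": "p-001", "last_visit": date(2015, 3, 10)},  # > 7 years ago
    {"patient": "p-002", "last_visit": date(2021, 9, 2)},   # within 7 years
]

to_delete = [r["patient"] for r in records
             if today - r["last_visit"] > RETENTION]

# Note what this check does NOT reach: backups, cached copies, data
# already shared with third parties, and models trained on the data are
# all untouched by this loop -- which is exactly why true deletion is an
# engineering challenge rather than a button press.
```

The loop deletes rows from one table; the pitfall above is that the data's descendants live elsewhere.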
1.4.2 The Lifecycle in Practice: VitraMed's Patient Data
Mira's father's company, VitraMed, provides a concrete illustration. When a patient visits a clinic using VitraMed's electronic health records system:
- Collection: The clinic records the patient's name, symptoms, vitals, insurance information, and treatment notes.
- Storage: VitraMed stores this data in cloud servers operated by a third-party provider (Amazon Web Services, in this case).
- Processing: VitraMed's software normalizes medical codes, links records across visits, and flags potential drug interactions.
- Analysis: VitraMed's predictive analytics module analyzes patient data to estimate health risks and suggest preventive care.
- Sharing: De-identified aggregate data is shared with public health researchers. Patient-level data may be shared with insurance companies for billing.
- Retention: VitraMed retains patient data for seven years after the last clinical visit, as required by HIPAA.
- Deletion: After seven years, data is queued for deletion — but derivative models trained on that data persist indefinitely.
"Wait," Eli said when Mira described this in class. "So the patient's data gets 'deleted' after seven years, but the patterns learned from their data live forever in the model? That doesn't sound like deletion to me."
It was the first of many times Eli would push Mira to think beyond the technical definition.
1.5 Why Data Governance Matters
By now, the scope of the challenge may be coming into focus. Data is everywhere. It's generated automatically. It flows through complex systems controlled by multiple organizations. It persists longer than most people realize. And it shapes decisions that affect people's lives — from the ads they see to the credit they receive to the medical treatments they're offered.
This is why data governance exists: because the gap between what data can do and what it should do is not self-correcting.
1.5.1 The Stakes
Consider what happens when data governance fails:
- 2017: Equifax breach. Personal financial data of 147 million Americans exposed due to an unpatched software vulnerability. The data included Social Security numbers, birth dates, and addresses — everything needed for identity theft. The company had known about the vulnerability for months.
- 2018: Cambridge Analytica. Personal data from 87 million Facebook users harvested without consent through a personality quiz app and used for political advertising targeting. The scandal revealed how data shared for one purpose (academic research) could be repurposed for another (political manipulation) with no meaningful oversight.
- 2020: Clearview AI. A facial recognition company scraped billions of photos from social media platforms without consent and sold its surveillance tool to law enforcement agencies. People who had posted a photo on Instagram had no idea it was being used to identify suspects in police investigations.
Each of these cases involved failures at specific points in the data lifecycle — collection without consent, storage without security, sharing without oversight, retention without limits. And each had consequences measured not in abstract data points but in human harm: stolen identities, manipulated elections, surveilled communities.
1.5.2 Beyond "I Have Nothing to Hide"
Mira's first real argument with Eli happened during the second week of Dr. Adeyemi's class. Mira's roommate, overhearing them discuss the course, had offered the most common response to privacy concerns: "I have nothing to hide, so I have nothing to worry about."
"That drives me insane," Eli said. "Nothing to hide from whom? The cops who patrol my neighborhood? The insurance company deciding my rates? The employer screening my social media?"
Mira, still in her pre-awakening phase, pushed back gently. "But isn't some of that data collection genuinely useful? VitraMed's analytics have caught early-stage conditions that doctors missed. My dad's company has literally saved lives."
"I'm not saying data is evil," Eli replied. "I'm saying the people whose data gets collected don't get to decide how it's used. That's the problem."
Dr. Adeyemi, who had been listening from the hallway, stepped into the conversation. "The 'nothing to hide' argument assumes that the only purpose of privacy is to conceal wrongdoing. But privacy also protects autonomy, dignity, intellectual freedom, and political dissent. We'll examine this in depth in Chapter 7. For now, I'll leave you with a question: if you have nothing to hide, would you be comfortable if I projected your complete search history on the classroom screen right now?"
The room went quiet.
Reflection: Consider the "nothing to hide" argument. Can you articulate at least two reasons why privacy matters even for people who are not engaged in any wrongdoing? Write your answer before reading on — we'll build on this question throughout the book.
1.5.3 Data as a Social Force
Data is not neutral. The act of collecting data, the categories we impose, the analyses we run, and the decisions we make based on those analyses all reflect and reinforce social structures.
When a loan algorithm trained on historical data denies credit to applicants from historically redlined neighborhoods, it is not "just following the data" — it is reproducing decades of discriminatory housing policy in a new technological form. When a hiring algorithm trained on past hiring decisions favors male candidates because the company historically hired mostly men, it is not discovering a truth — it is encoding a bias.
This is the central insight that this textbook will develop across 40 chapters: data systems are social systems. They are built by people, funded by institutions, governed (or not governed) by laws, and experienced by communities. Understanding them requires more than technical literacy. It requires ethical reasoning, historical awareness, and political imagination.
1.6 The Landscape Ahead
This opening chapter has introduced the raw material — data itself. The chapters that follow will build outward from here:
- Part 1 (Chapters 2-6) continues with foundations: the history of data and society, questions of data ownership, the attention economy, power dynamics, and ethical frameworks.
- Part 2 (Chapters 7-12) examines privacy: what it means, how surveillance works, how consent operates (and fails), and how specific types of sensitive data — health, genetic, biometric — present unique challenges.
- Part 3 (Chapters 13-19) tackles algorithmic systems and AI: bias, fairness, transparency, accountability, generative AI, and autonomous systems.
- Part 4 (Chapters 20-25) maps the governance landscape: global regulation, the EU AI Act, data governance frameworks, cross-border flows, sector-specific rules, and enforcement.
- Part 5 (Chapters 26-30) turns to corporate practice: building ethics programs, data stewardship, impact assessments, responsible AI, and crisis response.
- Part 6 (Chapters 31-37) broadens to society: misinformation, digital equity, labor, environmental ethics, children's vulnerability, national security, and Global South perspectives.
- Part 7 (Chapters 38-39) looks forward: emerging technologies and participatory design of data futures.
- Part 8 (Chapter 40) brings it home: your responsibility, from knowledge to action.
Throughout, we'll follow Mira and Eli as they navigate these questions — sometimes agreeing, sometimes arguing, always learning. We'll track VitraMed as it grows from a small EHR startup into a company confronting the full complexity of data governance. And we'll return again and again to the four themes that thread through this book:
- The Power Asymmetry — Who collects, who is collected upon, who decides.
- The Consent Fiction — The gap between the consent we give and the consent we think we give.
- The Accountability Gap — When data systems cause harm, who is responsible?
- The VitraMed Thread — How data ethics challenges compound as organizations grow.
1.7 Chapter Summary
Key Concepts
- Data is a representation of facts, shaped by choices about what to measure, how to categorize, and what to exclude.
- Metadata — data about data — can be as revealing as the data itself.
- Datafication transforms qualitative human experiences into quantitative data points.
- Data exhaust is the information generated as a byproduct of digital activities, often without the user's knowledge.
- The data lifecycle (collection → storage → processing → analysis → sharing → retention → deletion) creates governance challenges at every stage.
- Data is not neutral: it reflects and reinforces social structures.
Key Debates
- Is the "nothing to hide" argument a valid reason to accept pervasive data collection?
- Should data exhaust be treated as the property of the person who generated it or the company that collected it?
- Can the benefits of datafication (medical breakthroughs, efficiency gains, convenience) be realized without the harms (surveillance, discrimination, loss of autonomy)?
Applied Framework
When encountering any data system, ask:
1. What data is being collected?
2. By whom?
3. For what stated purpose?
4. What unstated purposes might it serve?
5. Who benefits? Who bears the risk?
6. What governance mechanisms exist — and are they adequate?
What's Next
In Chapter 2: A Brief History of Data and Society, we'll trace how the relationship between data collection and social power has evolved from ancient censuses through colonial statistics, Cold War computing, the internet revolution, and the Big Data era. Understanding this history is essential — because many of today's governance challenges have roots that stretch back centuries.
Before moving on, complete the exercises and quiz to solidify your understanding of the concepts introduced in this chapter.