
Learning Objectives

  • Explain why data quality determines AI system quality
  • Identify common sources of bias in training data
  • Distinguish between types of data (structured, unstructured, labeled, unlabeled)
  • Evaluate ethical implications of data collection practices
  • Analyze a dataset for potential representational gaps

Chapter 4: Data — The Fuel That Powers AI (And Its Biggest Weakness)

"Data is not a mirror of reality. It is a portrait — painted by particular people, for particular purposes, with particular tools." — Illustrative framing

Opening Vignette

In 2015, a software engineer posted a screenshot that went viral. Google Photos had automatically tagged a photo of two Black friends as "gorillas." Google's response was swift — it apologized and removed the "gorilla" label entirely. But here is the part that should trouble you: years later, journalists found the label was still blocked. Rather than fixing the underlying problem, Google had simply prevented the system from ever applying certain labels. The AI was not malicious. It was not "racist" in any human sense of the word. It had simply learned from a training dataset that contained far more photos of light-skinned faces than dark-skinned ones. The system learned the world as its data described it — and that description was incomplete.

This is the chapter where we look under the hood at the ingredient that makes every AI system run: data. In Chapter 3, you learned that machines learn by finding patterns. Now we need to ask a harder question: patterns in what? Where does the data come from? Who collected it? Who decided what to label and how? And what happens when the data is missing, biased, or just plain wrong?

The answers to these questions matter more than you might expect. They are often the difference between an AI system that helps people and one that harms them.


4.1 Where Data Comes From (And Who Creates It)

Every AI system starts with data. Without data, machine learning has nothing to learn from — no patterns to detect, no relationships to model, no predictions to make. But where does all this data actually come from?

The answer is: from us. From you, more often than you might realize.

The Data Supply Chain

Think of AI data as having a supply chain, much like the food you eat. Your breakfast cereal does not materialize on the grocery shelf. Farmers grew the grain, workers harvested it, factories processed it, trucks transported it, and grocery staff stocked the shelves. Each step involves decisions — what to grow, where to source, how to process — and each step can introduce problems.

Data works the same way. Here is a simplified version of the data supply chain:

  1. Generation — Data is created through human activity, sensors, transactions, or deliberate collection.
  2. Collection — Someone decides what data to gather, from whom, and how.
  3. Storage — Data is organized, formatted, and housed somewhere.
  4. Curation — Data is cleaned, selected, and prepared for use.
  5. Labeling — Humans (or other AI systems) tag data with categories the AI needs to learn.
  6. Training — The data is fed into a machine learning system.

At every single step, human choices shape what the AI will learn. And those choices are not neutral.

Where the Big Datasets Come From

Let us get specific. Here are some of the most common sources of AI training data:

User-generated content. Every time you post a photo on social media, write a product review, or send a message, you may be creating data that ends up training an AI. Large language models like GPT-4 were trained on massive collections of internet text — Reddit posts, Wikipedia articles, books, news sites, and more. Image generators like DALL-E learned from millions of images scraped from the web, many of them posted by ordinary people who had no idea their photos would be used this way.

Institutional records. Hospitals have decades of patient records. Courts have sentencing data. Banks have loan applications. Schools have student performance records. All of this institutional data reflects the decisions that institutions have made over time — including discriminatory ones.

Sensor and IoT data. Security cameras, fitness trackers, smart thermostats, and traffic sensors all generate continuous streams of data. Your phone's GPS creates a detailed map of everywhere you go.

Synthetic data. Increasingly, AI companies generate artificial data using other AI systems. This can help fill gaps, but it introduces its own risks — if the generator has biases, the synthetic data will too.

Deliberately collected data. Sometimes organizations run studies, surveys, or crowdsourcing campaigns specifically to gather training data. This is often more controlled, but it is also expensive, which means it is less common than simply scraping what already exists.

💡 Key Insight: The cheapest data to collect is the data that already exists — internet text, institutional records, user activity logs. But this data was created for purposes completely different from AI training. A medical record was written to treat a patient, not to train a diagnostic algorithm. That mismatch matters.

ContentGuard's Data Sources

Consider ContentGuard, the social media content moderation system we introduced in Chapter 1. What data does it need to learn? It needs millions of examples of social media posts, each labeled with a judgment: "this is hate speech," "this is acceptable," "this is spam," "this is violent content."

But where do those judgments come from? They come from human reviewers — people who look at posts and decide which category each one falls into. Those reviewers bring their own cultural backgrounds, language competencies, and personal thresholds for what counts as "offensive." A joke that is clearly sarcasm to someone steeped in American internet culture might look like a sincere threat to someone from a different cultural context. The data is not just posts; it is posts plus human judgment about those posts. And that judgment is never perfectly objective.


🔁 Spaced Review — Chapter 1 Connection In Chapter 1, we defined AI as a spectrum of techniques rather than a single technology. How does understanding the data supply chain reinforce that idea? Think about it: different AI techniques need different kinds of data. A spam filter needs labeled emails. A self-driving car needs millions of annotated road images. A language model needs vast text corpora. The data requirements are as diverse as the techniques themselves.


4.2 Types of Data: Structured, Unstructured, and Everything Between

Not all data looks the same. Understanding the different types of data will help you evaluate what an AI system is actually working with — and what it might be missing.

Structured Data

Structured data is the neat, orderly kind. It lives in spreadsheets and databases, organized into rows and columns with clearly defined categories. Think of a hospital patient database:

Patient ID | Age | Sex | Blood Pressure | Diagnosis     | Outcome
-----------|-----|-----|----------------|---------------|----------
10042      | 67  | F   | 142/88         | Pneumonia     | Recovered
10043      | 34  | M   | 118/76         | Appendicitis  | Recovered
10044      | 51  | F   | 155/95         | Heart failure | Deceased

Each field has a defined type (number, text, category), and the relationships between fields are clear. Structured data is what traditional statistics was built to handle. It is organized, searchable, and relatively easy to analyze.
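To make this concrete, here is the patient table above rebuilt as structured data in Python's pandas library, using the same invented records. Because every column has a defined type, queries become one-liners.

```python
# The patient table from the text, as a structured pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "patient_id":     [10042, 10043, 10044],
    "age":            [67, 34, 51],
    "sex":            ["F", "M", "F"],
    "blood_pressure": ["142/88", "118/76", "155/95"],
    "diagnosis":      ["Pneumonia", "Appendicitis", "Heart failure"],
    "outcome":        ["Recovered", "Recovered", "Deceased"],
})

# Structured queries are one-liners: average age of patients who recovered
print(df[df["outcome"] == "Recovered"]["age"].mean())  # 50.5
```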

But notice what this table cannot capture: the patient's description of their symptoms in their own words, the doctor's handwritten notes, the X-ray image, the tone of voice in which a patient said "I feel fine." All of that is unstructured data.

Unstructured Data

Unstructured data is everything that does not fit neatly into rows and columns. It includes:

  • Text: Emails, social media posts, news articles, books, medical notes
  • Images: Photographs, satellite imagery, medical scans, diagrams
  • Audio: Speech recordings, music, phone calls, podcasts
  • Video: Security footage, YouTube videos, surgical recordings

Here is a striking statistic: by most estimates, over 80% of the world's data is unstructured. And much of the recent AI revolution has been about learning to work with unstructured data. The breakthroughs in image recognition, natural language processing, and speech recognition that we discussed in Chapter 2 are all fundamentally about teaching machines to find patterns in messy, unstructured data.

Labeled vs. Unlabeled Data

This distinction cuts across the structured/unstructured divide. Labeled data has been tagged with the answer the AI is supposed to learn. An email labeled "spam" or "not spam." A chest X-ray labeled "pneumonia present" or "no pneumonia." A social media post labeled "hate speech" or "acceptable."

Unlabeled data has no such tags. It is just raw information — a pile of photos with no captions, a collection of medical records with no diagnoses marked, a million product reviews with no sentiment scores.

Remember from Chapter 3: supervised learning needs labeled data; unsupervised learning works with unlabeled data. This is why the distinction matters so much. Supervised learning, which powers most AI applications you interact with daily, depends entirely on the quality and accuracy of its labels.
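In code, the difference is simply whether each example carries an answer. Here is a minimal sketch with invented emails:

```python
# Labeled vs. unlabeled data, in the shape a learning system actually consumes
# (the emails below are invented for illustration).
labeled = [
    ("Win a FREE phone, click now!!!", "spam"),
    ("Lunch meeting moved to 1pm", "not spam"),
]
unlabeled = [
    "Your package has shipped",
    "URGENT: verify your account",
]

# Supervised learning needs both halves of every pair:
for text, label in labeled:
    print(f"{label:>10} <- {text}")

# Unsupervised learning gets only the raw text, with no answers attached:
for text in unlabeled:
    print(f"{'(no label)':>10} <- {text}")
```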

And labels, as we are about to see, are one of the most underappreciated sources of bias in AI systems.

Check Your Understanding A hospital wants to build an AI that reads radiology scans and flags potential tumors. What type of data is a radiology scan? (Unstructured — it is an image.) What would the labels be? (Annotations from radiologists marking regions as "tumor" or "no tumor.") What kind of learning would this system use? (Supervised learning — recall Chapter 3.)


4.3 Labels: The Human Judgment Hidden in "Objective" Data

Here is a sentence that might seem obvious but has profound implications: someone had to decide what the labels mean.

When you train a content moderation system like ContentGuard, someone had to define "hate speech." Is it only explicit slurs? Does it include dog whistles — coded language that insiders understand but outsiders might miss? What about satire that mocks bigotry by imitating it? What about a reclaimed slur used within a community? Every edge case requires a human judgment call, and those calls get baked into the training data.

The Labeling Process

Data labeling, also called annotation, is the process of attaching categories, tags, or judgments to raw data so that a supervised learning system can learn from it. It is one of the most important and least glamorous parts of the AI pipeline.

Here is how labeling typically works:

  1. An organization defines a set of categories (called a taxonomy or labeling schema). For ContentGuard, this might include: hate speech, harassment, spam, misinformation, violent content, adult content, acceptable.
  2. Written guidelines explain what each category means, with examples and edge cases.
  3. Human labelers review each data point and assign it to one or more categories.
  4. Quality checks compare labelers' decisions to see if they agree (this is called inter-annotator agreement).
  5. Disagreements are resolved — sometimes by majority vote, sometimes by an expert reviewer, sometimes by simply discarding the ambiguous example.

Step 5 should give you pause. Discarding ambiguous examples means the AI never learns about the hard cases — the very cases where getting it right matters most.
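The quality check in step 4 can be made concrete. A minimal sketch with invented labels: simple percent agreement, plus Cohen's kappa, a common inter-annotator agreement statistic that corrects for the agreement two annotators would reach by chance.

```python
# Two annotators' labels for the same six posts (invented for illustration).
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for what two annotators would match on by chance."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators independently pick the same label
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_observed - p_chance) / (1 - p_chance)

a = ["hate", "ok", "ok", "spam", "ok", "hate"]
b = ["hate", "ok", "spam", "spam", "ok", "ok"]
print(round(percent_agreement(a, b), 2))  # 0.67
print(round(cohens_kappa(a, b), 2))       # 0.48
```

If kappa is low for a category like "hate speech," the category definition itself is probably ambiguous, which is exactly the step 5 problem described above.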

When Labels Encode Values

Consider MedAssist AI, our hospital diagnostic system. Suppose you are training it to predict which emergency room patients are at high risk of serious illness. You need labeled data: past cases where doctors documented whether a patient was "high risk" or "low risk."

But here is the problem: research has repeatedly shown that physicians tend to underestimate pain in Black patients compared to white patients with identical symptoms. A well-documented study published in Proceedings of the National Academy of Sciences (Hoffman et al., 2016) found that a significant proportion of white medical students and residents held false beliefs about biological differences between Black and white people — such as believing that Black people have thicker skin or less sensitive nerve endings. These beliefs correlated with less accurate pain treatment recommendations.

If MedAssist AI trains on historical medical records, it inherits those biased assessments. The labels — "high risk" versus "low risk" — are not objective truths about patients. They are records of what doctors believed about patients, which is a very different thing.

⚠️ Threshold Concept: Data Is Never Neutral This is one of the most important ideas in this entire book. Data does not simply reflect reality — it reflects the priorities, assumptions, biases, and blind spots of the people who created it. A crime dataset does not show where crime happens; it shows where police looked for crime and made arrests. A hiring dataset does not show who is qualified; it shows who got hired under a system that may have discriminated. A medical dataset does not show who is sick; it shows who had access to healthcare and received a diagnosis.

Data is never neutral. It encodes the world that created it. Once you see this, you cannot unsee it — and you will never evaluate an AI system the same way again.

The Ground Truth Problem

In AI, ground truth refers to the correct answer that a model is trying to learn. But as we have just seen, ground truth is often not a fact about the world — it is a human judgment, an institutional decision, or a social construction.

This does not mean all labels are equally unreliable. A label that says "this photo contains a cat" is pretty straightforward (though even here, what about a drawing of a cat? A person in a cat costume?). But a label that says "this loan applicant is creditworthy" or "this social media post is harmful" involves complex value judgments that reasonable people can disagree about.

The more subjective the label, the more human judgment is embedded in the training data, and the more carefully you should scrutinize an AI system built on that data.


🔁 Retrieval Prompt Pause and reflect: Name three types of data we have discussed so far (structured, unstructured, labeled, unlabeled). For each, give an example not mentioned in this chapter. Then explain in one sentence why the distinction between labeled and unlabeled data matters for machine learning.


4.4 When Data Goes Wrong: Bias, Gaps, and Ghost Data

Now that you understand where data comes from and what labels really are, we can tackle the big question: what happens when data goes wrong? The answer, unfortunately, is: a lot.

The Bias Pipeline

Bias can enter the data at every stage of the supply chain we described in Section 4.1. Here are the most common types:

Selection bias occurs when the data is collected from a non-representative sample. If you build a facial recognition system using photos primarily from dating websites, your training data will skew toward certain age ranges, attractiveness norms, and demographics. If you train a hiring AI on data from a single company's past employees, you will replicate that company's hiring patterns — including its biases.

Historical bias occurs when the training data accurately reflects a world that was itself unjust. U.S. mortgage lending data from the 20th century reflects decades of redlining — the systematic denial of loans to residents of predominantly Black neighborhoods. An AI trained on this data would learn that geography (often a proxy for race) is a strong predictor of creditworthiness. The data is technically accurate about what happened. But what happened was discriminatory.

Measurement bias occurs when the way data is collected systematically distorts certain groups' representation. Consider CityScope Predict, the predictive policing system. Crime data comes from police reports. But policing is not distributed evenly: neighborhoods with heavier police presence generate more crime reports, not necessarily because more crime occurs there, but because more crime is detected and recorded there. The data measures policing intensity as much as it measures crime.

Aggregation bias occurs when a model treats distinct groups as if they are the same. A diabetes risk model trained on data that combines multiple ethnic groups may perform well on average but poorly for any individual group, because diabetes presents differently across populations.

Ghost Data: The Bias of Absence

One of the most insidious forms of data bias is not what is in the data but what is missing from it. We call this ghost data — information about people, places, or phenomena that were never recorded in the first place.

Consider this: if you are building an AI system to detect skin cancer from photographs, you need photos of skin conditions on many different skin tones. But dermatology textbooks and research databases have historically overrepresented light skin. A study published in the Journal of the American Academy of Dermatology found that dark skin was represented in a small minority of images across major textbook resources. The result: AI skin cancer detection tools perform significantly worse on darker skin.

The people who are missing from the data do not show up as an error. They show up as nothing. The system simply does not know what it does not know.
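One way to hunt for ghost data is to compare a training set's composition against a reference population. The category names and all numbers below are invented for illustration; no real dermatology dataset is described here.

```python
# Hypothetical representational-gap audit. Shares and counts are invented.
reference_share = {"type_I_II": 0.45, "type_III_IV": 0.35, "type_V_VI": 0.20}
training_counts = {"type_I_II": 8200, "type_III_IV": 1500, "type_V_VI": 300}

total = sum(training_counts.values())
for group, expected in reference_share.items():
    actual = training_counts[group] / total
    # Flag groups represented at less than half their population share
    flag = "  <-- severely underrepresented" if actual < 0.5 * expected else ""
    print(f"{group}: {actual:.1%} of training data vs {expected:.0%} of population{flag}")
```

Note what this audit cannot do: it only works if you already know which groups should be present. The deepest ghost data is the category no one thought to check.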

📊 Research Spotlight: The ImageNet Problem ImageNet, one of the most influential datasets in AI history, contains over 14 million labeled images and has been used to train many of the image recognition systems that power everyday applications. Researchers Kate Crawford and Trevor Paglen conducted an investigation of ImageNet's "person" categories and found deeply troubling labels. People in photos had been tagged with terms like "bad person," "call girl," "drug addict," and "alcoholic" — labels applied by crowdworkers based solely on a photograph. Crawford and Paglen's work (published as part of the "Excavating AI" project) exposed how human biases, stereotypes, and moral judgments get encoded into supposedly neutral training data. In response, ImageNet's creators removed hundreds of thousands of images from the person categories. But the models already trained on this data continue to circulate.

CityScope Predict: A Feedback Loop in Action

Let us return to CityScope Predict to see how data bias does not just reflect the past — it can actively shape the future.

Here is the cycle:

  1. Historical crime data shows more arrests in neighborhoods A and B (which happen to be lower-income, predominantly Black and Latino communities).
  2. CityScope Predict is trained on this data and learns that neighborhoods A and B are "high crime areas."
  3. The system recommends that police send more patrols to neighborhoods A and B.
  4. More patrols lead to more arrests (for the same rate of underlying activity).
  5. The new arrest data confirms and strengthens the original pattern.
  6. Return to step 2. The cycle repeats.

This is a feedback loop — the AI's predictions become self-fulfilling prophecies. The data bias is not a one-time problem; it is an ongoing engine that amplifies existing disparities. We will explore feedback loops in much greater depth in Chapter 7 (AI Decision-Making) and Chapter 9 (Bias and Fairness), but the important thing to understand now is that biased data does not just produce a biased snapshot. It produces a biased system that generates more biased data over time.
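The six-step cycle above can be simulated in a few lines. This is a toy model with invented numbers, not real crime data: both neighborhoods have identical underlying crime rates, and patrol allocation is assumed to respond superlinearly to recorded arrests (exponent 1.5, a stand-in for "concentrate units on the hottest spots").

```python
# Toy simulation of the patrol -> arrest -> patrol feedback loop.
def simulate(years=5, exponent=1.5, patrol_budget=100.0):
    recorded = {"A": 120.0, "B": 80.0}  # biased historical arrest counts
    shares = []
    for _ in range(years):
        # Patrols allocated superlinearly to recorded arrests (the "prediction")
        weights = {n: recorded[n] ** exponent for n in recorded}
        total_weight = sum(weights.values())
        for n in recorded:
            patrols = patrol_budget * weights[n] / total_weight
            # Identical true rates: new arrests scale only with patrol presence
            recorded[n] += patrols
        shares.append(recorded["A"] / sum(recorded.values()))
    return shares

for year, share in enumerate(simulate(), start=1):
    print(f"Year {year}: A's share of recorded arrests = {share:.3f}")
```

Even in this crude model, neighborhood A's share of recorded arrests grows every year despite identical underlying behavior. The system is learning about policing, not crime.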

🔴 Productive Struggle Consider this scenario: You are building an AI system to recommend candidates for job interviews. You train it on ten years of your company's hiring data. The company has 500 employees, 85% of whom are men. The AI learns that being male is a strong predictor of being hired.

Should you remove gender from the training data? Before reading on, take two minutes to think about why that might — or might not — solve the problem.

The struggle: Even if you remove the "gender" field, the AI can often reconstruct gender from proxy variables — names, college affiliations, participation in certain extracurricular activities, even word choices in resumes. Amazon discovered exactly this problem with an AI recruiting tool they developed in 2014 and ultimately scrapped. Removing a variable is not the same as removing its influence. This is one reason data bias is so difficult to fix.
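The proxy problem is easy to demonstrate. Everything below, the applicants, the fields, and the rule, is invented for illustration:

```python
# Toy demonstration: the "gender" field is dropped from the features a model
# sees, yet leftover fields reconstruct it. All data here is invented.
applicants = [
    {"college": "Smith College", "sport": "softball", "gender": "F"},
    {"college": "Wellesley",     "sport": "lacrosse", "gender": "F"},
    {"college": "Morehouse",     "sport": "football", "gender": "M"},
    {"college": "State U",       "sport": "football", "gender": "M"},
    {"college": "State U",       "sport": "softball", "gender": "F"},
]
WOMENS_COLLEGES = {"Smith College", "Wellesley"}  # single-sex institutions

def infer_gender(record):
    """A crude rule a model could learn without ever seeing the gender field."""
    if record["college"] in WOMENS_COLLEGES:
        return "F"
    if record["sport"] == "softball":
        return "F"  # sport participation skews by gender in this toy data
    return "M"

correct = sum(infer_gender(r) == r["gender"] for r in applicants)
print(f"Reconstructed the removed attribute for {correct}/{len(applicants)} applicants")
```

Real models do this implicitly: any feature correlated with the removed attribute lets the old pattern back in.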


4.5 Data Ethics: Consent, Power, and Provenance

So far, we have focused on data quality — is it accurate, representative, and fair? But there is an equally important set of questions about data ethics: was the data legitimately collected? Did people consent? Who owns it? Who profits from it?

When researchers at Clearview AI scraped billions of photos from social media profiles to build a facial recognition database, did the people in those photos consent to having their faces used for surveillance technology? Almost certainly not. They posted selfies for their friends, not for a company that would sell their biometric data to law enforcement agencies.

This pattern is pervasive. Most of the text used to train large language models was written by people who had no idea it would be used for AI training. Artists whose work was included in image generation training sets did not consent to having their styles replicated by AI. Patients whose medical records train diagnostic systems often signed broad consent forms that technically allowed research use but never imagined AI applications.

The legal frameworks are struggling to keep up. The European Union's General Data Protection Regulation (GDPR) requires specific, informed consent for data processing and gives people the right to have their data deleted. But enforcing these rights against massive, already-trained AI systems is a legal frontier that courts are still navigating.

Data as Power

Here is a way of thinking about data that goes beyond individual consent: data collection is an act of power. The entities that collect the most data — tech companies, governments, large institutions — are also the entities that build and control AI systems. The people whose data is collected — users, patients, citizens, workers — often have little say in how that data is used.

This power dynamic matters. When a hospital system deploys MedAssist AI, patients in the system's network have their data used to train and improve the tool. Those patients may benefit from better diagnostics. But the hospital also benefits from improved efficiency and reputation. And the AI company that built MedAssist AI benefits from having a more capable product to sell to other hospitals. The distribution of benefits is uneven, and the distribution of control is even more uneven.

⚖️ Ethical Analysis: Who Benefits from Data Collection? For any AI system, you can ask four questions about data ethics:

  1. Consent: Did the people whose data was used know about and agree to its use?
  2. Benefit: Who benefits from the system the data enables? Do data subjects share in those benefits?
  3. Control: Who decides how the data is used? Can data subjects opt out?
  4. Risk: Who bears the risk if the data is misused, breached, or produces harmful outcomes?

Try applying these four questions to ContentGuard. The people whose posts are used to train the content moderation system are also the people who will be subject to its judgments. But they had no say in the training process, and they often have limited ability to appeal automated decisions about their content.

Data Provenance

Data provenance — the documented history of where data came from, how it was processed, and who handled it — is increasingly recognized as essential for responsible AI. Just as you might want to know where your food was grown, how it was processed, and what chemicals were used, AI practitioners need to know where their training data originated.

Good data provenance documentation answers questions like:

  • Who collected this data, and when?
  • What was the original purpose of collection?
  • What population does it represent? Who is missing?
  • How were labels assigned? By whom?
  • What preprocessing was applied? Were any records removed?
  • What are the known limitations?

Some researchers have proposed datasheets for datasets — standardized documentation that accompanies training data, analogous to nutrition labels on food. This concept, proposed by Timnit Gebru and colleagues in a paper first circulated in 2018 and published in Communications of the ACM in 2021, would require dataset creators to answer a structured set of questions about their data's composition, collection process, recommended uses, and limitations. The idea has gained traction but is far from universal practice.
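A team could record answers to these provenance questions directly alongside its data. The sketch below is a simplified stand-in, not the official schema from the datasheets paper; the field names are our own shorthand for the questions listed above.

```python
# Simplified provenance record with a check that flags unanswered questions
# before a dataset is cleared for training. Field names are illustrative.
from dataclasses import dataclass, fields

@dataclass
class Datasheet:
    collected_by: str = ""
    collected_when: str = ""
    original_purpose: str = ""
    population_represented: str = ""
    known_gaps: str = ""
    labeling_process: str = ""
    preprocessing: str = ""
    known_limitations: str = ""

def missing_fields(sheet: Datasheet) -> list[str]:
    """Provenance questions still unanswered for this dataset."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]

sheet = Datasheet(
    collected_by="Hospital network records team",
    collected_when="2005-2020",
    original_purpose="Clinical care and billing, not AI training",
    labeling_process="Diagnoses assigned by treating physicians",
)
print("Unanswered:", missing_fields(sheet))
```

Blocking training runs until this list is empty is one way to make provenance documentation enforceable rather than optional.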


Check Your Understanding What is the difference between selection bias and historical bias? Give a real-world example of each that we have not already discussed in this chapter. (Hint: Think about data from hiring, lending, education, or healthcare systems you are familiar with.)


🔁 Retrieval Prompt We have now covered five major forms of data bias: selection bias, historical bias, measurement bias, aggregation bias, and ghost data. Without looking back, try to define each one in your own words and give one example. Which one do you think is most difficult to detect? Why?


4.6 The Hidden Workforce: Who Labels the Data?

There is one more dimension of data that rarely makes headlines but deserves serious attention: the people who do the labeling.

The Scale of Data Labeling

Consider what it takes to train a system like ContentGuard. Millions of social media posts need to be reviewed and categorized. Each post might need multiple reviewers to ensure consistency. For a large platform, that is tens of millions of labeling decisions.

Who does this work?

In many cases, it is done by workers in the Global South — in Kenya, the Philippines, India, Venezuela — through outsourcing companies that pay between $1 and $3 per hour. These workers sit at computers for hours, reviewing content that is often disturbing: graphic violence, child exploitation material, hate speech, and more.

A 2023 investigation by TIME magazine revealed that workers in Kenya employed by a company called Sama, which was contracted by OpenAI to label data for ChatGPT's safety filters, were paid less than $2 per hour to read and categorize descriptions of graphic content, including sexual abuse, violence, and self-harm. Many workers reported lasting psychological trauma from the work.

This labor is essential to the AI systems that billions of people use every day. Without it, language models could not learn what content to refuse. Content moderation systems could not learn what posts to remove. Medical AI could not learn what a tumor looks like in a scan. Yet the workers who make this possible are largely invisible.

The Annotation Economy

The global data labeling market is worth billions of dollars and growing rapidly. It encompasses a range of work, from the disturbing content review described above to more routine tasks like drawing bounding boxes around objects in photos, transcribing audio, or sorting images into categories.

Some of this work is done through crowdsourcing platforms like Amazon Mechanical Turk — named after a famous 18th-century chess-playing "automaton" that turned out to have a human chess master hidden inside, an apt namesake for a marketplace where hidden human labor powers "artificial" intelligence. On these platforms, "requesters" post small tasks — called HITs, or Human Intelligence Tasks — and "workers" complete them for small payments, often pennies per task.

The working conditions vary enormously. Some data labeling jobs are relatively well-paid full-time positions at established companies. But much of the work is precarious gig labor with no benefits, no stability, and no recourse if a requester rejects your work without paying.

🔵 Perspective-Taking: Inside the Labeling Room Imagine you are a content moderator in Nairobi, working for a company that has contracted with a major AI firm. You earn about $1.50 per hour. Today, you will review hundreds of social media posts and classify each one. Some contain hate speech. Some contain graphic violence. A few contain images of child exploitation. You are required to view these images carefully enough to categorize them correctly.

Your work will directly train an AI system used by hundreds of millions of people. That system will be described in press releases as "cutting-edge technology." Your role will not be mentioned.

How does knowing this change the way you think about AI systems and their claims of being "automated"? What obligations, if any, do the companies that use this labor have to the workers who provide it?

Why This Matters for AI Literacy

You might wonder: why include a section about labor conditions in a chapter about data? Because understanding data requires understanding the full system that produces it. Data does not emerge from algorithms; it emerges from human activity, human judgment, and human labor. When we talk about an AI system as if it is a purely technical achievement, we erase the people who made it possible — from the engineers who designed it to the labelers who taught it to the users whose data trained it.

AI literacy means seeing the whole picture, including the parts that are deliberately kept out of view.


4.7 Chapter Summary

Let us step back and take stock of what we have covered.

Data is the foundation of every AI system. Without data, machine learning algorithms have nothing to learn from. The quality, completeness, and representativeness of training data directly determines the quality of the AI system built on it.

Data comes from somewhere, and that somewhere matters. Most AI training data comes from existing sources — internet text, institutional records, user activity — that were created for purposes other than AI training. Understanding data provenance helps you evaluate AI systems critically.

Data types shape AI capabilities. Structured data fits in spreadsheets; unstructured data includes text, images, and audio. Labeled data has human-assigned categories; unlabeled data does not. Supervised learning requires labeled data, which means it inherits the judgments embedded in those labels.

Labels are human judgments, not objective truths. The process of labeling data requires defining categories, making judgment calls about edge cases, and resolving disagreements — all of which embed human values and biases into the training data.

Bias enters data through multiple pathways. Selection bias, historical bias, measurement bias, aggregation bias, and ghost data can all distort what an AI system learns. These biases can create feedback loops that amplify existing inequities.

Data ethics involves consent, power, and labor. The people whose data trains AI systems often did not consent to that use. Data collection concentrates power in the hands of collectors. And the labor of data labeling — often done by low-paid workers in the Global South — is essential but invisible.

The threshold concept for this chapter bears repeating: data is never neutral. It encodes the world that created it.

Understanding this principle is your most powerful tool for evaluating any AI system you encounter. Whenever someone tells you an AI is "objective" or "data-driven," your first question should be: "What data? Collected by whom? Labeled how? And who is missing?"


Python Sidebar: Exploring a Dataset (Optional)

If you want to get your hands on some data, here is a quick exercise using Python's pandas library. This code loads a small, publicly available dataset and checks for representational gaps. You do not need to run this code to follow the rest of the book — it is purely optional.

```python
import pandas as pd

# Load a sample dataset (UCI Adult Income dataset, commonly used in fairness research)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols = ["age", "workclass", "fnlwgt", "education", "education_num",
        "marital", "occupation", "relationship", "race", "sex",
        "cap_gain", "cap_loss", "hours", "country", "income"]
df = pd.read_csv(url, names=cols, skipinitialspace=True)

# Check demographic representation
print("Race distribution:")
print(df["race"].value_counts(normalize=True).round(3))
print("\nSex distribution:")
print(df["sex"].value_counts(normalize=True).round(3))
```

When you run this, you will likely see that the dataset is heavily skewed toward white males. If an AI system trained on this data is used to make predictions about income, whose experiences does it capture best — and whose does it miss?


🗂️ AI Audit Report — Chapter 4 Checkpoint

It is time to add to your AI Audit Report. For the system you selected in Chapter 1, investigate the following:

  1. Data sources: What data does your AI system train on? Can you find documentation about its training data? If not, what does that absence itself tell you?
  2. Data type: Is the training data structured, unstructured, or a mix? Is it labeled? If so, who did the labeling?
  3. Representational gaps: Who might be underrepresented or missing from the training data? Think about demographics (age, race, gender, geography, language, disability status) and contexts (edge cases, unusual situations).
  4. Bias pathways: Using the five types of bias discussed in this chapter (selection, historical, measurement, aggregation, ghost data), identify at least two that could plausibly affect your system.
  5. Data ethics: Apply the four-question framework from Section 4.5 (consent, benefit, control, risk) to your system's data practices.

Add a "Data Analysis" section to your audit report with your findings. It is fine to have unanswered questions — in fact, documenting what you cannot find out is often as valuable as documenting what you can.


🔁 Spaced Review — Chapters 2 and 3 Connection Chapter 2 described AI winters — periods when AI failed to live up to its promises. One major cause of early AI failures was the absence of large-scale data. The data revolution of the 2000s (Section 2.4 in Chapter 2) is what made modern machine learning possible. But as this chapter shows, more data is not always better data. Quantity without quality, representativeness, and ethical sourcing creates new problems even as it solves old ones.

From Chapter 3, recall the distinction between supervised and unsupervised learning. Supervised learning requires labeled data — which means it requires all the human judgment and potential bias that labeling entails. How does this change the way you think about the "accuracy" of supervised learning systems?


Check Your Understanding

  1. A company building a resume-screening AI trains it on 10 years of its own hiring decisions. What type(s) of bias is this most likely to introduce? (Historical bias, selection bias)
  2. Why is removing a sensitive variable (like race or gender) from training data insufficient to prevent discrimination? (Proxy variables can reconstruct the removed attribute)
  3. What is ghost data, and why is it particularly dangerous for AI fairness? (Data about people or situations that were never collected; dangerous because the system does not know what it does not know)
  4. Name two things that "datasheets for datasets" would document. (Collection methodology, population represented, known limitations, labeling process, intended use, etc.)