Learning Objectives
- Trace the evolution of data collection practices from ancient civilizations to the present
- Explain how census data has historically been used as an instrument of state power
- Identify the key technological transitions that transformed the scale and nature of data collection
- Analyze the relationship between data collection and colonialism, warfare, and social control
- Evaluate how historical patterns of data use inform contemporary governance challenges
- Connect at least three historical precedents to current data ethics debates
In This Chapter
- Chapter Overview
- 2.1 Counting People: The Ancient Census
- 2.2 Colonial Statistics and the Classification of People
- 2.3 The Punch Card Revolution
- 2.4 The Cold War, Computers, and the Database State
- 2.5 The Internet Revolution
- 2.6 The Big Data Era
- 2.7 The AI Era
- 2.8 Patterns in the Historical Record
- 2.9 Chapter Summary
- What's Next
- Chapter 2 Exercises → exercises.md
- Chapter 2 Quiz → quiz.md
- Case Study: The IBM and Nazi Germany Controversy → case-study-01.md
- Case Study: The Cambridge Analytica Scandal in Historical Context → case-study-02.md
Chapter 2: A Brief History of Data and Society
"Those who cannot remember the past are condemned to repeat it." — George Santayana, The Life of Reason (1905)
Chapter Overview
In Chapter 1, we mapped the data that saturates daily life and introduced the question of governance. But the desire to count, classify, and control through information is not new. Governments have been collecting data on their populations for millennia — and the consequences of that collection have ranged from efficient resource distribution to genocide.
This chapter traces the arc of data and society from the ancient world to the present, not as a linear progress narrative but as a story of recurring patterns. Again and again, we'll see the same dynamics: a new capacity for data collection is developed, those in power use it to consolidate control, marginalized groups bear the costs, and governance mechanisms lag behind the technology. Recognizing these patterns is the first step toward breaking them.
In this chapter, you will learn to:
- Connect today's data governance challenges to their historical roots
- Identify the power dynamics embedded in seemingly neutral acts of counting and classification
- Recognize how technology amplifies existing social structures rather than creating new ones from scratch
- Evaluate claims that "this time is different" about data and technology
2.1 Counting People: The Ancient Census
The impulse to count populations is among the oldest acts of governance. The word "census" itself comes from the Latin censere — to assess, to estimate. But counting has never been neutral. To count a population is to make it legible, governable, taxable, and conscriptable.
2.1.1 Mesopotamia, Egypt, and Rome
The earliest known census records date to ancient Sumer, in the fourth millennium BCE, where clay tablets recorded livestock, grain stores, and laborers. These were not exercises in curiosity. They were instruments of resource allocation and tax collection — the original data governance challenge.
In ancient Egypt, the pharaohs conducted regular censuses to assess labor availability for monumental construction projects. The pyramids, in a sense, were built on data: counts of workers, grain inventories to feed them, and administrative records to coordinate the operation.
Rome formalized the census as a civic institution. Every five years, Roman citizens were required to register their families, property, and wealth with the censores. The results determined tax obligations, military service requirements, and voting classifications. A citizen who evaded the census — the incensus — could be sold into slavery.
Reflection: Notice that from the earliest records, data collection served the interests of the powerful — pharaohs, emperors, tax collectors. The people being counted rarely chose to be counted, and they had no say in how the counts were used. Does this pattern persist today?
2.1.2 The Domesday Book
In 1086, William the Conqueror commissioned the Domesday Book, a comprehensive survey of land ownership, resources, and population across England. The name itself — "Domesday," meaning "Day of Judgment" — reflects the finality of its authority. There was no appeal from its findings.
The Domesday Book was a data governance innovation: a centralized, standardized dataset that allowed the crown to know exactly what it owned, what it could tax, and what resources it could mobilize. It was also an instrument of conquest — compiled by Norman administrators cataloguing Anglo-Saxon property for redistribution.
As the medieval historian V.H. Galbraith noted, the Domesday Book was "not a record of what ought to be, but of what is" — but "what is" was defined entirely from the conqueror's perspective.
2.2 Colonial Statistics and the Classification of People
The relationship between data collection and power became particularly stark during the era of European colonialism. Colonial administrators didn't just count populations — they classified them, creating categories of race, ethnicity, caste, and tribe that continue to shape politics and identity today.
2.2.1 The Colonial Census as a Tool of Control
In British India, the decennial census beginning in 1871 didn't simply record existing social categories — it hardened fluid social identities into rigid administrative classifications. Caste identities that had been contextual and negotiable became fixed bureaucratic categories. As the historian Nicholas Dirks argues, the colonial census "created" caste as a pan-Indian system in ways that pre-colonial social organization never had.
The consequences were profound: census categories determined access to education, government employment, and political representation. Groups fought to be reclassified because their census designation directly affected their material circumstances. The data wasn't merely describing society — it was constructing it.
Connection: As we saw in Chapter 1 (Section 1.1.1), the act of categorization is never neutral. Colonial censuses provide a historical warning about what happens when classification systems are imposed by those with power on those without it — a dynamic we'll encounter again in Chapter 14 when we examine algorithmic bias.
2.2.2 Statistics and Scientific Racism
The 19th century saw the rise of statistics as a discipline — the word itself derived from the German Statistik, meaning the science of the state. Statistical methods were developed partly to manage colonial populations and partly to support the pseudoscience of racial classification.
Francis Galton — a pioneer of statistical methods including regression and correlation — was also the founder of eugenics. He developed his statistical tools in service of a project to quantify human "worth" and breed a superior race. The regression line and the correlation coefficient were born in a context of racial hierarchy.
This history does not invalidate these statistical tools any more than the military origins of the internet invalidate email. But it does demand that we approach quantitative methods with awareness of their origins and attentiveness to the ways seemingly neutral mathematics can encode discriminatory assumptions.
Eli brought this up in class during week three. "Every time someone says 'the data is objective,' I want to ask them: objective according to whom? The data Galton collected was 'objective.' His measurements were precise. His math was correct. And his conclusions were used to justify forced sterilization."
Dr. Adeyemi responded: "Which is why this course isn't just about data. It's about the questions we ask of data, the categories we impose on it, and the power structures that determine who gets to do the asking and imposing."
2.3 The Punch Card Revolution
2.3.1 Hollerith and the 1890 Census
The 1880 U.S. Census took nearly eight years to tabulate by hand. By the time the results were ready, they were almost obsolete. Herman Hollerith, a Census Bureau employee, developed an electromechanical punch card system that tabulated the 1890 Census in just one year.
Hollerith's innovation was foundational: data encoded on physical cards that could be sorted, counted, and cross-tabulated by machine. His company eventually merged with others to form the Computing-Tabulating-Recording Company, which in 1924 became International Business Machines — IBM.
The punch card didn't just accelerate counting. It transformed what was possible to count. Suddenly, governments could cross-tabulate multiple variables: occupation by ethnicity by geography by age. The granularity of population knowledge increased dramatically — and so did the potential for both beneficial governance and targeted oppression.
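To make the leap concrete, here is a minimal Python sketch of cross-tabulation, the operation Hollerith's sorters and tabulators performed mechanically, card by card. The census records below are invented for illustration.

```python
from collections import Counter

# Hypothetical census records: (occupation, district) pairs.
# Each tuple stands in for one punch card.
records = [
    ("farmer", "north"), ("clerk", "south"), ("farmer", "north"),
    ("farmer", "south"), ("clerk", "north"), ("clerk", "south"),
]

# Cross-tabulate occupation by district: count each combination of values.
crosstab = Counter(records)

print(crosstab[("farmer", "north")])  # 2 farmers in the north district
print(crosstab[("clerk", "south")])   # 2 clerks in the south district
```

What took rooms of clerks in 1880 is one line of code today; the governance point is that every added variable multiplies the number of subgroups a government can single out.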
2.3.2 IBM and the Holocaust
The darkest chapter in the history of data technology came when the Nazi regime used IBM's punch card systems to facilitate the Holocaust. As documented by Edwin Black in IBM and the Holocaust (2001), the Dehomag subsidiary of IBM provided custom-designed punch card systems that enabled:
- Census and registration of the Jewish population
- Tracking of individuals through deportation and transport
- Management of concentration camp logistics
- Cross-referencing of employment, property, and ancestry records
The Nazis could not have conducted the Holocaust with the same ruthless efficiency without data technology. The punch card didn't cause the genocide — that required ideology, political power, and human cruelty. But it enabled the systematic character of the killing: the ability to identify, locate, transport, and track millions of individuals.
Ethical Dimensions: This case raises a question that runs through the entire history of data technology: What is the responsibility of the technology provider? IBM supplied machines and expertise to the Nazi regime, profiting from contracts that facilitated genocide. The company's defenders argued that IBM merely sold a general-purpose technology; its critics argued that the company customized its systems for the specific needs of the Nazi census and knew how they would be used. We'll return to this question of technology provider responsibility in Chapter 17 (Accountability) and Chapter 29 (Responsible AI Development).
2.4 The Cold War, Computers, and the Database State
2.4.1 The Birth of Digital Computing
World War II accelerated the development of electronic computing — from Colossus (used to crack German ciphers at Bletchley Park) to ENIAC (used for ballistic trajectory calculations at the University of Pennsylvania). After the war, these machines migrated from military to civilian use, and the first commercial databases emerged.
The shift from mechanical punch cards to electronic databases was not merely a change in speed. It was a change in kind. Electronic databases could:
- Store far more data in far less space
- Search and retrieve specific records almost instantly
- Link records across different databases
- Update information in real time
Each of these capabilities had governance implications. The ability to link databases, in particular, created new possibilities for surveillance and control — and new anxieties.
2.4.2 The National Data Center Debate
In 1965, the U.S. Bureau of the Budget proposed a National Data Center that would consolidate statistical data from multiple federal agencies into a single computer system. The proposal was framed in terms of efficiency and better policy-making.
The backlash was immediate and fierce. Congressional hearings in 1966-1968 raised concerns about government surveillance, individual privacy, and the concentration of information power. Representative Cornelius Gallagher warned: "Once you have all the information about a person in one place, you have extraordinary power over that person."
The National Data Center was never built. But the debate it generated led to the first modern data protection legislation: the Fair Credit Reporting Act (1970), which gave individuals the right to access and dispute data held about them by credit agencies, and the Privacy Act of 1974, which regulated how federal agencies collected and used personal information.
Intuition: The 1960s National Data Center debate is the direct ancestor of today's debates about government data consolidation, national ID systems, and centralized digital identity. The arguments — efficiency vs. privacy, convenience vs. control — are remarkably unchanged. What's changed is the scale.
2.4.3 The Rise of Credit Scoring
While the government data center failed politically, the private sector was quietly building exactly what privacy advocates feared — centralized databases of personal information used to make consequential decisions.
The Fair Isaac Corporation (now FICO) introduced the first widely used credit score in 1989. The FICO score collapsed a person's financial history into a single three-digit number that determined their access to credit, housing, and often employment.
Credit scoring illustrates several principles that will recur throughout this book:
- Reduction: A complex financial life reduced to a single number between 300 and 850
- Opacity: Most consumers had no idea how the score was calculated or what factors were weighted
- Consequentiality: The score determined life-altering outcomes (mortgage approval, rental housing, car loans, and eventually job offers)
- Disparate impact: Historical patterns of racial discrimination in lending were encoded in the data used to train scoring models
Mira found the credit scoring history fascinating — and unsettling. "VitraMed is basically building a health score," she realized. "A single number that predicts patient risk. We think of it as helpful. But it's doing the same thing FICO did — collapsing a complex life into a number and letting that number make decisions."
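Mira's observation can be made concrete with a deliberately oversimplified sketch of score reduction. The inputs, weights, and scaling below are invented for illustration; real scoring models are proprietary and far more complex. The point is only the shape of the operation: many attributes in, one consequential number out.

```python
def toy_credit_score(on_time_ratio, utilization, years_of_history):
    """Collapse three facts about a financial life into one number in 300-850.

    All weights are hypothetical, chosen only to illustrate "reduction".
    """
    raw = (0.5 * on_time_ratio                     # payment history
           + 0.3 * (1 - utilization)               # lower utilization scores higher
           + 0.2 * min(years_of_history / 20, 1))  # history length, capped at 20 years
    return round(300 + 550 * raw)

print(toy_credit_score(0.98, 0.25, 12))  # 759
print(toy_credit_score(0.70, 0.90, 2))   # 520
```

Notice what the function cannot see: a medical emergency behind the missed payments, a thin file caused by cash-only work. Everything outside the chosen inputs simply does not exist for the model, which is the opacity and disparate-impact problem in miniature.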
2.5 The Internet Revolution
2.5.1 From Network to Platform
The internet's origin as a military research network (ARPANET, 1969) and its evolution into a commercial platform (the World Wide Web, 1991) is a well-told story. Less often told is the story of how the internet transformed data collection from an institutional activity — conducted by governments and large corporations — into a universal one.
In the pre-internet era, your data footprint was relatively small: government records, financial records, employer records, medical records, perhaps a few loyalty program memberships. The data was siloed — your bank didn't know your medical history, your doctor didn't know your purchase patterns.
The internet collapsed those silos. By the early 2000s, a single company — Google — could know your search interests, email contents, calendar appointments, physical location, and browsing history. By the 2010s, Facebook knew your social relationships, political views, life events, and emotional states — often more accurately than your closest friends.
2.5.2 The Surveillance Business Model
The internet was not destined to become a surveillance infrastructure. Early visions of the web — Tim Berners-Lee's original proposal, the cyberlibertarian manifestos of the 1990s — imagined a decentralized, empowering technology. What happened instead was the discovery that personal data could be monetized through targeted advertising.
Google's innovation was not search — several search engines predated it. Google's innovation was realizing that the data generated by searches — what people wanted, when they wanted it, and where they were — was extraordinarily valuable to advertisers. As Shoshana Zuboff argues in The Age of Surveillance Capitalism (2019), Google pioneered the conversion of "behavioral surplus" — the data exhaust of user activity, beyond what was needed to improve the service — into prediction products sold to advertisers.
This model was replicated by Facebook, Amazon, and eventually most of the internet economy. The result was a fundamental shift in the data landscape:
| Era | Data Collector | Data Subject | Relationship |
|---|---|---|---|
| Pre-internet | Government, employers, banks | Citizens, employees, customers | Relatively transparent; regulated by sector |
| Early internet | Websites, ISPs | Users | Emerging; lightly regulated |
| Platform era | Google, Facebook, Amazon, data brokers | Everyone | Opaque; poorly regulated; asymmetric |
Common Pitfall: It's tempting to blame "technology" for the surveillance business model. But the technology is not the cause — the business model is. The internet could have been funded through subscriptions, public investment, micropayments, or other mechanisms. The choice to fund it through advertising-driven data extraction was a human decision made by specific companies and investors. Understanding this is crucial for imagining alternatives — as we'll explore in Chapter 39.
2.5.3 Web 2.0 and User-Generated Data
The transition to "Web 2.0" in the mid-2000s — from a read-only web to a read-write web — dramatically expanded the volume and intimacy of data generation. Social media platforms invited users to share their thoughts, photos, locations, relationships, and life events. The data was no longer just exhaust from searches and purchases — it was the product of people's creative and social labor.
This created a new dynamic: users became simultaneously the product (their data sold to advertisers), the content creators (their posts attracting other users), and the raw material for optimization (their engagement data used to refine the targeting algorithms that made the advertising more profitable). The power asymmetry — the first of our recurring themes — was built into the architecture.
2.6 The Big Data Era
2.6.1 Volume, Velocity, Variety
Around 2010, the term "Big Data" entered mainstream discourse, typically defined by the "three Vs":
- Volume: The sheer amount of data being generated (by 2025, estimated at 463 exabytes per day globally)
- Velocity: The speed at which data is generated and must be processed (real-time streams, not batch processing)
- Variety: The diversity of data types (structured, unstructured, images, audio, sensor data, social media)
Some frameworks add a fourth V — Veracity (the accuracy and reliability of data) — and a fifth — Value (what the data is worth once analyzed).
Big Data was heralded as a revolution in everything from healthcare to urban planning to scientific discovery. And in many cases, it delivered: genomic research accelerated, traffic patterns were optimized, and epidemiological surveillance improved.
But Big Data also amplified every governance challenge we've discussed:
- More data meant more potential for privacy violations
- Faster processing meant less time for ethical review
- Greater variety meant more unstructured data resistant to traditional regulation
- And the promise of value created incentives to collect everything, regardless of whether it was needed
2.6.2 Predictive Analytics and the Shift to Preemption
Big Data's most consequential application may be the shift from descriptive to predictive analytics — from analyzing what has happened to predicting what will happen next.
This shift appears in nearly every domain:
- Policing: From responding to crimes to predicting where crimes will occur (predictive policing)
- Insurance: From insuring populations to pricing individuals based on predicted risk
- Hiring: From evaluating applicants' past performance to predicting their future performance
- Healthcare: From treating conditions to predicting who will develop them
Each of these applications raises the same question: On what basis can we act on a prediction about someone who hasn't done anything yet? Predictive policing sends more officers to neighborhoods flagged as high-risk, which leads to more arrests in those neighborhoods, which generates more data reinforcing the prediction — a feedback loop that targets the already-targeted.
Eli lived this. "I watched them install the ShotSpotter sensors in my neighborhood," he told the class. "Two weeks later, police responses to 'detected gunshots' in our area went up by 400%. Not because there were more gunshots — but because now every car backfire, firework, and slamming dumpster triggered a response. The data said the neighborhood was dangerous. The data was measuring the sensors, not the neighborhood."
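The feedback loop Eli describes can be sketched in a few lines. In this toy model, invented purely for illustration, two neighborhoods have identical true incident rates; the only difference is a small initial disparity in the recorded data. Patrols are allocated in proportion to recorded incidents, and recording is proportional to patrols.

```python
TRUE_RATE = 100   # actual incidents per period, identical in A and B
PATROLS = 50      # total patrol units allocated each period

recorded = {"A": 12.0, "B": 10.0}   # small initial disparity in the data
for period in range(30):
    total = recorded["A"] + recorded["B"]
    new = {}
    for hood in ("A", "B"):
        patrols = PATROLS * recorded[hood] / total   # allocate by the data
        detection = min(0.02 * patrols, 1.0)         # more patrols, more recording
        new[hood] = TRUE_RATE * detection
    recorded = new

print(recorded)  # A still "looks" 20% more dangerous, forever
```

The model never rediscovers the truth that the underlying rates are equal, because it only measures where it looks. The initial disparity, however it arose, is preserved indefinitely by data-driven allocation.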
2.7 The AI Era
2.7.1 Machine Learning and the New Data Hunger
The current era is defined by the rise of artificial intelligence — specifically, machine learning systems that learn patterns from data rather than following explicit rules.
Machine learning's appetite for data is orders of magnitude greater than previous technologies. Training a large language model like GPT-4 required ingesting hundreds of billions of words of text scraped from the internet. Training an image recognition system requires millions of labeled photographs. Training a self-driving car requires millions of miles of driving data.
This data hunger has transformed the relationship between data and power. Companies that control the largest datasets have the greatest capacity to develop AI systems, which in turn generates more data, which further entrenches their advantage. The result is an unprecedented concentration of data power among a small number of firms — primarily in the United States and China.
2.7.2 Generative AI and the Data Provenance Crisis
The emergence of generative AI — systems that can produce text, images, audio, and video that are difficult to distinguish from human-created content — has introduced a new dimension to the data-society relationship.
Generative AI models are trained on massive datasets, much of it created by humans who never consented to its use for AI training. Artists whose work was scraped from the internet to train image generators, writers whose books and articles were ingested by language models, musicians whose compositions were used to train audio generators — all have raised fundamental questions about data ownership, consent, and compensation.
Meanwhile, generative AI is producing synthetic data at an accelerating rate — AI-generated text appearing in news articles, AI-generated images appearing in social media, AI-generated code appearing in software. This creates a provenance crisis: we can no longer always tell whether the data we encounter was produced by humans or machines, and the training data for future AI systems increasingly includes output from previous AI systems.
Connection: We'll explore the specific ethical challenges of generative AI in depth in Chapter 18. For now, note how this latest development continues the historical pattern: a new data technology emerges, its applications expand faster than governance mechanisms can respond, and the power asymmetry between those who control the technology and those who are affected by it grows wider.
2.8 Patterns in the Historical Record
Looking across this history, several patterns emerge — patterns that are directly relevant to contemporary data governance.
2.8.1 Four Recurring Dynamics
1. The Ratchet Effect. Data collection capabilities expand but rarely contract. The census expands from a population count to a socioeconomic survey. Punch cards enable cross-tabulation. Databases enable linking. The internet enables real-time surveillance. Each new capability becomes the baseline for the next expansion. Governments and corporations rarely un-collect data.
2. Dual Use. Every data technology has been used for both beneficial and harmful purposes, often simultaneously. The same census that allocates public resources can identify minority populations for persecution. The same credit scoring system that expands financial access can perpetuate racial discrimination. The same AI system that accelerates drug discovery can generate disinformation. The question is never "is this technology good or bad?" but "who decides how it's used, and on whose behalf?"
3. The Governance Lag. Governance consistently lags behind technological capability, often by decades. The internet commercialized in the 1990s; the EU's comprehensive data protection regulation, the GDPR, didn't take effect until 2018. The National Data Center debate of the 1960s produced the Privacy Act of 1974 — a decade later. Today, AI governance is racing to catch up with systems already deployed at scale.
4. The Burden Falls Downward. The costs and risks of data systems are consistently borne disproportionately by those with the least power: colonial subjects, marginalized racial groups, low-income communities, and the Global South. The benefits accrue disproportionately to those with the most: states, corporations, and wealthy nations.
2.8.2 What History Teaches Us
History does not predict the future with precision, but it does reveal what is possible — both the harm that unchecked data collection can enable and the governance innovations that communities and governments have developed in response.
The Fair Information Practice Principles (FIPPs), first articulated in 1973, emerged from the controversies of the database era. Data protection authorities in Europe emerged from the post-war determination to prevent state identification systems from ever again enabling genocide. The GDPR emerged from decades of privacy advocacy in the face of platform capitalism.
"The history isn't just a warning," Dr. Adeyemi told her class. "It's also a source of hope. Every era of unchecked data power has eventually produced a governance response. Our task is to shorten the lag — to build governance capacity before the harm becomes catastrophic, not after."
2.9 Chapter Summary
Key Concepts
- Data collection has been an instrument of state power since ancient civilizations, from censuses to colonial classification systems
- The punch card revolution transformed the scale of data processing and enabled systematic atrocities, including the Holocaust
- The Cold War computing era produced the first modern data protection debates and legislation
- The internet transformed data collection from an institutional activity to a universal one, funded by the surveillance business model
- Big Data introduced predictive analytics, shifting from describing what happened to predicting what will happen
- The AI era's data hunger has concentrated data power among a small number of companies and introduced a provenance crisis through generative AI
Key Debates
- Is technological determinism a useful framework, or does it obscure the human choices that shape technology's effects?
- Are historical analogies (e.g., comparing surveillance capitalism to colonialism) illuminating or misleading?
- Does the governance lag reflect a fundamental structural problem or a solvable political one?
Applied Framework
When encountering a data governance debate, ask:
1. What is the historical precedent for this situation?
2. Who benefited from the precedent, and who bore the costs?
3. What governance response eventually emerged?
4. How is the current situation similar to and different from the historical precedent?
5. What does the precedent suggest about likely outcomes if governance is not developed?
What's Next
In Chapter 3: Who Owns Your Data?, we'll move from history to one of the most contested questions in data governance: the question of ownership. Who has rights over the data generated by your body, your behavior, your creative work, and your digital life? The answers vary dramatically depending on the legal tradition, the type of data, and the theory of ownership applied — and they have profound practical consequences.
Before moving on, complete the exercises and quiz to solidify your understanding of the historical patterns discussed in this chapter.