
---
description: "from government statistical agencies and campaign voter files to academic surveys, media organizations, and civic technology platforms."
prerequisites:
  - "Chapter 1: The Age of Political Data"
  - "Chapter 2: A Brief History of Polling and Political Measurement"
learning_objectives:
  - "Identify the major producers and consumers of political data in the United States"
  - "Describe the structure and contents of government data sources including the Census, BLS, and FEC"
  - "Explain how voter files are constructed, maintained, and used by campaigns and researchers"
  - "Distinguish between different types of political data: administrative, survey, observational, and digital"
  - "Analyze the access barriers and power dynamics that shape who can use political data"
  - "Evaluate the role of civic technology organizations in democratizing data access"
key_terms:
  - "voter file"
  - "data ecosystem"
  - "administrative data"
  - "Census"
  - "American Community Survey (ACS)"
  - "Federal Election Commission (FEC)"
  - "American National Election Studies (ANES)"
  - "Cooperative Election Study (CES)"
  - "data broker"
  - "voter registration"
  - "open data"
  - "civic technology"
  - "data infrastructure"
  - "metadata"
estimated_time: "4 hours"
difficulty: "beginner"
subject_categories: ["social-behavioral", "quantitative-technical", "practical-skills"]
---


Chapter 3: The Political Data Ecosystem

Opening Scene: Adaeze Builds a Map

On a Tuesday morning in March, four years before the Garza-Whitfield Senate race, Adaeze Nwosu sat in a downtown co-working space with a legal pad, a laptop, and a slowly cooling cup of coffee. She had quit her job as a data journalist at a major national news organization two weeks earlier. She had a modest personal savings account, a small planning grant from a civic technology foundation, and an idea that everyone she respected told her was impractical.

The idea was simple: build a nonprofit organization that would make political data accessible to ordinary people. Not to campaigns, which already had data operations worth millions. Not to academics, who had institutional access to datasets and the statistical training to use them. Not to journalists, who at least had colleagues who could query databases and build visualizations. But to the citizen who wanted to know who was funding their state representative. To the community organizer who needed demographic data to make a case for a new bus route. To the high school teacher who wanted to show students how to look up their own voter registration.

Adaeze had spent a decade watching newsrooms build extraordinary data projects---interactive maps of campaign finance, searchable databases of lobbying records, tools that let readers explore election results down to the precinct level. She had built some of those tools herself. But she had also watched those tools disappear when newsrooms cut staff, restructured, or shut down entirely. The journalism industry's economic crisis meant that the infrastructure of civic data was built on sand.

"The problem is not that the data does not exist," she wrote on her legal pad that morning. "The problem is that the data is scattered across dozens of agencies, formats, and access systems, with no common interface, no consistent standards, and no one responsible for making it usable."

Then she drew a map. On the left side, she listed every source of political data she could think of: the Census Bureau, the Bureau of Labor Statistics, the Federal Election Commission, state election offices, county registrars, campaign finance databases, legislative tracking systems, court records, polling firms, academic surveys, social media platforms, news archives. On the right side, she listed the people who needed that data: voters, journalists, researchers, teachers, community organizers, advocacy groups, small campaigns that could not afford expensive data consultants.

In the middle, she drew a gap---a wide, empty space representing the infrastructure that did not yet exist. The infrastructure that would translate raw government data into usable tools. The infrastructure that would connect scattered datasets into a coherent picture. The infrastructure that would make the political data ecosystem navigable by someone who was not a data scientist.

She wrote two words above the gap: "Build this."

OpenDemocracy Analytics was born.

This chapter is about the ecosystem that Adaeze mapped---the vast, complex, sometimes chaotic landscape of political data in the United States. By the end of this chapter, you will have your own map: a comprehensive understanding of who produces political data, where it lives, how it flows, who controls access, and why these questions matter for democracy.

3.1 What Is a Data Ecosystem?

Before we map the political data ecosystem, let us define the concept. An ecosystem, in the biological sense, is a community of organisms interacting with each other and with their physical environment. A data ecosystem is analogous: a network of organizations, technologies, and practices that produce, store, manage, distribute, and consume data. Like a biological ecosystem, a data ecosystem has producers and consumers, flows and cycles, symbiotic relationships and competitive dynamics.

The political data ecosystem in the United States includes:

  • Producers: Organizations that generate or collect political data---government agencies, campaigns, polling firms, academic researchers, social media platforms, news organizations.
  • Intermediaries: Organizations that aggregate, process, enrich, or redistribute political data---data brokers, data vendors, civic tech nonprofits, libraries, archives.
  • Consumers: People and organizations that use political data for analysis, decision-making, reporting, or advocacy---campaigns, journalists, researchers, citizens, advocacy groups.
  • Infrastructure: The technologies, standards, formats, and legal frameworks that enable data to flow between producers and consumers---databases, APIs, file formats, open data policies, freedom of information laws, privacy regulations.

A key feature of any data ecosystem is that the same organization can play multiple roles. The Census Bureau is a producer (it collects demographic data) and an intermediary (it processes and publishes that data for public use). A campaign is a consumer (it uses voter files and polling data) and a producer (it generates contact data through canvassing and phone banks). A civic tech organization like OpenDemocracy Analytics (ODA) is an intermediary (it aggregates and reformats data from multiple sources) and a consumer (it uses that data for its own analysis and reporting).

💡 Intuition: Think of the political data ecosystem like a river system. Data flows from sources (government agencies, campaigns, platforms) through channels (databases, APIs, websites) to users (analysts, journalists, citizens). Some channels are wide and accessible; others are narrow and gated. Some are well-maintained; others are silted up with outdated formats and broken links. Mapping this system is the first step toward navigating it effectively.

3.2 Government Data: The Foundation

The largest and most important producers of political data in the United States are government agencies. The data they produce is, with some exceptions, publicly available---created with taxpayer money and, at least in principle, accessible to all. In practice, as Adaeze discovered, "publicly available" and "practically accessible" are not the same thing.

The Census Bureau

The United States Census Bureau is the single most important source of demographic and geographic data for political analysis. Its products include:

The Decennial Census. Conducted every ten years (most recently in 2020), the decennial Census attempts to count every person living in the United States. It collects basic demographic information---age, sex, race, ethnicity, household structure---for the entire population. The Census is constitutionally mandated (Article I, Section 2) because its results determine the apportionment of seats in the U.S. House of Representatives and, consequently, the allocation of Electoral College votes.

For political analysts, the decennial Census is the foundational dataset. It provides the population counts used to draw congressional and legislative district boundaries (redistricting), allocate federal funding, and establish the demographic benchmarks against which surveys and voter files are compared. When a pollster weights a survey sample to match "known population parameters," those parameters often come from the Census.
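The weighting step mentioned above can be sketched in a few lines. This is a minimal post-stratification example on a single variable; the education shares and sample counts are invented for illustration, not real ACS or survey figures:

```python
# Sketch: post-stratification weighting against a Census benchmark.
# All numbers below are illustrative, not real ACS or survey figures.

# Population shares for an education benchmark (hypothetical ACS values)
population_share = {"no_college": 0.62, "college": 0.38}

# Unweighted group counts in a hypothetical survey sample
sample_counts = {"no_college": 480, "college": 520}
n = sum(sample_counts.values())

# Each respondent in group g gets weight = pop_share(g) / sample_share(g)
weights = {
    g: population_share[g] / (sample_counts[g] / n)
    for g in sample_counts
}

# College respondents are overrepresented here, so their weight is < 1;
# no-college respondents are underrepresented, so their weight is > 1.
print(weights)
```

Real survey weighting rakes across many variables at once (age, race, education, region), but the core move is the same: inflate underrepresented groups and deflate overrepresented ones so the weighted sample matches the benchmark.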

The American Community Survey (ACS). While the decennial Census provides a basic headcount, the ACS provides the detailed demographic, economic, and housing data that political analysts use most frequently. Conducted continuously (with results published annually), the ACS collects information on education, income, employment, language use, immigration status, commuting patterns, housing costs, health insurance coverage, and dozens of other variables.

The ACS is available at multiple geographic levels: national, state, county, Census tract, and (with some limitations) Census block group. This granularity makes it invaluable for political analysis. Want to know the median household income in a specific state legislative district? The ACS can tell you. Want to compare educational attainment across Congressional districts? The ACS has the data.

Geographic products. The Census Bureau also produces the geographic files---boundary files for states, counties, cities, Census tracts, and blocks---that underpin all political mapping. Without Census geography, you could not draw a precinct map, calculate turnout by neighborhood, or analyze the demographic composition of a congressional district.

📊 Real-World Application: When Nadia Osei builds a demographic profile of the state where the Garza-Whitfield race is taking place, her starting point is ACS data. She downloads population estimates by age, race, ethnicity, education, and income for every county and Census tract in the state. She then merges this data with the voter file to estimate the demographic composition of registered voters versus the total population---because the people who are registered to vote are not a perfect mirror of the people who live in the state.
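The merge Nadia performs can be sketched with plain Python dictionaries. Everything here is invented for illustration (the GEOIDs, income figures, and voter IDs are hypothetical); a real pipeline would first geocode each address to a Census tract and would handle millions of rows:

```python
# Sketch: append tract-level ACS estimates to voter-file records via a
# shared Census tract GEOID. All values are hypothetical.

acs_by_tract = {  # GEOID -> ACS estimates (illustrative values)
    "48453001100": {"median_income": 54200, "pct_college": 0.31},
    "48453001200": {"median_income": 88900, "pct_college": 0.58},
}

voter_file = [  # minimal voter-file rows, already geocoded to a tract
    {"voter_id": "A001", "tract": "48453001100"},
    {"voter_id": "A002", "tract": "48453001200"},
    {"voter_id": "A003", "tract": "48453001200"},
]

# Left-join: each voter inherits the estimates for their tract
enriched = [{**v, **acs_by_tract.get(v["tract"], {})} for v in voter_file]

for row in enriched:
    print(row["voter_id"], row.get("median_income"))
```

Note the limitation this sketch makes visible: the appended values describe the voter's *neighborhood*, not the voter, which is exactly why tract-level enrichment must be interpreted carefully.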

The Bureau of Labor Statistics

The Bureau of Labor Statistics (BLS) produces the economic data that is central to political analysis. Its most important products include:

The unemployment rate. Released monthly, the unemployment rate is one of the most politically consequential statistics in America. It shapes media coverage, public mood, and voting behavior. As you will learn in Chapter 18, economic indicators like unemployment are among the strongest predictors of election outcomes in fundamentals models.

The Consumer Price Index (CPI). A measure of inflation that tracks changes in the prices of goods and services over time. Inflation is a politically potent issue---voters feel it directly in their grocery bills and gas prices---and CPI data is used in countless political analyses.

Employment and wage data. Detailed data on employment by industry, occupation, and geography, as well as wage and earnings data. These datasets allow analysts to assess the economic conditions that shape the political environment.

The Federal Election Commission

The Federal Election Commission (FEC) is the regulatory body that oversees campaign finance for federal elections. Its most important contribution to the political data ecosystem is the public disclosure of campaign financial records.

FEC filings include:

  • Candidate financial reports: How much money each federal candidate has raised, from what sources, and how they have spent it.
  • Individual contributions: Contributions from an individual that aggregate to more than $200 in an election cycle are itemized and publicly disclosed, along with the donor's name, address, employer, and occupation.
  • PAC and party committee reports: Fundraising and spending by political action committees, super PACs, party committees, and other political organizations.
  • Independent expenditure reports: Spending by outside groups (super PACs, 501(c)(4) organizations) for or against specific candidates.

FEC data is one of the most-used political datasets in the country. Journalists use it to track who is funding campaigns. Researchers use it to study the relationship between money and political outcomes. Campaign operatives use it to assess the financial strength of their opponents. And citizens use it---or could use it, with the right tools---to understand who is spending money to influence their elections.

⚖️ Ethical Analysis: FEC disclosure requirements reflect a democratic principle: citizens have a right to know who is trying to influence their elections. But disclosure has limits. Contributions that aggregate to $200 or less per cycle are not itemized. Donations to certain nonprofit organizations (501(c)(4)s) that spend money on politics are not disclosed at all, a practice known as "dark money." And the sheer volume of FEC data---millions of records per election cycle---means that meaningful analysis requires technical skills that most citizens lack. OpenDemocracy Analytics was founded in part to address this accessibility gap.
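A first pass at itemized FEC data is usually a group-and-sum by donor. A minimal sketch, using invented records (real bulk itemized filings are public but run to millions of rows per cycle):

```python
# Sketch: totaling itemized contributions by donor. The records below
# are invented for illustration only.
from collections import defaultdict

itemized = [
    {"donor": "DOE, JANE", "employer": "ACME CORP", "amount": 500.0},
    {"donor": "DOE, JANE", "employer": "ACME CORP", "amount": 250.0},
    {"donor": "ROE, RICHARD", "employer": "SELF", "amount": 1000.0},
]

totals = defaultdict(float)
for record in itemized:
    totals[record["donor"]] += record["amount"]

# Only contributions aggregating over $200 per cycle are itemized, so
# totals computed this way undercount small-dollar giving by design.
print(dict(totals))
```

In practice the hard part is not the sum but the entity resolution: the same donor can appear under several spellings, addresses, and employers, so real analyses cluster records before aggregating.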

Other Federal Data Sources

The political data ecosystem also draws on data from:

  • The Department of Justice: Voting rights enforcement data, including information on redistricting litigation and Voting Rights Act compliance.
  • The Department of Education: Data on school demographics, funding, and outcomes that intersects with education policy debates.
  • The Centers for Disease Control and Prevention: Public health data that became acutely political during the COVID-19 pandemic.
  • The Environmental Protection Agency: Environmental data that connects to climate policy debates.
  • The Federal Reserve: Economic data on interest rates, monetary policy, and financial conditions.

Each of these agencies produces data that is publicly available but varies widely in format, accessibility, and ease of use. Some agencies maintain excellent data portals with user-friendly interfaces and well-documented APIs. Others publish data in formats that have not been updated since the 1990s.

State and Local Government Data

Below the federal level, state and local governments produce an enormous volume of politically relevant data, much of it poorly standardized and difficult to access:

State election offices maintain voter registration databases, election results, and candidate filing information. The quality, completeness, and accessibility of these records vary enormously from state to state. Some states publish their voter file online for free; others charge fees ranging from $25 to $25,000, and some restrict access to political parties and candidates.

County registrars and election boards manage the granular mechanics of elections: precinct boundaries, polling place locations, ballot design, and precinct-level results. Accessing this data often requires direct requests to county officials, and the format varies from county to county.

State legislatures publish roll-call votes, committee records, and bill texts, but the accessibility and searchability of these records are inconsistent. Some states have excellent digital legislative tracking systems; others are years behind.

Local courts produce records of criminal cases, civil suits, and judicial decisions that are relevant to criminal justice policy, judicial selection, and legal disputes over voting rights.

School districts publish data on enrollment, demographics, test scores, and budgets that intersect with education policy debates. In many communities, school board elections are among the most contested local races, and the data surrounding them is often the most difficult to find and analyze.

The fragmentation of state and local data creates not just analytical challenges but democratic ones. A citizen who wants to understand how their local government spends money, how their school district compares to neighboring ones, or how their state legislator voted on a particular bill must navigate a maze of websites, databases, and public records requests, each with its own interface, format, and level of completeness. This navigation requires time, technical skill, and persistence---resources that are not equally distributed in the population. The result is that the citizens who are best able to hold their local government accountable are often those who are already most privileged, while the citizens who most need government accountability---low-income communities, communities of color, immigrant communities---face the greatest barriers to accessing the data that would make accountability possible.

⚠️ Common Pitfall: One of the biggest challenges in political data analysis is the lack of standardization across state and local jurisdictions. There is no single, national voter registration database. There is no uniform format for precinct-level election results. There is no consistent system for linking voter records to Census geography. This fragmentation means that every analysis involving state or local data requires substantial data cleaning, reformatting, and cross-referencing---work that is tedious, error-prone, and essential.
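Much of the cleaning work this pitfall describes is mundane string normalization. A sketch of one such step, mapping inconsistently spelled county names onto standard FIPS codes (the lookup table is abbreviated and the helper function is hypothetical):

```python
# Sketch: normalize county names from different state files to FIPS codes.
# The name-to-FIPS table is abbreviated and illustrative.

COUNTY_FIPS = {"TRAVIS": "48453", "HARRIS": "48201"}

def normalize_county(raw):
    """Strip whitespace, case, and a trailing 'County' before lookup."""
    key = raw.strip().upper().removesuffix(" COUNTY").strip()
    return COUNTY_FIPS.get(key)  # None when the name is unrecognized

# Three spellings of the same county, as three files might render it
assert normalize_county("Travis County") == "48453"
assert normalize_county("  TRAVIS ") == "48453"
assert normalize_county("travis county") == "48453"
```

Joining on stable numeric identifiers like FIPS codes, rather than on names, is the standard defense against this class of error; the normalization step exists only to recover those identifiers from messy source files.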

3.3 The Voter File: The Campaign's Most Valuable Asset

If the Census is the foundation of the political data ecosystem, the voter file is its most politically consequential product. The voter file is a database of registered voters maintained by each state's election administration. It is the starting point for virtually everything campaigns do with data.

What Is in a Voter File?

A typical state voter file contains, for each registered voter:

  • Identifying information: Full name, date of birth, gender (in most states), residential address, mailing address (if different).
  • Registration information: Date of registration, party affiliation (in states with partisan registration), registration status (active, inactive, pending).
  • Voting history: A record of which elections the individual voted in (though not how they voted, which is secret). This includes general elections, primary elections, and sometimes municipal and special elections.
  • District assignments: The voter's congressional district, state legislative districts, county, city, school board district, and other jurisdictions.

Some states also include:

  • Race or ethnicity: A handful of states (including several in the South) collect self-reported race on voter registration forms, a legacy of the Voting Rights Act.
  • Telephone number and email address: Some states collect these; availability varies.
  • Precinct and polling place assignment: Where the voter is assigned to vote.
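Putting the fields above together, a single voter-file record can be sketched as a small data structure. The field names and example values are invented; every state's actual layout differs:

```python
# Sketch of one voter-file record matching the fields described above.
# All names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class VoterRecord:
    voter_id: str
    full_name: str
    date_of_birth: str            # often year-only in public releases
    residential_address: str
    registration_date: str
    party: str                    # only in partisan-registration states
    status: str                   # "active", "inactive", "pending"
    vote_history: dict = field(default_factory=dict)  # election -> method
    districts: dict = field(default_factory=dict)     # office -> district

v = VoterRecord(
    voter_id="000123", full_name="DOE, JANE", date_of_birth="1985",
    residential_address="100 MAIN ST", registration_date="2012-04-09",
    party="UNA", status="active",
    vote_history={"2020-11-03": "mail", "2022-11-08": "in_person"},
    districts={"congress": "TX-35", "state_senate": "14"},
)
print(v.status, len(v.vote_history))
```

Notice what the `vote_history` field does and does not contain: which elections the voter participated in and by what method, but never the ballot choices themselves, which remain secret.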

Who Can Access the Voter File?

Access to voter files varies by state, and the rules are one of the most politically consequential aspects of data infrastructure. In general:

  • Political parties and candidates can access the voter file in every state, usually for a nominal fee. This is the legal basis for campaign voter contact operations.
  • Researchers and journalists can access the file in many states, though some states restrict access to "political purposes" or impose conditions on use.
  • Commercial entities can access the file in some states for commercial purposes; other states prohibit this.
  • Individual citizens can usually access their own record but not the full file.

The cost of accessing voter files ranges from free (some states) to tens of thousands of dollars (some states charge per record). This cost structure creates a significant barrier for small campaigns, researchers, and civic organizations.

The Enriched Voter File

For campaigns and political organizations, the raw voter file is just the beginning. A cottage industry of data vendors and data brokers takes the public voter file and enhances it---or "enriches" it---with additional data from commercial sources.

Companies like L2, TargetSmart, and Aristotle merge the voter file with:

  • Consumer data: Purchasing habits, magazine subscriptions, car ownership, estimated income, homeownership status, and hundreds of other variables from commercial data brokers like Acxiom, Experian, and Oracle Data Cloud.
  • Modeled scores: Predicted party preference, ideology, issue positions, likelihood of donating, and other attributes estimated using statistical models trained on survey data and consumer variables.
  • Census demographics: Neighborhood-level demographic data appended to individual records based on residential address.
  • Contact information: Updated phone numbers and email addresses from commercial databases.

The result is an enriched voter file that contains, for each registered voter, not just their public registration and voting history but a comprehensive portrait of their consumer behavior, demographic characteristics, and predicted political attitudes. This is the dataset that campaigns use for microtargeting---the practice of sending different messages to different voters based on their predicted preferences and behaviors.
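A microtargeting query against an enriched file often reduces to filtering on modeled scores. A sketch with invented records; the score names ("support_score", "turnout_score") and cutoffs are hypothetical, since every vendor uses its own scales:

```python
# Sketch: slicing an enriched voter file on modeled scores. Records,
# score names, and thresholds are all invented for illustration.

enriched_file = [
    {"voter_id": "A001", "support_score": 0.92, "turnout_score": 0.95},
    {"voter_id": "A002", "support_score": 0.48, "turnout_score": 0.81},
    {"voter_id": "A003", "support_score": 0.51, "turnout_score": 0.20},
    {"voter_id": "A004", "support_score": 0.07, "turnout_score": 0.90},
]

# Persuasion universe: likely voters whose predicted support is uncertain
persuasion = [
    v for v in enriched_file
    if 0.40 <= v["support_score"] <= 0.60 and v["turnout_score"] >= 0.50
]

print([v["voter_id"] for v in persuasion])  # ['A002']
```

A003 has an uncertain support score but a low turnout score, so the filter drops them from the persuasion universe; a turnout program would target a different slice entirely.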

🔴 Critical Thinking: The enriched voter file raises fundamental questions about privacy, consent, and the power dynamics of political data. Most voters do not know that their registration records are being merged with consumer data to create detailed profiles. They did not consent to this use of their information. And the resulting profiles are used to target them with messages designed to influence their political behavior. Is this a legitimate campaign practice or an invasion of privacy? We will explore this question in depth in Chapter 38.

Nadia Osei's Voter File

When Nadia Osei sets up the Garza campaign's data operation, her first task is acquiring an enriched voter file from a Democratic-aligned data vendor. The file she receives contains approximately 4.2 million records---one for every registered voter in the state---with more than 400 variables per record.

She and her team then add to this file. Every time a Garza volunteer knocks on a door or makes a phone call, the result of that contact is recorded: Was the voter home? Did they answer? Are they supporting Garza, supporting Whitfield, or undecided? Are they willing to volunteer? Are they interested in a specific issue---healthcare, immigration, education, the economy? Over the course of the campaign, these contact records accumulate into a rich behavioral dataset that supplements the vendor-provided file.

"The voter file is not a snapshot," Nadia tells a new volunteer coordinator. "It is a living document. Every conversation you have on a doorstep is a data point. Every data point improves our models. And better models mean we talk to the right people about the right things at the right time."
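The "living document" Nadia describes can be sketched as a contact log keyed by voter ID. The field names and response codes here are invented for illustration:

```python
# Sketch: logging canvass and phone results against the voter file.
# Field names and response codes are hypothetical.
from datetime import date

contact_history = {}  # voter_id -> list of contact attempts

def log_contact(voter_id, channel, result, issue=None):
    """Append one contact attempt; models retrain on this history later."""
    contact_history.setdefault(voter_id, []).append({
        "date": date.today().isoformat(),
        "channel": channel,   # "door", "phone", "text"
        "result": result,     # "support", "oppose", "undecided", "not_home"
        "issue": issue,
    })

log_contact("A002", "door", "undecided", issue="healthcare")
log_contact("A002", "phone", "support")

print(len(contact_history["A002"]))  # two attempts recorded for this voter
```

Over a campaign, entries like these accumulate into exactly the behavioral dataset the passage describes: time-stamped, voter-level observations that no public file contains.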

Jake Rourke, on the Whitfield campaign, has his own version of this infrastructure, provided by a Republican-aligned vendor. His file is somewhat less detailed---the Whitfield campaign's data operation is newer and less well-funded than Garza's---but it contains the same basic architecture: voter records enriched with consumer data and modeled scores, supplemented by the campaign's own contact history.

The difference between the two campaigns' data operations will be one of the factors that shapes the race. We will explore this asymmetry in Chapter 28.

3.4 Academic and Survey Data

The academic research community produces some of the most methodologically rigorous political data available. These datasets are designed not for campaign advantage or commercial gain but for scholarly understanding of political behavior---which makes them invaluable for anyone who wants to understand politics at a deeper level than the day's headlines.

The American National Election Studies (ANES)

The ANES is the grandfather of American political surveys. Conducted by the University of Michigan since 1948 (now in partnership with Stanford University), the ANES is a large-scale survey of the American electorate conducted around each presidential election. It includes:

  • Pre-election and post-election interviews: The same respondents are interviewed before and after the election, allowing researchers to study how opinions change during the campaign and how pre-election attitudes relate to vote choice.
  • Detailed attitude measures: Questions on party identification, ideology, policy preferences, candidate evaluations, political knowledge, media consumption, social identity, and dozens of other constructs.
  • Demographic and socioeconomic data: Detailed information on respondents' education, income, occupation, religion, race, ethnicity, gender, and geographic location.
  • Long time series: Because the ANES has been conducted with consistent core questions since 1948, it allows researchers to track changes in American political attitudes over nearly eight decades. This makes it uniquely valuable for studying trends in partisanship, polarization, trust, and participation.

The ANES data is publicly available for free through the ANES website, making it one of the most accessible academic datasets in political science.

The Cooperative Election Study (CES)

Formerly known as the Cooperative Congressional Election Study (CCES), the CES is a large-scale online survey conducted by a consortium of universities, with data collection managed by YouGov. Its defining feature is its size: the CES surveys approximately 60,000 respondents in each election year, compared to the ANES's typical sample of 2,000 to 6,000.

This large sample size makes the CES invaluable for studying small subgroups of the electorate that are too rare to analyze in smaller surveys: voters in specific congressional districts, members of small racial or ethnic groups, or people with unusual combinations of demographic characteristics.
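The arithmetic behind this advantage is the 1/√n scaling of sampling error. Assuming a subgroup that makes up 2% of respondents (an illustrative figure) and simple random sampling:

```python
# Why sample size matters for subgroups: the margin of error for a
# proportion shrinks with the square root of the subgroup's n.
import math

def moe(p, n):
    """95% margin of error for a proportion under simple random sampling."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

subgroup_share = 0.02  # illustrative: subgroup is 2% of respondents
for total_n, label in [(60_000, "CES-sized"), (2_500, "ANES-sized")]:
    n_sub = total_n * subgroup_share  # expected subgroup n
    print(f"{label}: subgroup n ~ {n_sub:.0f}, "
          f"MOE on a 50/50 split ~ {moe(0.5, n_sub):.1%}")
```

With the CES-sized sample the subgroup yields roughly 1,200 respondents and a margin of error near ±3 points; with the ANES-sized sample it yields about 50 respondents and a margin of error near ±14 points, too wide for meaningful subgroup analysis.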

The General Social Survey (GSS)

Conducted by NORC at the University of Chicago since 1972, the GSS is a broad social attitudes survey that includes many politically relevant questions on topics like trust in government, racial attitudes, gender roles, religious beliefs, and moral values. While not exclusively focused on politics, the GSS provides essential context for understanding the social and cultural foundations of political behavior.

Other Academic Data Sources

The academic ecosystem also includes:

  • The Pew Research Center's American Trends Panel: A probability-based online panel that produces regular reports on political attitudes, media consumption, and social trends.
  • The Voter Study Group (Democracy Fund): A longitudinal panel study that tracks the same respondents over multiple years, allowing researchers to study how individual-level political attitudes change over time.
  • State-level academic polls: University-based survey centers (e.g., Quinnipiac, Monmouth, Marist) conduct public polls that provide state-level data often unavailable from national surveys.

🔗 Connection: In Chapter 5, you will download and analyze data from one or more of these academic sources using Python. In Chapter 10, you will learn to evaluate the quality of polls by examining their methodology, sampling, and weighting---skills that apply whether the poll was conducted by a university, a media organization, or a campaign.

3.5 Media Data

News organizations are both consumers and producers of political data. They commission polls, create data visualizations, build searchable databases, and generate enormous volumes of text, video, and audio content that is itself a form of political data.

Polling and Forecasting

The most visible form of media-produced political data is the commissioned poll. Major news organizations---the New York Times, Washington Post, CNN, Fox News, ABC, NBC, CBS---all sponsor regular public polls, typically conducted in partnership with a polling firm (e.g., the New York Times/Siena College poll, the CNN/SSRS poll).

These polls serve multiple purposes: they generate news stories ("New poll shows Garza leading by 3"), they provide benchmarks for evaluating campaign performance, and they contribute to the broader information environment that shapes voter behavior.

Media organizations also produce election forecasts---probabilistic models that estimate the likelihood of various outcomes. FiveThirtyEight (now part of ABC News), The Economist, and other outlets maintain models that aggregate polling data, fundamentals (economic indicators, approval ratings), and other information to produce win probabilities for individual races and for control of the House, Senate, and presidency.
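At its simplest, the first step of such a model is a weighted average of recent polls. A minimal sketch with invented polls, weighting only by sample size (real forecasting models also adjust for recency, house effects, and fundamentals):

```python
# Sketch: a sample-size-weighted average of a candidate's margin across
# recent polls. The polls below are invented for illustration.

polls = [
    {"pollster": "Poll A", "n": 800, "garza_margin": 3.0},
    {"pollster": "Poll B", "n": 1200, "garza_margin": 1.0},
    {"pollster": "Poll C", "n": 600, "garza_margin": 4.0},
]

total_n = sum(p["n"] for p in polls)
avg_margin = sum(p["garza_margin"] * p["n"] for p in polls) / total_n

print(f"Weighted average margin: Garza {avg_margin:+.1f}")
```

Weighting by sample size alone is a crude proxy for precision; production models typically downweight older polls and pollsters with known partisan house effects before averaging.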

Data Journalism

Beyond polling, media organizations produce political data through investigative and analytical journalism. Examples include:

  • Campaign finance investigations: Reporters analyzing FEC data to identify major donors, track spending patterns, and expose potential corruption.
  • Redistricting analysis: Journalists using geographic and demographic data to evaluate proposed district maps for partisan bias or racial gerrymandering.
  • Legislative tracking: Databases that track every bill, vote, and amendment in Congress or state legislatures.
  • Fact-checking databases: Systematic records of politicians' claims and their accuracy, maintained by organizations like PolitiFact and FactCheck.org.

The Media as Data

Media content itself is increasingly treated as data. The text of news articles, transcripts of debates, and archives of political advertising can all be analyzed computationally to study framing, sentiment, topic coverage, and bias. You will learn these techniques in Chapter 27, when you work with political text analysis in Python.

Social media platforms---X (formerly Twitter), Facebook, Instagram, TikTok, Reddit, YouTube---generate vast quantities of political expression that researchers and analysts treat as data. This data is valuable but also treacherous: social media users are not representative of the general population, and the algorithms that determine what content is visible introduce biases that are difficult to measure or control.

⚠️ Common Pitfall: It is tempting to treat social media as a window into "what people think" about politics. But social media users are younger, more educated, more politically engaged, and more ideologically extreme than the general population. The discourse on X or Reddit is not public opinion; it is a biased sample of public expression, filtered through platform algorithms designed to maximize engagement, not accuracy. Treating social media data as a proxy for public opinion is one of the most common errors in modern political analysis.

3.6 Campaign Data

Campaigns produce enormous volumes of data, most of which is proprietary and never shared publicly. Understanding what campaigns know---and what they do with what they know---is essential for understanding the political data ecosystem.

What Campaigns Produce

  • Voter contact data: The results of millions of door knocks, phone calls, and text messages, recording voter preferences, issue concerns, and persuadability.
  • Fundraising data: Detailed records of who donated, how much, in response to which appeal, and through which channel.
  • Digital behavior data: Clicks, page views, email opens, ad impressions, and other digital interactions tracked through campaign websites, email lists, and digital advertising platforms.
  • Internal polls: Polls commissioned privately by the campaign and not shared with the public, often with larger samples and more detailed questions than public polls.
  • Field data: Reports from field organizers on local conditions---voter enthusiasm, opposition activity, registration drives, early voting patterns.

The Data Arms Race

The two major parties maintain parallel data infrastructures that support their candidates at every level. On the Democratic side, the primary data platform is maintained by a firm closely aligned with the Democratic National Committee, providing voter file access, modeling tools, and organizing software to Democratic campaigns across the country. On the Republican side, a comparable infrastructure exists through firms affiliated with the Republican National Committee.

These parallel systems create an asymmetry in the data ecosystem: candidates from the two major parties have access to sophisticated data infrastructure that independent and third-party candidates cannot match. This asymmetry raises questions about fairness and representation that connect to the chapter's theme of Who Gets Counted, Who Gets Heard.

The proprietary nature of campaign data also creates a significant gap in public knowledge. Internal polls---surveys commissioned by campaigns for their own strategic use---are typically more frequent, more detailed, and sometimes more accurate than public polls, because campaigns can afford larger samples, more targeted geographic coverage, and more sensitive question design. But these polls are shared with the public only when the campaign chooses to release them, usually for strategic reasons (a campaign is more likely to release an internal poll that shows it ahead than one that shows it behind). This selective disclosure means that the public's understanding of a race is always incomplete, shaped by whatever data the campaigns choose to make visible.

In the Garza-Whitfield race, both campaigns commission internal polls approximately every two weeks during the general election. Nadia Osei's team uses these polls to track persuasion metrics, test messaging, and calibrate turnout models. Jake Rourke uses the Whitfield campaign's internal polls to make resource allocation decisions and to prepare the candidate for debates. Neither campaign's internal data is available to the public, to Meridian Research Group, or to OpenDemocracy Analytics. The gap between what the campaigns know and what the public knows is one of the defining features of the information landscape surrounding any competitive election.

🧪 Try This: Go to the FEC's website (fec.gov) and search for campaign finance data for a recent election in your state. Find the total amount raised by each major candidate in a competitive race. Then look at the top individual contributors. What can you learn from this data? What can you not learn? Write down three questions the data answers and three questions it does not.
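If you prefer to explore programmatically, the FEC publishes the same data through its OpenFEC API. The sketch below assembles a query URL for the candidate-search endpoint; the endpoint path and parameter names reflect the public API documentation, but verify them at fec.gov/data before relying on them, and note that `DEMO_KEY` is a heavily rate-limited test key (a real key is free from api.data.gov).

```python
from urllib.parse import urlencode

# Base endpoint for candidate search in the OpenFEC API.
# Executing the request requires an API key; DEMO_KEY works for
# light experimentation but is rate-limited.
BASE = "https://api.open.fec.gov/v1/candidates/search/"

def build_candidate_query(state, office, cycle, api_key="DEMO_KEY"):
    """Assemble a query URL for candidates in a given state, office, and cycle."""
    params = {
        "state": state,    # two-letter state code, e.g. "TX"
        "office": office,  # "H" = House, "S" = Senate, "P" = President
        "cycle": cycle,    # election cycle year, e.g. 2022
        "api_key": api_key,
        "per_page": 20,
    }
    return BASE + "?" + urlencode(params)

url = build_candidate_query("TX", "S", 2022)
print(url)  # paste into a browser or pass to a requests.get() call
```

Fetching the URL returns JSON with candidate names, party affiliations, and financial totals---a starting point for the questions in the exercise above.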

3.7 The Data Broker Layer

Between the raw data produced by government agencies, campaigns, and other sources and the enriched datasets used by analysts sits a layer of intermediaries that most citizens never see: the data broker industry.

Data brokers are companies that collect, aggregate, and sell information about individuals. In the political context, the most important data brokers are those that enrich voter files with consumer data. But the data broker ecosystem is much larger than politics. Companies like Acxiom, Experian, and Oracle Data Cloud maintain databases with information on virtually every American adult, compiled from:

  • Public records: Voter registrations, property deeds, court records, motor vehicle records.
  • Consumer transactions: Credit card purchases, loyalty program data, warranty registrations, magazine subscriptions.
  • Online behavior: Website visits, ad clicks, social media activity (where available).
  • Surveys and self-reported data: Warranty cards, contest entries, and other forms where consumers provide information voluntarily.

These databases contain hundreds of variables per individual, organized into categories like "financial behavior," "lifestyle interests," "media consumption," and "political attitudes." The data is sold to marketers, researchers, campaigns, and anyone else willing to pay.

The political application of consumer data is straightforward: if you know that someone subscribes to hunting magazines, drives a pickup truck, and shops at Walmart, you can make a reasonable guess about their political leanings---not because these behaviors determine political attitudes, but because they are statistically correlated with them. Campaigns use these correlations to assign modeled political scores to every voter in the file, even voters who have never been contacted or surveyed.
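The scoring logic can be sketched as a weighted combination of correlated attributes passed through a logistic function. Everything below is invented for illustration---the attributes, the weights, and their magnitudes are hypothetical, not drawn from any real campaign model, which would be fit on far richer data.

```python
import math

# Hypothetical weights: positive values push the modeled score toward
# one party, negative toward the other. Illustrative numbers only.
WEIGHTS = {
    "hunting_magazine_subscriber": 0.9,
    "pickup_truck_owner": 0.6,
    "urban_zip_code": -0.8,
    "college_degree": -0.4,
}

def modeled_partisan_score(voter_attributes):
    """Return a 0-1 score from binary consumer attributes via a logistic link."""
    z = sum(WEIGHTS[k] for k, v in voter_attributes.items() if v and k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

voter = {"hunting_magazine_subscriber": True, "pickup_truck_owner": True,
         "urban_zip_code": False, "college_degree": False}
score = modeled_partisan_score(voter)
print(round(score, 3))  # closer to 1 = more likely Party A
```

The point of the sketch is the mechanism, not the numbers: a voter who has never been contacted still receives a score, derived entirely from statistical correlations in consumer data.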

⚖️ Ethical Analysis: The data broker industry operates with minimal regulation and limited public awareness. Most Americans do not know how much information is available about them, how it is collected, or how it is used. In the political context, this creates an asymmetry of knowledge: campaigns know far more about individual voters than voters know about campaigns' data operations. This asymmetry raises fundamental questions about informed consent, privacy, and the power dynamics of democratic participation. We will address these questions in Chapters 38 and 39.

3.8 OpenDemocracy Analytics: Building the Bridge

Now that you have a map of the major components of the political data ecosystem, let us return to Adaeze Nwosu and the organization she built to bridge the gap between raw data and public understanding.

The Founding Story

Adaeze Nwosu grew up in Houston, Texas, the daughter of Nigerian immigrants who had come to the United States for graduate school and stayed. Her father was a petroleum engineer; her mother was a pharmacist. Education was non-negotiable in the Nwosu household, and Adaeze excelled---graduating as valedictorian of her high school, earning a double major in journalism and computer science at Northwestern University, and landing a fellowship at a prestigious data journalism organization before she was twenty-three.

For the next decade, Adaeze built data projects at some of the most respected newsrooms in the country. She created an interactive database of police use-of-force incidents. She built a tool that allowed readers to explore campaign finance data by donor, candidate, and industry. She designed a visualization of gerrymandering that was viewed more than two million times.

But she grew increasingly frustrated by the structural limitations of journalism as a vehicle for civic data infrastructure. News organizations operated on daily cycles; building durable data tools required long-term investment. Editorial priorities shifted with the news; the data tools that mattered most to communities were the ones that worked reliably over years, not the ones that generated traffic spikes. And when newsrooms contracted---as they did, relentlessly, throughout the 2010s---the data tools were among the first casualties.

The breaking point came when Adaeze learned that a database of local election results she had spent two years building was being shut down because the newsroom that hosted it was closing its data desk. The database had been used by researchers, local journalists, and community organizers across the country. Its closure meant that years of data would become inaccessible, and the communities that depended on it would lose a vital resource.

"I realized that the problem was not the journalism," Adaeze said in a later interview. "The journalism was great. The problem was the business model. You cannot build civic data infrastructure on top of a business model that is in structural decline. You need a different model."

The different model was a nonprofit. Adaeze spent six months raising seed funding from civic technology foundations, developing a board of advisers that included technologists, political scientists, and community organizers, and recruiting a small technical team. She named the organization OpenDemocracy Analytics---"Open" because the tools would be open source, "Democracy" because the mission was democratic participation, and "Analytics" because the work required analytical rigor, not just good intentions.

ODA's Data Infrastructure

ODA's technical infrastructure is designed around a simple principle: take public data that is technically available but practically inaccessible, and make it usable.

The organization maintains several core products:

The Campaign Finance Explorer. A searchable, filterable database of federal and state campaign finance records, drawn from FEC data and state disclosure databases. Users can search by candidate, donor name, employer, industry, or geographic area. The data is updated daily during election season and presented with visualizations that highlight trends, top donors, and spending patterns.

Adaeze built the first version of this tool herself, drawing on her journalism experience. The current version, maintained by a team of two developers, processes more than 20 million records per election cycle. The tool has been used by local journalists investigating campaign finance in state legislative races, by academic researchers studying donor networks, and by community groups tracking corporate political spending. It has also been used, to Adaeze's mixed pride and discomfort, by opposition researchers on campaigns looking for donor connections to attack their opponents---a reminder that open tools serve all comers, not just the ones whose intentions align with the builder's values.

The Voter Information Portal. A tool that allows citizens to look up their voter registration, find their polling place, see what is on their ballot, and explore the candidates running in their district. The portal draws on data from state election offices, candidate filings, and a team of volunteer researchers who compile candidate information.

The District Data Dashboard. An interactive tool that displays demographic, economic, and electoral data for every congressional and state legislative district in the country. Users can explore district-level data from the Census, ACS, BLS, and election results, compare districts, and download data for their own analysis.

The Open Election Data Repository. A standardized archive of precinct-level election results, compiled from county election offices and state databases. This is one of ODA's most labor-intensive projects, because precinct-level data is published in dozens of different formats by hundreds of different county offices. ODA's team standardizes this data into a common format and publishes it under an open license. The repository currently covers twelve states with complete precinct-level results going back to 2012, with plans to expand to all fifty states. The work is painstaking---a single state can require weeks of data cleaning, with county-by-county variations in field names, geographic identifiers, and candidate name formatting requiring individual attention. But the resulting dataset enables analyses that would otherwise be impossible: studying the geographic granularity of partisan shifts, identifying precincts with unusual turnout patterns, and assessing the impact of redistricting on voting behavior at the neighborhood level.
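The standardization problem described above can be sketched in miniature: map each county's field names onto a canonical schema and normalize candidate-name formatting. The alias tables and the sample record here are invented examples of the kind of county-by-county variation such a project encounters.

```python
# Hypothetical field-name aliases observed across county export formats.
FIELD_ALIASES = {
    "precinct_id": {"precinct", "pct", "precinct_code"},
    "candidate":   {"cand_name", "candidate_name", "choice"},
    "votes":       {"vote_total", "total_votes", "ballots"},
}

def standardize_record(raw):
    """Map a raw county record onto the canonical schema, normalizing values."""
    clean = {}
    for canonical, aliases in FIELD_ALIASES.items():
        for key in raw:
            if key.lower() in aliases or key.lower() == canonical:
                clean[canonical] = raw[key]
    # Normalize candidate names: collapse whitespace, standardize case.
    if "candidate" in clean:
        clean["candidate"] = " ".join(clean["candidate"].split()).title()
    if "votes" in clean:
        clean["votes"] = int(clean["votes"])
    return clean

rec = standardize_record({"PCT": "0042", "CAND_NAME": "  GARZA,  ELENA ", "VOTE_TOTAL": "317"})
print(rec)
```

Multiply this by hundreds of county formats---plus geographic identifiers, write-in handling, and mid-cycle format changes---and the "weeks of data cleaning per state" mentioned above starts to look conservative.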

Sam Harding Joins ODA

Sam Harding came to ODA from a data journalism fellowship, where they had spent two years building tools for analyzing legislative text. Sam is 35, non-binary (they/them), and possesses a rare combination of technical skill, writing ability, and public communication talent.

Sam's role at ODA is to translate the organization's data infrastructure into stories, explanations, and analyses that reach a broad audience. They write blog posts explaining how to use ODA's tools, produce visualizations that break down complex political data, and appear on panels and podcasts as a spokesperson for the data transparency movement.

"Data accessibility is not just a technical problem," Sam argues in a frequently cited blog post. "It is a democracy problem. If the only people who can analyze political data are campaign consultants and academics with institutional access, then data-driven insights will serve those who are already powerful. Open data tools shift the balance---not completely, but meaningfully."

Sam's public advocacy has made ODA visible in ways that Adaeze's behind-the-scenes infrastructure work could not. The organization's following on X, podcast appearances, and media citations have grown significantly since Sam joined, which has helped with fundraising and recruitment. But it has also created tensions: some of ODA's technical staff feel that Sam's public profile overshadows their work, and Adaeze sometimes worries that ODA's reputation is becoming too dependent on a single charismatic spokesperson.

🔵 Debate: Sam argues that open data tools "shift the balance" of power in democracy. Critics might respond that open data is used most effectively by those who already have the skills and resources to analyze it, meaning it reinforces rather than challenges existing power dynamics. Who is right? How could ODA design its tools to ensure they reach underserved communities, not just data-savvy elites?

3.9 A Taxonomy of Political Data

Having surveyed the major components of the ecosystem, let us organize what we have learned into a taxonomy---a classification system that will help you think about political data throughout this book.

By Source

Source Type | Examples | Access | Cost
Government administrative | Census, ACS, BLS, FEC, state voter files | Generally public | Free to moderate
Government electoral | Precinct results, candidate filings, ballot data | Public but fragmented | Free to moderate
Campaign-generated | Voter contact data, internal polls, digital analytics | Proprietary | Not available
Academic survey | ANES, CES, GSS, Pew | Public (with registration) | Free
Media-produced | Polls, forecasts, data journalism | Public | Free
Commercial/broker | Enriched voter files, consumer data | Commercial | Expensive
Social media | Posts, comments, engagement metrics | Varies by platform | Free to expensive

By Type

Administrative data is collected as a byproduct of government operations. Voter registration records, election results, tax records, and Census data are all administrative data. Its advantage is comprehensive coverage; its limitation is that it records only what the government chooses to track.

Survey data is collected through structured questionnaires administered to samples of the population. Polls, the ANES, and the CES are survey data. Its advantage is the ability to measure attitudes, opinions, and self-reported behaviors; its limitations include sampling bias, nonresponse bias, and social desirability bias.

Observational data is collected by observing behavior without direct interaction. Precinct-level vote totals, campaign finance records, and legislative roll-call votes are observational data. Its advantage is that it records actual behavior rather than self-reported behavior; its limitation is that it typically captures what people did but not why they did it.

Digital trace data is generated as a byproduct of online activity. Social media posts, website visits, ad clicks, and search queries are digital trace data. Its advantage is volume and timeliness; its limitations include lack of representativeness, platform-specific biases, and ethical concerns about consent and privacy.

Each type of data has a different relationship to the truth it claims to represent. Administrative data records what people actually did---registered to vote, cast a ballot, filed a contribution---but it cannot tell you why they did it. Survey data can ask people why they hold a particular opinion or made a particular choice, but it is vulnerable to misreporting, social desirability bias, and the artificiality of the survey context. Observational data captures behavior as it occurs in the real world, but the observer's vantage point determines what is visible and what is hidden. And digital trace data captures what people express online, which may or may not reflect what they actually think, believe, or will do when they enter the voting booth.

The most robust political analyses draw on multiple types of data, using each type's strengths to compensate for the others' weaknesses. A study of voter turnout, for example, might combine administrative data from voter files (who actually voted), survey data from the CES (self-reported reasons for voting or not voting), and observational data from precinct-level results (aggregate patterns of turnout across neighborhoods). This triangulation---using multiple data sources to approach the same question from different angles---is one of the most important practices in political analytics, and it is a theme we will return to throughout this book.
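In code, the turnout triangulation described above amounts to joining data sources on a shared key and comparing what each source says. The toy records below are invented; real voter files require licensed access and careful record linkage, and survey respondents are matched to the file by vendors, not by simple ID lookup.

```python
# Administrative data: who actually voted, per the voter file.
voter_file = {
    "V001": {"voted_2022": True},
    "V002": {"voted_2022": False},
}

# Survey data: self-reported turnout and stated reasons (CES-style items).
survey = {
    "V001": {"self_reported_vote": True, "reason": "habitual voter"},
    "V002": {"self_reported_vote": True, "reason": "civic duty"},  # overreport
}

def triangulate(voter_file, survey):
    """Join administrative and survey records, flagging turnout misreports."""
    merged = []
    for vid, admin in voter_file.items():
        if vid in survey:
            row = {"voter_id": vid, **admin, **survey[vid]}
            # Overreport: claimed to vote, but the file shows no ballot cast.
            row["overreported"] = survey[vid]["self_reported_vote"] and not admin["voted_2022"]
            merged.append(row)
    return merged

for row in triangulate(voter_file, survey):
    print(row["voter_id"], "overreported:", row["overreported"])
```

The join exposes exactly the complementarity the text describes: the survey explains motives the file cannot see, while the file corrects misreporting the survey cannot detect.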

By Accessibility

One of the most important dimensions of political data is accessibility---who can get it, under what conditions, and at what cost. Data accessibility is not just a technical issue; it is a political one, because it determines who can participate in data-driven analysis and who is excluded.

At one end of the accessibility spectrum is open data: datasets that are freely available to anyone, in machine-readable formats, with clear documentation and no restrictions on use. FEC data, Census data, and academic survey data generally fall into this category.

At the other end is proprietary data: datasets that are owned by private organizations and available only to those who pay for access or meet specific criteria. Campaign voter contact data, internal polls, and enriched voter files are proprietary.

In between is a vast gray zone of data that is technically public but practically inaccessible---because it is published in inconvenient formats, behind paywalls, or without documentation. State voter files, precinct-level election results, and state-level campaign finance data often fall into this category.

This gray zone is where much of the most important political data lives, and it is where organizations like ODA focus their efforts. Transforming technically-public-but-practically-inaccessible data into genuinely usable data is unglamorous work---it involves writing code to scrape government websites, standardize inconsistent field names, resolve conflicting geographic identifiers, and document the cleaning decisions that were made along the way. But it is essential work, because the accessibility of data determines who can participate in data-driven analysis and therefore who has a voice in data-driven politics.
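The "document the cleaning decisions" step deserves emphasis, because it is what separates a reusable dataset from a black box. One way to sketch it: every transformation appends an entry to a provenance log that ships with the cleaned data. The specific cleaning rules below are hypothetical examples.

```python
# A minimal sketch of cleaning with a provenance log: each transformation
# is recorded so downstream users can audit what changed and why.

def clean_county_name(raw, log):
    """Normalize a county-name field, appending each decision to `log`."""
    value = raw
    if value != value.strip():
        value = value.strip()
        log.append(f"stripped whitespace: {raw!r} -> {value!r}")
    if value.upper().endswith(" COUNTY"):
        value = value[: -len(" COUNTY")]
        log.append(f"dropped 'County' suffix: -> {value!r}")
    if not value.istitle():
        value = value.title()
        log.append(f"normalized case: -> {value!r}")
    return value

log = []
print(clean_county_name("  TRAVIS COUNTY ", log))
for entry in log:
    print(entry)
```

Published alongside the dataset, a log like this is metadata in the fullest sense: it lets a skeptical user reconstruct, and if necessary dispute, every step between the county's raw export and the standardized record.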

🔗 Connection: In Chapter 5, you will work with openly accessible political data in Python---downloading, cleaning, and analyzing datasets from the Census, FEC, and academic surveys. The technical skills you develop there will build on the conceptual map of the data ecosystem you have constructed in this chapter.

3.10 Power, Access, and the Data Gap

Let us step back from the technical details and consider a broader question: who benefits from the current structure of the political data ecosystem, and who is disadvantaged by it?

The Information Asymmetry

The political data ecosystem is structured in a way that creates significant information asymmetries---differences in what different actors know and can do with data.

Campaigns vs. citizens. Campaigns, particularly well-funded ones, have access to enriched voter files, internal polls, sophisticated modeling tools, and teams of data analysts. They know an enormous amount about individual voters---their demographics, their consumer behavior, their predicted political preferences, their likelihood of voting. Voters, by contrast, know relatively little about what campaigns know about them, how that information was obtained, or how it is being used. This asymmetry means that the relationship between campaigns and voters is not one of equals engaging in democratic dialogue; it is one of informed strategists targeting relatively uninformed citizens.

Large campaigns vs. small campaigns. The cost of modern data infrastructure---enriched voter files, modeling tools, analytics staff---gives well-funded campaigns a significant advantage over underfunded ones. A Senate campaign with a $10 million budget can afford a full data operation; a state legislative campaign with a $50,000 budget cannot. This creates a disparity in the quality of information available to different candidates, which may affect electoral competition.

National vs. local. The political data infrastructure is much stronger at the national and state levels than at the local level. Precinct-level election data, local campaign finance records, and municipal government data are often incomplete, unstandardized, or inaccessible. This means that the analytical tools available for understanding national politics are far superior to those available for understanding local politics---even though local government affects daily life more directly.

Data-rich vs. data-poor communities. Some communities are better represented in political data than others. Areas with higher voter registration rates, more active civic organizations, and more media coverage generate more data and are better understood by analysts. Areas with lower registration rates, fewer organizational resources, and less media attention are data-poor---and therefore less visible to the campaigns, media organizations, and policymakers who rely on data.

Adaeze's Dilemma

Adaeze Nwosu understands these asymmetries intimately, and they shape ODA's mission. But she also understands the limitations of what a small nonprofit can do.

"We can make data accessible," she tells her board during a strategy meeting. "We can build tools that anyone can use. We can publish analyses that translate complex data into clear language. But accessibility is necessary and not sufficient. If the people who most need this data---low-income communities, immigrant communities, communities of color---do not know our tools exist, or do not have the internet access to use them, or do not have the time or background knowledge to interpret what they find, then we are just building a beautiful library in a neighborhood where nobody reads."

This is the central tension of the civic data movement: the belief that open data can democratize political knowledge, set against the reality that access to data is only one of many barriers to political empowerment. The others---education, time, trust, language, infrastructure---are not problems that data tools alone can solve.

🌍 Global Perspective: The United States has one of the most extensive public political data ecosystems in the world, thanks to strong freedom-of-information laws, mandatory campaign finance disclosure, and the tradition of a publicly funded Census. In many other democracies, far less political data is publicly available. In authoritarian systems, political data is tightly controlled by the state, and independent data collection can be dangerous or illegal. The American data ecosystem is imperfect, but its relative openness is a democratic asset worth protecting.

3.11 Mapping the Ecosystem: A Visual Framework

Let us consolidate everything we have covered into a visual framework that you can use as a reference throughout this book. Think of the political data ecosystem as consisting of five layers:

Layer 1: Raw Data Production. Government agencies (Census, BLS, FEC, state election offices), campaigns, polling firms, academic researchers, media organizations, social media platforms, and commercial data brokers all produce raw data.

Layer 2: Data Processing and Enrichment. Data vendors merge, clean, and enrich raw data---combining voter files with consumer data, standardizing precinct-level results, geocoding addresses, and building modeled scores. This is the layer where raw data becomes analytically useful.

Layer 3: Analysis and Modeling. Campaign analytics teams, academic researchers, media analysts, and civic technologists apply statistical and computational methods to processed data, producing estimates, models, forecasts, and insights.

Layer 4: Communication and Dissemination. The results of analysis are communicated to various audiences through polls, news stories, visualizations, dashboards, academic publications, campaign briefings, and public data tools.

Layer 5: Decision and Action. Campaign managers make resource allocation decisions. Voters make choices about registration, turnout, and candidate selection. Policymakers make decisions about legislation and regulation. Media editors make decisions about coverage. All of these decisions are informed, to varying degrees, by the data that flows through the ecosystem.

The ecosystem is not a one-way flow. Decisions at Layer 5 generate new data that feeds back into Layer 1. A voter's decision to cast a ballot creates a new entry in the voter file. A campaign's decision to send a mail piece generates data on response rates. A policy decision creates new government statistics. The ecosystem is a cycle, not a pipeline.

🧪 Try This: Using the five-layer framework, map the data flow for a specific political data product you use or encounter regularly (e.g., an election forecast, a campaign finance database, a voter guide). Identify the raw data sources (Layer 1), the processing steps (Layer 2), the analytical methods (Layer 3), the communication channels (Layer 4), and the decisions it informs (Layer 5). Where are the weakest points in the chain?

3.12 Looking Ahead: From Map to Practice

This chapter has given you a comprehensive map of the political data ecosystem. You know the major producers, intermediaries, and consumers of political data. You know the difference between administrative, survey, observational, and digital data. You understand the access barriers and power dynamics that shape who can use data and who cannot. And you have met Adaeze Nwosu and Sam Harding at OpenDemocracy Analytics, who are trying to bridge the gap between raw data and public understanding.

In the next two chapters, you will move from mapping the ecosystem to working within it. Chapter 4 will teach you how to think like a political analyst---the intellectual habits, analytical frameworks, and ethical commitments that distinguish rigorous analysis from casual data browsing. Chapter 5 will put you in front of a computer, downloading and analyzing your first political dataset in Python, using the open data sources described in this chapter.

As you move forward, keep Adaeze's map in mind. Every dataset you encounter has a provenance---a history of production, processing, and distribution. Every analysis you conduct draws on data from specific layers of the ecosystem, with specific strengths and limitations. And every conclusion you reach will be shaped by what the ecosystem makes visible and what it leaves in the dark.

The data ecosystem is not the territory. It is a map of the territory---incomplete, imperfect, and always under construction. Your job as an analyst is to use the map wisely, to know its limitations, and to always be looking for the places where the map and the territory diverge.

One final thought before we summarize. Adaeze Nwosu keeps a note pinned above her desk that reads: "Data does not democratize itself." This is a reminder that the existence of public data is necessary but not sufficient for democratic empowerment. Data must be found, cleaned, analyzed, interpreted, communicated, and acted upon---and at every step, there are barriers that prevent some people from participating. Overcoming those barriers requires not just technical skill but institutional commitment, sustained funding, and a willingness to design for the people who need data most, not just the people who are easiest to reach.

This is the ethos that drives ODA's work, and it is an ethos that runs through this entire textbook. Political analytics is not just about numbers. It is about people---the people who produce the data, the people who analyze it, and the people whose lives are shaped by the decisions it informs. Keeping those people in view, even when you are deep in a dataset or debugging a model, is the mark of a responsible analyst.

Chapter Summary

This chapter has mapped the political data ecosystem in the United States---the interconnected network of government agencies, campaigns, media organizations, academic institutions, data brokers, and civic technology organizations that produce, process, analyze, and distribute political data.

You have learned about the major government data sources (Census, ACS, BLS, FEC), the structure and significance of voter files, the role of data brokers in enriching voter records with consumer data, the academic survey infrastructure (ANES, CES, GSS), the media's dual role as data consumer and producer, and the proprietary data operations that campaigns guard closely.

You have met Adaeze Nwosu and Sam Harding at OpenDemocracy Analytics, and you have learned about ODA's mission to bridge the gap between raw public data and accessible civic tools. You have grappled with the information asymmetries that characterize the ecosystem---the ways in which well-funded campaigns, data-rich communities, and technically skilled analysts have advantages that others lack.

And you have been introduced to a five-layer framework for understanding data flow: from raw production through processing, analysis, communication, and decision-making, with feedback loops that connect action back to data generation.

The political data ecosystem is vast, complex, and constantly evolving. But it is not incomprehensible. With the map you have built in this chapter, you are ready to begin navigating it---first conceptually, in Chapter 4, and then practically, in Chapter 5.

