Appendix C: Data Sources Guide
Political analytics depends on data, and most of the best political data is publicly available — if you know where to look. This appendix catalogues the primary sources used throughout the textbook, plus additional sources for independent research projects. Each entry describes what the source contains, how to access it, what limitations to keep in mind, and how to cite it in academic work.
The data landscape changes. URLs move, organizations merge, and access policies evolve. If a URL in this appendix no longer works, search for the source name directly and check whether the host organization has migrated its repository. The sources themselves are generally stable even when their web addresses are not.
C.1 Government and Official Data
U.S. Census Bureau
What it contains: The Census Bureau is the primary source for demographic and socioeconomic data on the American population. Two programs are most relevant to political analysis:
The American Community Survey (ACS) is an ongoing annual survey of approximately 3.5 million households. It produces estimates of income, education, occupation, housing, citizenship, and language use at geographic levels ranging from the nation down to census tracts and block groups. The ACS releases one-year estimates (available for areas with populations above 65,000), three-year estimates (discontinued after 2013), and five-year estimates (available for all geographic areas including small rural counties).
The Decennial Census counts every U.S. resident every ten years (most recently 2020). The decennial census provides population counts used for congressional apportionment and redistricting. The detailed demographic data previously collected in the "long form" of the decennial census is now collected through the ACS.
How to access: data.census.gov provides a browser interface. For programmatic access, the Census Bureau offers a robust API at api.census.gov. The cenpy Python package provides a convenient wrapper. Pre-built datasets for common political science uses (county-level demographic data merged to election years) are available from the Harvard Dataverse and the Social Explorer platform (subscription required for some features).
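As a rough illustration of programmatic access, the sketch below builds a request URL for the ACS five-year endpoint using only the Python standard library. The 2022 vintage, the variable B01003_001E (total population), and the state code 37 (North Carolina) are choices made for this example, and the helper name acs5_county_url is hypothetical. Check the API documentation for the vintages and variables you actually need.

```python
from urllib.parse import urlencode

BASE = "https://api.census.gov/data"


def acs5_county_url(year, variables, state_fips, api_key=None):
    """Build a URL requesting ACS 5-year variables for every county in one state.

    Hypothetical helper for illustration; not part of any Census library.
    """
    params = {
        "get": ",".join(["NAME", *variables]),
        "for": "county:*",
        "in": f"state:{state_fips}",
    }
    if api_key:
        params["key"] = api_key
    # safe=":*," leaves the geography syntax unescaped so the URL stays readable
    return f"{BASE}/{year}/acs/acs5?{urlencode(params, safe=':*,')}"


url = acs5_county_url(2022, ["B01003_001E"], "37")
print(url)
```

The response is JSON in which the first row holds the column names and each later row holds one county. The URL can be fetched with urllib.request.urlopen or any HTTP client.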
Key limitations: ACS estimates for small geographic areas (particularly census tracts with under 5,000 people) carry large margins of error. For counties with populations below 65,000, the standard one-year estimates are not published, so analysts must rely on five-year estimates, which are more precise but describe an averaged five-year period rather than a single point in time. Population estimates between census years are modeled and carry uncertainty. Citizenship and immigration questions have a complex political history; response rates on immigration-related items may be lower in certain communities.
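To make the margin-of-error concern concrete, here is a minimal sketch of how published ACS margins of error can be turned into standard errors and coefficients of variation, and how they combine when small areas are aggregated. It assumes the margins are published at the 90 percent confidence level and uses the common root-sum-of-squares approximation, which ignores any covariance between estimates. The tract figures are invented for illustration.

```python
import math

Z90 = 1.645  # assumes margins of error are published at the 90 percent level


def standard_error(moe):
    """Convert a published margin of error to a standard error."""
    return moe / Z90


def coefficient_of_variation(estimate, moe):
    """Standard error relative to the estimate; larger means less reliable."""
    return standard_error(moe) / estimate


def moe_of_sum(moes):
    """Approximate margin of error for a sum of estimates (ignores covariance)."""
    return math.sqrt(sum(m ** 2 for m in moes))


# Hypothetical tract-level estimates: (estimate, published margin of error)
tracts = [(420, 150), (310, 140), (505, 180)]

tract_cvs = [coefficient_of_variation(est, moe) for est, moe in tracts]
total = sum(est for est, _ in tracts)
combined_moe = moe_of_sum(moe for _, moe in tracts)
combined_cv = coefficient_of_variation(total, combined_moe)

print([round(cv, 2) for cv in tract_cvs])
print(total, round(combined_moe, 1), round(combined_cv, 2))
```

In this made-up case each tract has a coefficient of variation above 0.2, while the three tracts combined fall to roughly 0.13. That is the usual argument for aggregating small areas before analysis.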
Suggested citation format: U.S. Census Bureau. (2022). American Community Survey 5-Year Estimates [Data set]. Retrieved from https://data.census.gov.
Bureau of Labor Statistics (BLS)
What it contains: The BLS produces the official measures of employment, unemployment, wages, and price levels in the United States. For political analysts, the most commonly used products are: the monthly Current Population Survey (CPS) unemployment rate; the Local Area Unemployment Statistics (LAUS) program, which produces monthly unemployment estimates at the state, county, and metropolitan statistical area level; and the Consumer Price Index (CPI), the primary measure of inflation.
How to access: bls.gov provides a data query tool. The BLS public API (api.bls.gov) allows programmatic retrieval of time series data with a registration key (free). County-level LAUS data is downloadable as flat files from the BLS website.
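The following sketch shows one way a request to the BLS API might be prepared. Several details are assumptions that should be checked against the BLS documentation: the version 2 endpoint path, the JSON field names, and the layout of county LAUS series IDs (taken here to be the prefix LAU, the area code CN plus a five-digit county FIPS code plus eight zeros, and a two-digit measure code in which 03 is the unemployment rate). The helper names are hypothetical.

```python
import json

# Assumed version 2 endpoint; confirm against the BLS API documentation
BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def laus_county_series(county_fips, measure="03"):
    """Build a county LAUS series ID under the assumed layout described above."""
    fips = str(county_fips).zfill(5)  # restore a dropped leading zero
    return f"LAUCN{fips}00000000{measure}"


def build_payload(series_ids, start_year, end_year, key=None):
    """Serialize the request body; the registration key is optional here."""
    payload = {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if key:
        payload["registrationkey"] = key
    return json.dumps(payload)


series = laus_county_series("4001")
body = build_payload([series], 2020, 2023)
print(series)
print(body)
```

The body would then be sent as an HTTP POST to BLS_URL with a Content-Type header of application/json, for example with urllib.request.Request.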
Key limitations: The headline unemployment rate (U-3) counts only people without jobs who actively searched for work in the past four weeks. It excludes discouraged workers and the underemployed. Alternative measures (U-4 through U-6) capture broader labor market distress. The seasonally adjusted national rate and the not-seasonally-adjusted county rate are different products measured differently; compare like with like.
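A short worked example may help show why the choice of measure matters. The sketch assumes the standard definitions: U-3 divides the unemployed by the labor force, while U-6 adds marginally attached workers and people working part time for economic reasons to the numerator, and adds the marginally attached to the denominator. The counts are invented.

```python
def u3_rate(unemployed, labor_force):
    """Headline rate: unemployed as a percentage of the labor force."""
    return 100 * unemployed / labor_force


def u6_rate(unemployed, marginally_attached, part_time_economic, labor_force):
    """Broader rate that also counts marginally attached and involuntary part-time workers."""
    numerator = unemployed + marginally_attached + part_time_economic
    denominator = labor_force + marginally_attached
    return 100 * numerator / denominator


# Hypothetical counts for a single area
labor_force = 100_000
unemployed = 4_000
marginally_attached = 1_000
part_time_economic = 3_000

u3 = u3_rate(unemployed, labor_force)
u6 = u6_rate(unemployed, marginally_attached, part_time_economic, labor_force)
print(round(u3, 2), round(u6, 2))
```

With these numbers the same labor market reads as 4.0 percent under U-3 and about 7.9 percent under U-6, so a comparison across sources is only meaningful when both use the same measure.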
Suggested citation format: U.S. Bureau of Labor Statistics. (2024). Local Area Unemployment Statistics [Data set]. Retrieved from https://www.bls.gov/lau/.
Federal Election Commission (FEC)
What it contains: The FEC is the regulatory agency responsible for campaign finance disclosure. It collects and publishes financial disclosures from all federal candidates, political parties, and political action committees. The data includes contributions received, disbursements, debts, and independent expenditure reports. The FEC's bulk data downloads contain every itemized contribution above $200 reported to the agency.
How to access: fec.gov/data provides a search interface for individual candidates and committees. For bulk analysis, the FEC provides full data exports at fec.gov/data/browse-data/?tab=bulk-data. The FEC also maintains a public API. OpenSecrets (described in Section C.3) provides a more analysis-ready version of FEC data with organizational classification.
Key limitations: Contributions below $200 need not be itemized and are reported only as aggregated totals. "Dark money" 501(c)(4) organizations that make independent expenditures need not disclose their donors, creating a significant blind spot in the money-in-politics picture. The FEC's enforcement capacity has been constrained by partisan deadlock among commissioners; some disclosure violations are not pursued. Data timeliness varies: quarterly reporting creates a lag between spending and public disclosure.
Suggested citation format: Federal Election Commission. (2024). Campaign Finance Data [Data set]. Retrieved from https://www.fec.gov/data/.
State Election Authorities
What it contains: County-level and precinct-level election results for primary and general elections, including down-ballot races not covered by national data aggregators. Voter file data (registered voters and their turnout history) is maintained by state authorities.
How to access: Every state has an official election authority — usually a Secretary of State office or State Board of Elections. These vary considerably in data quality, accessibility, and openness. The National Conference of State Legislatures (ncsl.org) maintains a directory of state election offices. The MIT Election Data and Science Lab (Section C.3) aggregates and standardizes election returns from many states.
For voter file data (where public), each state has its own rules. Some states (North Carolina, Ohio, Florida) make detailed voter files including turnout history freely available. Others charge fees or require application. Commercial vendors such as L2 and Catalist provide cleaned, merged voter file data nationally, but at commercial cost.
Key limitations: Standardization across states is a persistent challenge. County names, candidate classifications, and party labels vary. Some states report uncertified results that are later revised. Precinct boundaries change between elections, complicating longitudinal analysis.
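Much of the standardization work comes down to cleaning merge keys before joining state files to other sources. The sketch below shows one possible approach to two common problems: county FIPS codes that lost their leading zeros (or were read as floats) and county names that differ only in case, punctuation, or suffix. The function names and the list of suffixes are illustrative assumptions, not a complete solution.

```python
import re


def normalize_fips(value):
    """Return a 5-character county FIPS string, restoring leading zeros."""
    whole = str(value).split(".")[0]   # drop a trailing ".0" from float-typed codes
    digits = re.sub(r"\D", "", whole)  # keep digits only
    return digits.zfill(5)


def normalize_county_name(name):
    """Lowercase, strip punctuation, and drop common suffixes so names compare cleanly."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower()).strip()
    cleaned = re.sub(r"\s+(county|parish|borough)$", "", cleaned)
    return re.sub(r"\s+", " ", cleaned)


print(normalize_fips(1001), normalize_fips("1001.0"), normalize_fips("37183"))
print(normalize_county_name("St. John's County "))
print(normalize_county_name("East Baton Rouge Parish"))
```

Where a FIPS code is available, it is usually the safer merge key. Name matching is best kept as a fallback, with unmatched rows inspected by hand.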
Suggested citation format: [State] Secretary of State / Board of Elections. (2024). Official Election Returns [Data set]. Retrieved from [state-specific URL].
Congressional Record and Legislative Databases
What it contains: The Congressional Record documents floor proceedings, speeches, and votes in Congress since 1873. Legislative status databases track bill sponsorship, committee referrals, amendments, and passage. Roll-call vote records enable analysis of legislative behavior across time.
How to access: congress.gov (the official Library of Congress portal) provides full-text search of the Congressional Record and bill status. GovTrack.us (Section C.6) provides more analysis-friendly access to roll-call votes and legislator data. The Comparative Legislators Database (Comparative-legislators.info) provides comprehensive, cross-nationally comparable legislative data including U.S. members of Congress.
Key limitations: Floor proceedings do not capture most of the real work of legislating, which occurs in committees, party caucuses, and informal negotiations. The Congressional Record is not verbatim — members may revise and extend remarks, and some statements are submitted without being spoken.
Suggested citation format: U.S. Congress. (2024). Congressional Record [115th–118th Congress]. Retrieved from https://www.congress.gov/congressional-record.
C.2 Academic Survey Datasets
American National Election Studies (ANES)
What it contains: The ANES is the gold standard of U.S. public opinion research, continuously conducted since 1948. The flagship Time Series Study surveys Americans before and after presidential elections, with a pre-election interview in fall and a post-election interview in winter. Core measures include vote choice and candidate evaluation, issue positions, party identification, political trust, efficacy, media use, and extensive demographics. The ANES also conducts surveys in midterm election years, pilot studies of new questionnaire items, and special studies (including a panel study following respondents across multiple elections).
How to access: electionstudies.org provides free data access following user registration. Data is available in SPSS, Stata, SAS, and ASCII formats. Most recent studies also offer R and CSV formats. Codebooks documenting every variable and question wording are provided.
Key limitations: Response rates have declined substantially over the ANES's history. Mode effects complicate comparisons across decades as the study moved from in-person interviewing to mixed-mode designs. Sample sizes (typically 2,000–5,000) are insufficient for subgroup analysis (e.g., small racial or ethnic minorities) without combining multiple years. Like all surveys, the ANES captures self-reported behavior; vote validation studies suggest reported turnout significantly exceeds actual turnout.
Suggested citation format: American National Election Studies. (2021). ANES 2020 Time Series Study Pre- and Post-Election Survey [Data set and codebook]. Retrieved from https://electionstudies.org/data-center/.
Cooperative Election Study (CES, formerly CCES)
What it contains: The Cooperative Election Study is the largest regular survey of American political behavior and opinion, with samples typically exceeding 50,000 respondents per election. Fielded in the fall of every federal election year, it covers vote choice, ballot roll-off, issue positions, congressional approval, and validated voting. Because of its size, the CES supports reliable subgroup analysis by state, race, religion, and other characteristics. The study is a consortium of academic teams who each purchase a "module" allowing them to add custom questions to the survey.
How to access: cces.gov.harvard.edu provides public data access. The full cumulative file combines all CES surveys from 2006 to the present with harmonized variables.
Key limitations: The CES is an online panel survey, raising concerns about representativeness of non-internet users (particularly older and lower-income respondents). The large sample size is a double-edged sword: statistically significant differences emerge for very small effect sizes that may lack substantive importance.
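The point about sample size can be illustrated with a pooled two-proportion z-test. The sketch compares the same 1.5-point gap in support at two sample sizes. It treats respondents as a simple random sample and ignores survey weights and design effects, which is a simplifying assumption; the proportions and group sizes are invented.

```python
import math


def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal, the two-sided tail probability is erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# The same 1.5-point gap, at a CES-like scale and at a conventional survey scale
p_large = two_proportion_p_value(0.515, 25_000, 0.500, 25_000)
p_small = two_proportion_p_value(0.515, 1_000, 0.500, 1_000)

print(f"n = 25,000 per group: p = {p_large:.4f}")
print(f"n = 1,000 per group:  p = {p_small:.4f}")
```

The gap is identical in both cases; only the p-value changes. That is why effect sizes, not just significance stars, should be reported when working with very large samples.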
Suggested citation format: Schaffner, B., Ansolabehere, S., & Luks, S. (2021). Cooperative Election Study Common Content [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/E9N6PH.
General Social Survey (GSS)
What it contains: The GSS has tracked American social attitudes since 1972, making it one of the most important longitudinal surveys in the social sciences. While not exclusively a political survey, the GSS includes consistent measures of political party identification, political ideology, confidence in institutions, civil liberties attitudes, and social policy opinions. Long time series allow analysis of attitude change across half a century.
How to access: gss.norc.org provides data access and a browser-based data explorer (GSS Data Explorer) that allows variable selection and tabulation without downloading. Micro-level data files are freely downloadable.
Key limitations: The GSS was conducted in most years from 1972 to 1994, then switched to biennial fielding. Some questionnaire items rotate in and out, limiting continuous time series. Sample sizes (approximately 3,000 per year) limit subgroup analysis. The pandemic disrupted the 2020 fielding, with a web-based survey replacing the standard in-person mode.
Suggested citation format: NORC at the University of Chicago. (2022). General Social Survey [Data set]. Retrieved from https://gss.norc.org/.
Comparative Study of Electoral Systems (CSES)
What it contains: The CSES is a collaborative program of national election studies conducted in over 50 countries, coordinated around a common module of questions. It enables cross-national comparison of electoral behavior, party systems, representation, and democratic legitimacy. For students interested in comparative populism (Chapters 28–32), the CSES is essential.
How to access: cses.org provides all data modules free of charge after registration. The Integrated Module Dataset combines data across all countries and waves.
Key limitations: Question translation and cross-cultural equivalence are persistent challenges in comparative survey research. Countries join and exit the consortium across waves, complicating longitudinal comparisons. The survey relies on national academic institutions as partners, creating variation in sample quality.
Suggested citation format: Comparative Study of Electoral Systems (CSES). (2022). CSES Module 5: Democracy Divided? People, Politicians and the Politics of Populism [Data set]. https://doi.org/10.7804/cses.module5.2022-03-04.
World Values Survey (WVS)
What it contains: The WVS surveys public values, beliefs, and attitudes in approximately 90 countries across multiple waves since 1981. Core measures include political regime preference, institutional trust, social values, religious beliefs, and national identity. The Inglehart-Welzel Cultural Map — derived from WVS data — is a widely cited framework for mapping cross-national cultural variation.
How to access: worldvaluessurvey.org provides free data downloads. The longitudinal file combines data from all waves for trend analysis.
Key limitations: Coverage varies substantially across waves and countries. Quality of survey administration differs across national partners. Comparability across countries with very different political contexts is a fundamental challenge. The WVS does not provide probability samples in all countries; some national samples are convenience-based.
Suggested citation format: Haerpfer, C., et al. (eds.). (2022). World Values Survey: Round Seven – Country-Pooled Datafile [Data set]. JD Systems Institute & WVSA Secretariat. https://doi.org/10.14281/18241.18.
C.3 Campaign and Electoral Data
OpenSecrets / Center for Responsive Politics (CRP)
What it contains: OpenSecrets is the most comprehensive, analysis-ready source for federal campaign finance data. It processes raw FEC filings and adds organizational classification, industry coding, and legislator linkages. Key products include contribution totals by industry and sector for every federal candidate; outside spending by super PACs and dark money groups; lobbyist expenditure data; and a legislative tracking tool linking contributions to sponsorship and vote patterns.
How to access: opensecrets.org provides a full research interface. The OpenSecrets API (at opensecrets.org/open-data) provides programmatic access to many data products. Bulk data downloads are available for registered researchers.
Key limitations: Industry and organizational classifications are applied by CRP analysts and involve judgment calls. Definitional boundaries (e.g., between "finance/insurance/real estate" and "real estate" subindustries) can affect conclusions about which sectors support which candidates. Dark money — contributions to 501(c)(4) "social welfare" organizations that make independent expenditures — is partially captured but incomplete by design.
Suggested citation format: Center for Responsive Politics. (2024). OpenSecrets Campaign Finance Database [Data set]. Retrieved from https://www.opensecrets.org.
Ballotpedia
What it contains: Ballotpedia is a digital encyclopedia of American politics covering elections, candidates, ballot initiatives, and state legislative data. It is particularly valuable for down-ballot and state-level races underserved by national databases. Features include candidate biographies and contact information, election results for thousands of state and local races, ballot measure text and voting results, and coverage of judicial elections.
How to access: ballotpedia.org is freely accessible. Bulk data access requires contacting Ballotpedia directly; the site is not primarily designed for programmatic data extraction.
Key limitations: Ballotpedia relies on volunteers and staff for data entry; coverage of smaller or less visible races may be incomplete. Coverage of historical elections is shallower than coverage of recent ones.
Suggested citation format: Ballotpedia. (2024). [Specific election or candidate entry]. Retrieved from https://ballotpedia.org.
Dave Leip's Atlas of U.S. Presidential Elections
What it contains: Dave Leip's Atlas provides county-level presidential election results from 1836 to the present in a consistent, analysis-ready format. The Atlas includes state and national maps, historical result tables, and the ability to generate custom maps and comparisons.
How to access: uselectionatlas.org. Full data access and mapping tools require a paid subscription (the basic subscription is modestly priced and appropriate for students). Pre-cleaned county-level presidential data is also available from the MIT Election Data and Science Lab.
Key limitations: Like any historical compilation, earlier years carry data quality issues stemming from the original sources (incomplete reporting, varying county boundaries). The Atlas is primarily presidential; congressional results have inconsistent coverage.
Suggested citation format: Leip, D. (2024). Atlas of U.S. Presidential Elections [Data set]. Retrieved from https://uselectionatlas.org.
MIT Election Data and Science Lab (MEDSL)
What it contains: MEDSL is the premier academic resource for U.S. election returns data. It provides standardized, cleaned, documented county-level and precinct-level election results for presidential and U.S. Senate races, along with state legislative results in progress. MEDSL also produces scholarly research on election administration, ballot design, and electoral integrity.
How to access: electionlab.mit.edu and the Harvard Dataverse (dataverse.harvard.edu/dataverse/medsl). All datasets are freely downloadable with documentation.
Key limitations: Precinct-level data is inconsistently available across states and years; coverage expands with each election cycle. Turnout figures require reconciliation with official vote totals that are sometimes revised after initial reporting.
Suggested citation format: MIT Election Data and Science Lab. (2024). County Presidential Election Returns 2000–2020 [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/VOQCHQ.
Redistricting Data (MGGG and Princeton Gerrymandering Project)
What it contains: The Metric Geometry and Gerrymandering Group (MGGG) at Tufts provides redistricting shapefiles, standardized precinct-level data, and tools for analyzing district plans. The Princeton Gerrymandering Project grades congressional and legislative maps for partisan fairness and provides analysis of redistricting proposals.
How to access: mggg.org and gerrymander.princeton.edu. PlanScore (planscore.org), a separate project, allows users to upload district plans and receive efficiency gap and other fairness scores.
Key limitations: Fairness metrics for redistricting are contested — different metrics (efficiency gap, mean-median difference, declination) can produce different conclusions about whether a map is gerrymandered. Neither organization is fully neutral; both have been involved in redistricting litigation as expert witnesses.
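To see how two of these metrics can disagree, consider the sketch below, which computes an efficiency gap and a mean-median difference for a two-party plan. Sign conventions differ across the literature; here both are oriented so that a positive value suggests party A is disadvantaged, which is a choice made for this example. The five-district plan is hypothetical.

```python
from statistics import mean, median


def efficiency_gap(districts):
    """districts: list of (votes_a, votes_b). Positive means party A wasted more votes."""
    wasted_a = wasted_b = total = 0
    for a, b in districts:
        district_total = a + b
        threshold = district_total / 2  # votes needed to win
        if a > b:
            wasted_a += a - threshold   # winner's surplus votes
            wasted_b += b               # all of the loser's votes
        else:                           # ties are assigned to party B for simplicity
            wasted_a += a
            wasted_b += b - threshold
        total += district_total
    return (wasted_a - wasted_b) / total


def mean_median_difference(districts):
    """Mean minus median of party A's district vote shares."""
    shares = [a / (a + b) for a, b in districts]
    return mean(shares) - median(shares)


# Hypothetical plan: party A wins three districts narrowly and two by landslides
plan = [(52, 48), (52, 48), (52, 48), (90, 10), (90, 10)]

eg = efficiency_gap(plan)
mm = mean_median_difference(plan)
print(round(eg, 3), round(mm, 3))
```

In this invented plan the efficiency gap is negative (party B wastes more votes) while the mean-median difference is positive (party A's median district sits below its mean). The two metrics point in opposite directions for the same map.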
Suggested citation format: MGGG Redistricting Lab. (2024). Districtr Mapping Tool and Data [Software and data set]. Retrieved from https://mggg.org.
C.4 Polling and Public Opinion
RealClearPolitics (RCP)
What it contains: RCP aggregates public polls on presidential approval, generic congressional ballot, and competitive federal and gubernatorial races. It provides polling averages, charts of polling trends over time, and links to individual poll releases.
How to access: realclearpolitics.com. Historical polling averages are accessible through the site's data archives; for systematic data collection, the Wayback Machine and poll-aggregator APIs provide historical access.
Key limitations: RCP's averaging methodology is simple (unweighted recent average) and does not adjust for pollster quality, methodology, or partisan lean. Because RCP includes polls from partisan survey firms, its averages can be skewed. FiveThirtyEight and Silver Bulletin use more sophisticated weighting.
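The difference between a simple and a weighted average is easy to see with a few made-up polls. In the sketch below, weighting by sample size stands in for "more sophisticated weighting" purely for illustration; it is not the method used by any aggregator named in this appendix, and the polls are hypothetical.

```python
# Hypothetical polls: margin is candidate X minus candidate Y, in points
polls = [
    {"pollster": "A", "margin": 4.0, "n": 1500},
    {"pollster": "B", "margin": 1.0, "n": 600},
    {"pollster": "C", "margin": -2.0, "n": 400},
]


def simple_average(polls):
    """Unweighted mean of the margins: every poll counts equally."""
    return sum(p["margin"] for p in polls) / len(polls)


def sample_size_weighted_average(polls):
    """Mean of the margins weighted by sample size (an illustrative weight only)."""
    total_n = sum(p["n"] for p in polls)
    return sum(p["margin"] * p["n"] for p in polls) / total_n


unweighted = simple_average(polls)
weighted = sample_size_weighted_average(polls)
print(round(unweighted, 2), round(weighted, 2))
```

Here the same three polls yield a 1.0-point lead unweighted and a 2.32-point lead when larger samples count for more, so the averaging rule should always be stated alongside the average.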
FiveThirtyEight / Silver Bulletin
What it contains: FiveThirtyEight (now under different ownership) and Silver Bulletin (Nate Silver's independent publication) provide probabilistic election forecasts, polling averages, and original political data journalism. FiveThirtyEight's polling database rates pollster quality and provides historical poll results with metadata. Their forecast models combine economic indicators, polling, and structural factors.
How to access: fivethirtyeight.com and silverbulletin.substack.com. Historical FiveThirtyEight forecast models and data are archived on GitHub at github.com/fivethirtyeight/data.
Key limitations: Probabilistic forecasts are frequently misinterpreted as deterministic predictions; a 70% probability of winning is not a "prediction" that a candidate will win, since it implies the candidate loses roughly three times in ten. The FiveThirtyEight brand has had multiple ownership transitions; data quality and methodological transparency may have changed.
Pew Research Center
What it contains: Pew produces some of the most methodologically rigorous nonpartisan opinion research available. Topics include political values, news media consumption, technology and society, immigration attitudes, and international public opinion. Pew's American Trends Panel is a nationally representative online panel recruited via random address-based sampling.
How to access: pewresearch.org. Pew provides topline results and crosstabs for all major surveys as PDF downloads; micro-level datasets for academic use are available at pewresearch.org/american-trends-panel-datasets after registration.
Key limitations: The transition from phone to online panel survey modes complicates comparisons across Pew surveys conducted before and after 2015. Pew's topic selection reflects institutional priorities that may or may not align with specific research questions.
Suggested citation format: Pew Research Center. (2024). [Survey Title] [Data set]. Retrieved from https://www.pewresearch.org.
Gallup
What it contains: Gallup has tracked American political opinion continuously since 1935, making it the source for the longest time series in U.S. political polling. The presidential job approval series is unmatched in length. Gallup also conducts international polls through the Gallup World Poll.
How to access: news.gallup.com provides topline results and charts. Historical data is accessible through Gallup Brain, Gallup's subscription search and analytics platform. Some Gallup data is also available through the Roper Center's iPOLL archive.
Key limitations: Gallup ended its tracking poll (which provided daily approval estimates) in 2017, reducing the granularity of recent approval data. Its current methodology relies heavily on online panels that differ from its historic face-to-face and telephone methodology.
NORC / AP-NORC
What it contains: NORC at the University of Chicago is a leading survey research center that conducts the General Social Survey (above) and high-quality political surveys through its AmeriSpeak panel — a probability-based online panel with nationally representative sampling. AP-NORC conducts and releases public opinion surveys on current affairs in partnership with the Associated Press.
How to access: apnorc.org and norc.org. Individual survey datasets are available through norc.org/Research/Projects.
Key limitations: NORC's surveys are generally high-quality methodologically, but the AmeriSpeak panel shares the limitations of all probability-based online panels: cumulative response rates, which combine panel recruitment, retention, and survey completion, are often below 10%, and sample maintenance requires ongoing adjustment.
C.5 Media and Communication Data
Wesleyan Media Project
What it contains: The Wesleyan Media Project tracks political advertising in U.S. federal and statewide elections. Using Kantar/CMAG data, it monitors the volume, content, tone (positive/negative/contrast), and targeting of political television and digital ads. The WMP produces research reports during campaigns and archives data for academic use.
How to access: mediaproject.wesleyan.edu. Academic data requests are processed through the project's website; data availability varies by cycle.
Key limitations: Airings data reflects purchased airtime, not actual viewership. Tone coding (positive/negative/contrast) is based on automated content classification that may not perfectly match human judgment. Digital advertising data is less complete than television data.
Internet Archive Television News Archive
What it contains: The Internet Archive records news programming from approximately 50 U.S. television news channels and archives them for research and public access. The archive enables keyword search across closed captions, allowing researchers to track how often political topics were mentioned on CNN, Fox News, MSNBC, and local affiliates.
How to access: archive.org/details/tv and the GDELT Television API (below). Search across captions at television.gdelt.org.
Key limitations: Coverage began in 2009; earlier television news must be accessed through commercial archives. Caption quality varies by channel. The archive captures programming but does not provide viewership data — frequency of mention is not the same as audience reach.
GDELT Project
What it contains: The GDELT Project monitors news from online sources worldwide, coding events, themes, and tones using automated analysis. It provides daily counts of mentions of political figures, countries, organizations, and themes; sentiment scores for news coverage; and geographic analysis of news attention. GDELT is extraordinary in its scope: it monitors over 100 languages and produces data updated every 15 minutes.
How to access: gdeltproject.org. GDELT data is hosted on Google BigQuery, allowing SQL queries against the full archive without downloading data. Python wrappers are available. The Global Knowledge Graph (GKG) is the most analysis-ready product for political research.
Key limitations: GDELT's breadth comes at the cost of depth. Automated event and tone coding introduces errors that accumulate at scale. The universe of sources (news websites, not social media or traditional broadcast) is a sample of media, not a census. GDELT is best used for macro-level trend analysis rather than fine-grained content analysis.
Media Cloud
What it contains: Media Cloud tracks online news and social media sharing, enabling analysis of agenda-setting, framing, and the spread of stories through different media ecosystems. It provides story counts, topic tracking, and source mapping across thousands of news outlets.
How to access: mediacloud.org and the Media Cloud API (mediacloud.org/api). Academic accounts provide expanded access.
Key limitations: Media Cloud tracks online sources; broadcast television and print are not captured. Source classification (mainstream vs. partisan, national vs. local) reflects curatorial decisions by the Media Cloud team.
ProPublica APIs
What it contains: ProPublica maintains several data APIs useful for political analysis: the Congress API provides legislative data including bill sponsorship, roll-call votes, and member information; the Campaign Finance API provides FEC data in a convenient format; the Nonprofit Explorer provides IRS Form 990 data for nonprofit organizations including dark money 501(c)(4) groups.
How to access: projects.propublica.org/api-docs. API registration is free; rate limits apply.
Key limitations: The ProPublica Congress API primarily covers recent Congresses; historical coverage is limited. The 990 data reflects nonprofit self-reporting with a multi-year lag before public availability.
C.6 Civic and Accountability Data
Vote Smart
What it contains: Vote Smart collects biographical data, voting records, position statements, campaign finance summaries, and endorsements for candidates and elected officials at federal and statewide levels. The project has been collecting this data since 1992.
How to access: votesmart.org. Structured data is available through the Vote Smart API for academic and nonpartisan uses.
Key limitations: The project relies partly on candidates completing questionnaires; refusal rates are high among incumbents. Data quality for lesser-known offices and candidates is variable.
GovTrack
What it contains: GovTrack aggregates congressional legislative data from congress.gov and provides analysis-ready roll-call votes, bill sponsorship records, legislator biographies, and composite ideology scores based on legislative behavior.
How to access: govtrack.us and the GovTrack API at govtrack.us/developers/api. Bulk data downloads are available at github.com/unitedstates/congress.
Key limitations: Ideology scores are derived from co-sponsorship and voting patterns, not from survey responses; they measure revealed legislative behavior rather than self-reported ideology.
OpenDemocracy Analytics (ODA)
What it contains: The OpenDemocracy Analytics dataset is the primary teaching dataset for this textbook. It integrates county-level U.S. presidential and congressional election results (1980–2024) with demographic data from the American Community Survey, economic indicators from the BLS and Census Bureau, polling data, and media environment proxies. The dataset was constructed to support quantitative political science pedagogy, with realistic statistical properties calibrated against the underlying official sources it synthesizes.
How to access: The ODA dataset is distributed with this textbook through the course materials repository. See Appendix B, Section B.5 for detailed documentation of all six tables, their variable definitions, and standard loading patterns.
Key limitations: The ODA dataset is a pedagogical tool, not an official data release. Researchers undertaking original scholarship should verify findings against the primary sources the ODA dataset draws on. Alaska and Hawaii present geographic challenges and have sparser data coverage. Economic variables for 1980–1989 are based on interpolated estimates and carry higher uncertainty.
Suggested citation format: OpenDemocracy Analytics Project. (2025). ODA County-Level Political Dataset, 1980–2024 [Data set]. Distributed with Political Analytics: From Populism to Polling, Chapter Code Materials.
C.7 Tips for Working with Political Data
Always read the documentation. Every dataset described in this appendix provides a codebook or technical documentation. Before analyzing any variable, read how it was constructed, when it was collected, what population it covers, and what codes are used for missing values. Twenty minutes reading the codebook prevents hours of confused results.
Check for missing data patterns. Missing data is rarely random. If a survey is missing income data more often for lower-income respondents (because they refused the income question), a complete-case analysis will overrepresent higher-income respondents. Use df.isna().sum() as a first step in any new analysis.
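A minimal sketch of that first step, extended to check whether missingness differs across groups, might look like the following. It assumes pandas and NumPy are installed; the tiny data frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract with some refused income responses
df = pd.DataFrame({
    "education": ["hs", "hs", "hs", "hs", "college", "college", "college", "college"],
    "income": [np.nan, 28000, np.nan, np.nan, 72000, 65000, np.nan, 90000],
    "turnout": [0, 1, 0, 1, 1, 1, 1, 0],
})

# Step 1: how many values are missing in each column?
missing_counts = df.isna().sum()

# Step 2: does the share of missing income differ by education group?
missing_by_group = df["income"].isna().groupby(df["education"]).mean()

print(missing_counts)
print(missing_by_group)
```

In this made-up extract, income is missing for three quarters of the "hs" group but only one quarter of the "college" group. A gap like that is a warning that dropping incomplete rows would shift the composition of the sample.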
Know your unit of analysis. Mixing county-level and individual-level data in the same analysis without recognizing the difference is a path to the ecological fallacy (see Appendix A). Be explicit about whether each variable is measured at the individual, county, district, state, or national level.
Cite your sources. Political data has a history and provenance. When you report that 47% of likely voters support a candidate, or that county-level unemployment correlates with Republican vote share, the reader needs to know where those numbers come from, what year they describe, and what sample they cover. The citation formats in this appendix provide templates.
Be skeptical of pre-cleaned datasets. Analysis-ready datasets like those from OpenSecrets or MEDSL involve judgment calls — how to classify a contribution, how to assign a FIPS code when boundaries changed, how to handle write-in votes. These decisions are usually well-documented and defensible, but they are choices. When results are consequential, trace the numbers back to primary sources to verify.