Appendix A: Research Methods Primer

A Practical Guide to AI Ethics Research for Business Professionals


Introduction

AI ethics is an empirical field as much as a philosophical one. When a journalist reports that an algorithm discriminates against Black defendants, or a company claims its hiring tool has been "bias-tested," or a regulator finds that a lender's model violated fair lending law — each claim rests on research methods that produced evidence. Business professionals who want to engage critically with AI ethics must understand how that evidence is generated, what makes it strong or weak, and how to distinguish rigorous scholarship from advocacy dressed in the language of science.

This primer introduces the research methods most commonly used in AI ethics scholarship. It is written for readers without formal training in research methodology. The goal is practical literacy: after reading this appendix, you should be able to read an empirical paper, assess a media report about AI bias, evaluate a vendor's claims about fairness testing, and understand what questions to ask.


Section 1: How AI Ethics Research Is Done

Methodological Pluralism

AI ethics is genuinely interdisciplinary. A single important question — does this algorithm discriminate? — can be studied by computer scientists measuring statistical disparities, sociologists interviewing affected communities, lawyers analyzing regulatory frameworks, economists running audit studies, and journalists obtaining documents through public records requests. Each approach yields different evidence; each has different strengths and blind spots.

This pluralism is a feature, not a bug. No single method can answer all the important questions in AI ethics. Statistical audits can reveal that a facial recognition system performs worse for darker-skinned women, but they cannot explain why the system was deployed in low-income neighborhoods, or what it feels like to be wrongly arrested because of it. Ethnography can capture lived experience but cannot generate the large-sample comparisons needed to detect small but systematic disparities.

The strongest AI ethics research triangulates: it uses multiple methods, each illuminating a different aspect of the problem. The ProPublica COMPAS investigation — discussed throughout this primer as a methodological model — combined statistical analysis of algorithm outputs, investigative journalism, qualitative interviews with affected individuals, and legal analysis of due process implications.

The AI Ethics Research Ecosystem

AI ethics research is produced by several communities that have different norms, incentives, and audiences:

Academic researchers at computer science, law, sociology, and public policy departments. Publication norms include peer review, disclosure of methods, and replication standards — though these vary considerably by discipline. Conference proceedings (especially ACM FAccT, NeurIPS, and ICML) are often more important venues than journals in computer science.

Investigative journalists at outlets including ProPublica, The Markup, MIT Technology Review, and the Washington Post. Journalism produces important first disclosures of AI failures but operates under different evidentiary standards than academic research. Speed and public interest often take priority over statistical rigor.

Civil society organizations including AI Now Institute, Algorithmic Justice League, ACLU, and the Electronic Frontier Foundation. These organizations often combine research with advocacy, which can strengthen the policy relevance of findings but may also introduce selection bias.

Government agencies, including NIST, the FTC, the CFPB, and (in the EU) national regulatory bodies, produce both research and regulatory guidance. Government research tends to be authoritative but may lag behind the field.

Consulting and vendor research produced by technology companies and their consultants. This research is subject to strong commercial incentives and should be read with appropriate skepticism, though it is not automatically wrong.


Section 2: Quantitative Methods

Statistical Analysis of AI System Outputs

The most common quantitative approach in AI ethics is to examine the outputs of an AI system — its predictions, recommendations, or decisions — and test for statistical disparities across demographic groups.

Outcome disparity analysis asks: does the system produce systematically different outcomes for different groups? A hiring algorithm that selects 30% of white male applicants but only 12% of Black female applicants has a measurable outcome disparity. The four-fifths rule (80% rule) from the EEOC's Uniform Guidelines provides a widely used enforcement threshold: a selection rate less than 80% of the highest group's rate raises adverse impact concerns.
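As a minimal sketch, the four-fifths check from the hiring example above can be computed directly (the rates are the invented figures from the text, not real data):

```python
# Illustrative four-fifths (80%) rule check using the hypothetical
# selection rates from the hiring example in the text.
def adverse_impact_ratio(rate_group, rate_highest):
    """Ratio of a group's selection rate to the highest group's rate."""
    return rate_group / rate_highest

white_male_rate = 0.30    # 30% of white male applicants selected
black_female_rate = 0.12  # 12% of Black female applicants selected

ratio = adverse_impact_ratio(black_female_rate, white_male_rate)
flagged = ratio < 0.80  # a ratio below 0.80 raises adverse impact concerns

print(f"impact ratio = {ratio:.2f}, flagged = {flagged}")
```

Here the ratio is 0.40, well below the 0.80 threshold, so the disparity would be flagged for further scrutiny.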

Error rate analysis asks: when the system makes errors, are they distributed equally? A facial recognition system might have an overall accuracy of 95% but an error rate of 35% for darker-skinned women. Error rates matter because systems used to make consequential decisions harm people through errors, not through accurate predictions.

Calibration analysis asks: when the system predicts a certain probability, does that probability hold across groups? If a recidivism algorithm assigns a score of 7 out of 10 to both Black and white defendants, are both groups equally likely to reoffend at rates corresponding to that score? Calibration is one of the fairness metrics central to the COMPAS controversy (discussed in Section 4).
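A disaggregated error-rate analysis of the kind described above can be sketched in a few lines. The records below are invented for illustration; each entry gives a group label, the system's binary prediction, and the true outcome:

```python
# Minimal sketch of error rate analysis: compute false positive and
# false negative rates separately for each demographic group.
from collections import defaultdict

records = [
    # (group, predicted_positive, actual_positive) -- invented data
    ("A", 1, 0), ("A", 1, 1), ("A", 0, 0), ("A", 1, 0),
    ("B", 0, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]

def error_rates(records):
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, pred, actual in records:
        c = counts[group]
        if actual:
            c["pos"] += 1
            if not pred:
                c["fn"] += 1  # missed a true positive
        else:
            c["neg"] += 1
            if pred:
                c["fp"] += 1  # falsely flagged a true negative
    return {
        g: {"fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None}
        for g, c in counts.items()
    }

rates = error_rates(records)
for group, r in sorted(rates.items()):
    print(group, r)
```

An aggregate accuracy figure would hide exactly the per-group gap this loop surfaces, which is why disaggregated reporting matters.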

Audit Studies

An audit study is a controlled experiment in which the researcher sends identical or near-identical requests to a system or institution and observes whether responses differ based on a manipulated characteristic.

In the résumé audit tradition (associated with Bertrand and Mullainathan's landmark 2004 study), researchers submit thousands of nearly identical job applications with names randomly assigned to signal race or gender. Any difference in callback rates can be causally attributed to the name, since everything else is held constant.

The paired-testing methodology used in mortgage lending audit studies sends Black and white "testers" with matched financial profiles to seek loans, then observes whether they receive different terms, are steered to different products, or are discouraged from applying.

Strengths of audit studies: They establish causation, not just correlation. They test actual behavior, not stated policy. They generate the kind of large-sample evidence needed for statistical inference.

Weaknesses: They can be expensive and time-consuming. They raise ethical questions about deception. They capture a snapshot of behavior, not its origins. They cannot always probe the mechanism — did the résumé algorithm reject the "Black-named" candidate because of a name feature, or because of some correlated variable in the training data?

HMDA Analysis

The Home Mortgage Disclosure Act (HMDA) requires U.S. lenders to report detailed data on mortgage applications, including applicant demographics, loan terms, and disposition (approved or denied). This publicly available dataset is one of the richest sources of data for studying algorithmic discrimination in lending.

Researchers analyzing HMDA data must control for legitimate underwriting factors (income, debt-to-income ratio, loan-to-value ratio) before attributing racial disparities to discrimination. Sophisticated analyses use regression models to estimate the "unexplained" racial gap in lending outcomes after controlling for observable creditworthiness factors.

A key limitation: HMDA data does not include credit scores, limiting the ability to fully control for creditworthiness. The gap between observed disparities and the disparities that remain after full controls is a perennial methodological controversy in fair lending research.

How to Read Fairness Metrics

Fairness metrics quantify the degree of disparity in a system's outputs. The Quick Reference Cards appendix provides formal definitions; here we focus on interpretation.

Demographic parity (also called statistical parity or group fairness) requires that the proportion of positive outcomes be equal across groups. If 40% of white loan applicants are approved, 40% of Black applicants should also be approved. Critics argue this ignores legitimate differences in creditworthiness; proponents argue it is the appropriate baseline given historical exclusion.

Equal opportunity requires that the true positive rate — the proportion of qualified applicants who receive a positive outcome — be equal across groups. Under this standard, if 70% of creditworthy white applicants are approved, 70% of creditworthy Black applicants should also be approved. This metric focuses on the consequences for people who deserve a positive outcome.

Predictive parity (calibration) requires that, among people assigned the same risk score, the actual outcome rate be equal across groups. This is the metric that Northpointe (maker of COMPAS) emphasized in defending its algorithm, and one that COMPAS approximately satisfied even in ProPublica's data. The controversy is that satisfying predictive parity is mathematically incompatible with equal false positive rates when base rates differ between groups, an impossibility result proven by Chouldechova (2017).

The practical implication for business readers: no single fairness metric is definitively correct. Choosing a metric is a value judgment about who bears the cost of error. Legal compliance, ethical considerations, and stakeholder engagement should all inform metric selection.
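The incompatibility between these metrics can be demonstrated with a toy example (all numbers invented): a score that is perfectly calibrated in both groups still produces different false positive rates once base rates differ.

```python
# Toy illustration of the impossibility result: perfect calibration in
# both groups, yet unequal false positive rates when base rates differ.
# cohorts: (group, score, n_people, n_reoffend) -- invented counts
cohorts = [
    ("A", 0.8, 100, 80), ("A", 0.2, 100, 20),
    ("B", 0.8, 100, 80), ("B", 0.2, 300, 60),
]

THRESHOLD = 0.5  # classified "high risk" if score >= threshold

def calibration(group):
    """Observed reoffense rate per score band for a group."""
    return {s: r / n for g, s, n, r in cohorts if g == group}

def false_positive_rate(group):
    """Share of non-reoffenders flagged as high risk."""
    flagged_neg = sum(n - r for g, s, n, r in cohorts
                      if g == group and s >= THRESHOLD)
    total_neg = sum(n - r for g, s, n, r in cohorts if g == group)
    return flagged_neg / total_neg

# Both groups are calibrated identically (a 0.8 score means an 80%
# reoffense rate in each group)...
assert calibration("A") == calibration("B") == {0.8: 0.8, 0.2: 0.2}
# ...yet the false positive rates diverge because group B has a lower
# base rate: 0.20 for group A versus roughly 0.08 for group B.
print(false_positive_rate("A"), false_positive_rate("B"))
```

No threshold choice repairs this: with different base rates, equalizing one metric necessarily unbalances the other, which is why metric selection is a value judgment rather than a technical one.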


Section 3: Qualitative Methods

Case Study Methodology

Case studies investigate a single case — an organization, a community, an AI deployment, an incident — in depth. Case study methodology is appropriate when the goal is to understand context, mechanism, or process rather than to generalize across many cases.

Strong case studies are not simply anecdotes. They involve systematic data collection (interviews, documents, observations), rigorous analysis, and explicit attention to alternative explanations. Robert Yin's framework for case study research is the standard methodological reference.

In AI ethics, important case studies have examined individual algorithmic deployments (the COMPAS system in Broward County), specific incidents (Amazon's abandonment of its AI hiring tool), and organizational processes (how tech companies handle internal ethics concerns).

Interpreting case studies: A case study cannot prove generalization — one company's failure tells us about that company. But case studies can establish that something is possible, identify causal mechanisms, and generate hypotheses for larger-scale testing.

Interviews

Semi-structured interviews with affected individuals, system designers, regulators, and other stakeholders generate qualitative evidence about experiences, meanings, and processes that surveys and statistical analyses cannot capture.

Strong qualitative interview studies use purposive sampling (selecting interviewees who can speak to the research question), saturation (continuing until no new themes emerge), and systematic coding of transcripts. Qualitative researchers disagree about how many interviews constitute a sufficient sample; 20-50 is common in published studies, but this depends heavily on the population and research question.

Ethnography

Ethnographic research involves sustained immersion in a field site — observing practice in real time rather than relying on retrospective accounts. Ethnographic studies of technology organizations have been particularly important in AI ethics, revealing the gap between stated principles and actual development practices.

Virginia Eubanks's "Automating Inequality" combined ethnographic research with document analysis and interviews to examine how automated systems affect low-income Americans. Safiya Umoja Noble's "Algorithms of Oppression" combined critical discourse analysis with empirical examination of search results.

Document Analysis

Document analysis — the systematic examination of organizational documents, policy texts, training datasets, technical specifications, and regulatory filings — is an underused but powerful method in AI ethics research.

Researchers have used document analysis to examine: what terms of service actually say versus what companies claim; how AI ethics principles documents differ from actual organizational policies; what training datasets contain; and how regulatory enforcement documents characterize AI failures.

Discourse Analysis

Discourse analysis examines language — in media coverage, corporate communications, policy documents, and public debate — to understand how ideas are framed, whose voices are centered, and what assumptions are naturalized. Discourse analysis has been used to examine how AI is framed in tech journalism, how "bias" is defined in AI ethics versus civil rights frameworks, and how the AI ethics principles movement constructs the problem it claims to solve.


Section 4: Mixed Methods

The ProPublica COMPAS investigation (Angwin et al., 2016) is the most influential piece of research in contemporary AI ethics. Its methodological architecture illustrates how mixed methods work in practice.

The quantitative component: ProPublica obtained two years of COMPAS risk scores for all defendants processed through Broward County, Florida, matched to two-year recidivism outcomes. They calculated false positive and false negative rates separately for Black and white defendants. They found that Black defendants were falsely flagged as high risk at nearly twice the rate of white defendants, while white defendants who went on to reoffend were falsely labeled as low risk markedly more often than Black defendants. These results were produced by standard statistical methods: logistic regression and chi-square tests.

The investigative journalism component: Reporters identified specific individuals — real people with names, faces, and stories — whose COMPAS scores seemed to contradict their actual outcomes. One story featured a Black woman who received a high-risk score despite a minor offense history; another featured a white man with an extensive criminal history who received a low-risk score. These case studies made the statistical finding concrete and human.

The legal and policy analysis: The investigation contextualized its findings in constitutional due process concerns (defendants have no right to challenge algorithmic scores), the history of risk assessment in criminal justice, and the commercial interests of Northpointe, the algorithm's developer.

What happened next: Northpointe published a rebuttal arguing that COMPAS satisfied predictive parity. Other researchers defended and criticized both sets of claims. Chouldechova (2017) proved mathematically that the two sets of fairness criteria cannot both be satisfied simultaneously when base rates differ. This debate — still ongoing — is among the most important methodological controversies in AI ethics.

The lesson for mixed methods: The quantitative analysis would not have had the same public impact without the human stories. The human stories would not have had the same analytical weight without the statistical evidence. The legal analysis connected both to actionable policy concerns. Each component amplified the others.


Section 5: Legal and Regulatory Sources

Reading Court Decisions

Court decisions are primary sources of legal authority. A federal appellate decision is binding on courts within that circuit; a Supreme Court decision is binding nationally. When reading court decisions for AI ethics purposes, focus on:

  • The holding: what the court actually decided
  • The reasoning: what legal test or standard the court applied
  • Whether the decision is majority, concurrence, or dissent (only majority opinions have precedential weight)
  • The procedural posture: many important AI ethics cases are dismissed at early stages, meaning courts never reach the substantive merits

The most important AI ethics-related legal areas include: Title VII disparate impact doctrine; Fair Housing Act; Equal Credit Opportunity Act; constitutional due process (for government uses of AI); Fourth Amendment (for surveillance AI); and emerging AI-specific statutes.

Regulations and Enforcement Actions

Regulations are written by administrative agencies pursuant to statutory authority and carry the force of law. Enforcement actions and guidance (the FTC's consent orders, the CFPB's supervisory guidance, the EEOC's technical assistance documents) bind, at most, the parties involved; for everyone else, they signal how regulators interpret the law.

Key sources for tracking regulatory developments:

  • Federal Register for proposed and final U.S. federal regulations
  • Regulations.gov for public comment files
  • Agency websites for enforcement actions and guidance documents
  • EUR-Lex for EU regulations and implementing acts
  • State legislative tracking services for state-level AI laws

Tracking Regulatory Developments

AI regulation is moving faster than any textbook can capture. For ongoing monitoring:

  • NIST's AI RMF resource portal (airc.nist.gov)
  • Future of Privacy Forum regulatory tracker
  • IAPP (International Association of Privacy Professionals) AI governance resource center
  • Stanford HAI policy tracker

Section 6: Evaluating AI Ethics Claims

Evaluating Media Coverage

AI ethics receives extensive media coverage, much of it technically imprecise. When evaluating a news report about AI bias or harm, ask:

  • What is the evidence? Does the reporter describe a specific study, dataset, or methodology? Or is the claim based on anecdote or expert assertion?
  • What is the sample? How many cases are described? Is this representative of the system's overall behavior?
  • Who are the sources? Are affected communities represented, or only developers and officials?
  • What is the comparison? "The algorithm made errors" is less meaningful than "the algorithm made errors at significantly higher rates for Group X than Group Y."
  • What was the alternative? A system that performs worse than perfect may still perform better than the status quo it replaces.

Evaluating Vendor Claims

AI vendors frequently make claims about fairness testing, bias reduction, and ethics commitments. Red flags include:

  • No description of methodology: "We tested for bias" without explaining what was tested, how, on what data, by whom
  • Aggregate accuracy without disaggregated results: Reporting 95% accuracy without breaking results down by demographic group
  • Testing only on development data: Models can appear unbiased on training data but show disparities on real-world deployments
  • Self-testing without independent audit: Audits conducted by the vendor or vendor-selected auditors have obvious conflict-of-interest concerns
  • Fairness claims without specifying which metric: As Section 2 explained, a system can satisfy some fairness metrics while badly failing others

Evaluating Academic Papers

Academic peer review is an imperfect but useful quality filter. When reading academic AI ethics papers, assess:

  • Sample size and representativeness: Small samples cannot support strong generalizations
  • Methodology transparency: Can you understand exactly what was done from the methods section?
  • Comparison conditions: What is the baseline? What are the alternatives?
  • Conflict of interest disclosure: Is the research funded by companies with stakes in the outcome?
  • Replication: Have results been replicated by independent researchers?
  • Peer review venue: A top-tier conference (FAccT, NeurIPS) or journal (Science, Nature) provides more quality assurance than an unreviewed preprint, though important work does appear on arXiv first

What Makes a Strong Study

Strong AI ethics research tends to have:

  1. A clearly stated research question that the data can actually answer
  2. A methodology that is appropriate to the question
  3. Transparent reporting of methods, data, and results — including null results
  4. Acknowledged limitations
  5. Discussion of alternative interpretations
  6. Independence from entities with stakes in the outcome


Section 7: Primary Data in AI Ethics

Using HMDA Data

The FFIEC HMDA data explorer (ffiec.cfpb.gov/hmda) provides access to all HMDA loan application records. The data includes: applicant race and ethnicity, income, loan amount, loan type, property location, and disposition (approved, denied, withdrawn). Researchers can download the data and analyze it in statistical software (R, Python, Stata).

A basic HMDA analysis proceeds by: (1) filtering to a geographic market and loan type; (2) running a logistic regression predicting loan denial on race and ethnicity; (3) adding control variables for income and loan-to-value ratio; (4) examining whether the racial coefficient changes and remains significant after controls.
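Step (3) of that analysis can be sketched without a regression library: stratifying by an underwriting factor and comparing denial rates within each stratum shows what "controlling for" a variable means. The records and income bands below are invented for illustration; a production analysis would fit a logistic regression on the full HMDA extract instead.

```python
# Simplified stand-in for a controlled comparison: compute denial rates
# by race within income bands, so any remaining gap is the disparity
# "unexplained" by income. All records are invented.
from collections import defaultdict

# (race, income_band, denied) -- hypothetical applications
applications = [
    ("white", "low", 1), ("white", "low", 0), ("white", "high", 0),
    ("white", "high", 0), ("black", "low", 1), ("black", "low", 1),
    ("black", "high", 1), ("black", "high", 0),
]

def denial_rates_by_band(applications):
    tallies = defaultdict(lambda: [0, 0])  # (race, band) -> [denied, total]
    for race, band, denied in applications:
        t = tallies[(race, band)]
        t[0] += denied
        t[1] += 1
    return {key: d / n for key, (d, n) in tallies.items()}

rates = denial_rates_by_band(applications)
for band in ("low", "high"):
    # A gap that persists within each band is the "unexplained" disparity.
    gap = rates[("black", band)] - rates[("white", band)]
    print(band, round(gap, 2))
```

Regression does the same thing more efficiently across many controls at once, but the logic is identical: compare like with like, then ask what gap remains.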

Conducting Your Own AI System Audit

An independent audit of an AI system follows these steps:

  1. Obtain access to system outputs (predictions, scores, recommendations) for a representative sample
  2. Obtain or infer demographic characteristics of the subjects
  3. Calculate outcome rates separately for demographic groups
  4. Calculate error rates separately for demographic groups
  5. Apply statistical tests (chi-square, logistic regression) to determine whether disparities are statistically significant
  6. Document methodology and findings
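The statistical-testing step can be sketched with a chi-square test of independence on a 2x2 group-by-outcome table, implemented from scratch here so the mechanics are visible. The counts are invented; in practice one would use scipy.stats.chi2_contingency, which also returns a p-value.

```python
# Chi-square test of independence for a 2x2 table of group x outcome.
def chi_square_2x2(table):
    """table = [[a, b], [c, d]]: rows are groups, columns are outcomes
    (e.g., approved / denied)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical audit: 60 of 100 in group 1 approved vs. 40 of 100 in group 2.
stat = chi_square_2x2([[60, 40], [40, 60]])
# With 1 degree of freedom, a statistic above 3.84 is significant at p < .05.
print(round(stat, 2))  # -> 8.0
```

A significant statistic says the disparity is unlikely to be chance; it does not, by itself, establish the mechanism behind it.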

Many audits use existing administrative data; others require the researcher to generate test cases (the audit study approach).

Freedom of Information Requests

For AI systems used by government agencies, the Freedom of Information Act (federal) and state equivalents require agencies to disclose records including procurement contracts, vendor documentation, system descriptions, validation reports, and audit results. Effective FOIA requests:

  • Describe the records sought with specificity
  • Reference the relevant agency program
  • Request a fee waiver for research purposes
  • Anticipate and prepare for common exemptions (law enforcement techniques, personal privacy)

MuckRock (muckrock.com) is a useful platform for filing and tracking FOIA requests.


Section 8: Resources

Key Academic Venues

Journals:

  • Big Data & Society (SAGE) — open access, interdisciplinary
  • AI & Society (Springer) — long-running interdisciplinary journal
  • Ethics and Information Technology (Springer)
  • Harvard Journal of Law & Technology
  • Journal of Artificial Intelligence Research (JAIR) — technical but includes ethics work

Conference Proceedings:

  • ACM FAccT (Fairness, Accountability, and Transparency) — the premier venue for AI fairness research; proceedings freely available
  • NeurIPS and ICML — top ML conferences with growing ethics tracks
  • AIES (AAAI/ACM Conference on AI, Ethics, and Society)

Key Research Institutions

  • AI Now Institute (ainowinstitute.org)
  • Algorithmic Justice League (ajl.org)
  • Data & Society Research Institute (datasociety.net)
  • The Markup (themarkup.org)
  • Oxford Internet Institute
  • Alan Turing Institute (turing.ac.uk)
  • Partnership on AI (partnershiponai.org)

Databases

  • Google Scholar — comprehensive academic search
  • SSRN — working papers in law and social science
  • arXiv — preprints in computer science and related fields
  • ACM Digital Library — ACM conference proceedings
  • LexisNexis / Westlaw — legal research (subscription required)

Statistical Software

  • R with the fairness, tidyverse, and lme4 packages
  • Python with pandas, scikit-learn, Fairlearn, and AI Fairness 360
  • Aequitas (open-source audit toolkit from University of Chicago)

This primer has introduced the major research methods used in AI ethics scholarship. The skills of evidence evaluation — asking what the study actually tested, what the comparison condition is, who funded the research, and what the limitations are — apply across all the methods discussed here. Systematic skepticism, applied to vendor claims, media reports, and academic papers alike, is the most important methodological skill a business professional can develop.