Capstone Project 1: Ethical AI Audit

A Comprehensive Framework for Assessing AI Systems in Organizational Contexts


1. Project Overview

Purpose

This capstone project requires you to conduct a rigorous, evidence-based ethical audit of a real AI system deployed in an organizational context. Unlike theoretical exercises, this audit will produce findings that could meaningfully inform decisions about how AI systems are governed, improved, or constrained.

Ethical AI auditing is an emerging professional practice that sits at the intersection of technical analysis, organizational behavior, policy assessment, and ethical reasoning. Audits of this kind are increasingly demanded by regulators, boards of directors, civil society organizations, and the public. This project gives you hands-on experience performing one.

The audit is structured around six dimensions of ethical concern: bias and fairness, transparency, accountability, privacy, societal impact, and governance. Across these dimensions, you will gather evidence, apply analytical frameworks, and produce recommendations suitable for an organizational decision-maker.

Scope

The audit focuses on a single AI system — meaning a specific application of machine learning or algorithmic decision-making, not an organization's entire technology portfolio. The system should be one that makes or meaningfully influences consequential decisions affecting real people: employment outcomes, credit access, healthcare prioritization, law enforcement, content exposure, or similar domains.

Learning Objectives

By completing this project, you will be able to:

  • Apply a structured ethical audit framework to a real AI deployment
  • Gather and triangulate evidence from multiple sources about an AI system's design, use, and impacts
  • Identify specific ethical risks and deficiencies in deployed AI systems
  • Produce professional-grade findings and recommendations suitable for organizational decision-makers
  • Communicate complex ethical assessments to both technical and non-technical audiences
  • Synthesize concepts from across this course — fairness metrics, governance structures, regulatory requirements, transparency mechanisms, and stakeholder analysis — in a single integrated project

2. System Selection

Guidelines for Choosing Your System

The system you audit should meet several criteria:

Consequentiality. The system should make or significantly influence decisions that affect people's lives in meaningful ways. A system that recommends movie preferences is less suitable than one that affects employment, credit, healthcare, housing, or freedom.

Accessibility. You need to be able to gather meaningful evidence about it. Systems with public documentation, published research, investigative journalism coverage, or organizational contacts who can speak with you are preferable to fully opaque systems.

Specificity. Audit a specific system, not a broad category. "AI hiring tools" is too broad; "Workday's Skills Cloud matching algorithm as used by [Employer]" is appropriate.

Ethical interest. The system should present genuine ethical questions worth investigating. If an audit would likely conclude "this system raises no significant concerns," it is probably the wrong system to audit.

If You Have Organizational Access

Students with access to an employer's AI systems are encouraged to audit one of those systems. Organizational audits often produce the most valuable learning experiences because you can access internal documentation, interview employees involved in development or governance, and observe the system's actual use. Obtain appropriate permissions before proceeding and follow your organization's confidentiality requirements.

Suggested Systems for Audit

If you do not have organizational access or prefer to audit a well-documented public system, the following categories have substantial public documentation, published research, and civil society reporting:

Hiring and Recruitment ATS Systems. Applicant tracking systems using AI to score, rank, or filter job candidates. The Equal Employment Opportunity Commission (EEOC) has issued guidance on these. Documented systems include those from HireVue, Pymetrics (now part of Harver), and others. Published audits and academic studies provide audit baselines.

Credit Scoring Models. FICO and VantageScore models and their use by lenders are subject to Equal Credit Opportunity Act requirements. CFPB enforcement actions and published research on disparate impact provide rich material.

Content Recommendation Systems. Recommendation algorithms used by major platforms are documented through platform transparency reports, academic reverse-engineering studies, investigative reporting, and regulatory investigations in the EU under the Digital Services Act.

Predictive Policing Tools. Systems like ShotSpotter, PredPol/Geolitica, and various risk assessment instruments have been extensively studied. The Brennan Center for Justice and AI Now Institute have published detailed analyses.

Clinical Decision Support Tools. Epic's sepsis prediction model, the Optum care management algorithm examined in a landmark 2019 Science study, and various radiology AI tools have extensive published research.

Insurance Pricing Models. Auto and health insurance pricing algorithms have been examined by state insurance commissioners and investigative journalists. ProPublica's investigation (with Consumer Reports) of disparities in auto insurance pricing is a useful entry point.


3. The Audit Framework

Your audit must address all six dimensions below. For each dimension, this section specifies what you are looking for, what evidence to gather, and how to interpret your findings.

Dimension 1: Bias and Fairness Audit

What you are assessing: Whether the system produces outcomes that differ systematically across demographic groups, and whether those differences are ethically defensible.

Key questions:

  • Does the system perform differently for different demographic groups (defined by race, gender, age, disability, national origin, or other protected characteristics)?
  • What fairness metric or metrics does the system's developer use, and are those metrics appropriate given the deployment context?
  • Are different fairness criteria (demographic parity, equalized odds, calibration) in tension with each other here, and how has the organization resolved those tensions?
  • Is any observed disparity the result of historical patterns in training data, proxy variables, or feedback loops?

Evidence to gather:

  • Developer-published accuracy and performance statistics broken down by demographic group
  • Third-party audits or academic studies of the system's performance
  • Regulatory findings or enforcement actions
  • User or affected-party reports of disparate treatment
  • The system's training data documentation, if available

Interpretation guidance: A finding of differential performance is not automatically a finding of unethical bias. Some differences reflect legitimate distinctions; others reflect historical injustice. Your audit should assess whether the developer has considered this distinction, what justification they have offered, and whether that justification is adequate.
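To make the fairness metrics above concrete, the following sketch shows how group-wise selection rates and false positive rates might be computed from an audit sample. The records and group labels are hypothetical; a real audit would use output samples gathered with the methods in Section 4.

```python
from collections import defaultdict

def group_rates(records):
    """Compute selection rate and false positive rate per group.

    Each record is (group, y_true, y_pred), where y_true is the actual
    outcome and y_pred is the system's decision (1 = positive/selected).
    """
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "fp": 0, "neg": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["selected"] += y_pred
        if y_true == 0:            # actual negatives: FP denominator
            c["neg"] += 1
            c["fp"] += y_pred
    return {
        g: {
            "selection_rate": c["selected"] / c["n"],
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for g, c in counts.items()
    }

# Hypothetical audit sample: (group, actual outcome, system decision)
sample = [
    ("A", 0, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 0, 0), ("B", 0, 0), ("B", 1, 1), ("B", 0, 1),
]
rates = group_rates(sample)
# Demographic parity compares selection rates across groups; equalized
# odds compares error rates (here, false positive rates) across groups.
```

On a real system the same comparison would be run on a much larger sample, with statistical significance testing before any disparity is reported as a finding.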

Dimension 2: Transparency Audit

What you are assessing: Whether the system's operation can be understood by those affected by it, those responsible for it, and external oversight bodies.

Key questions:

  • Can individuals affected by the system's decisions learn how it reached those decisions?
  • Can organizational decision-makers audit the system's logic?
  • Does public-facing documentation accurately describe how the system works?
  • Is there a model card, datasheet, or equivalent technical documentation?

Evidence to gather:

  • Public documentation: product websites, white papers, published model cards
  • Regulatory disclosures (adverse action notices for credit decisions, EEOC documentation for employment AI)
  • Patent filings and academic papers published by the developer
  • Investigative journalism or civil society reverse-engineering
  • Any explainability features offered to users or administrators

Interpretation guidance: Distinguish between three levels of transparency: operational transparency (can affected individuals understand the decision made about them?), technical transparency (can qualified auditors assess how the model works?), and process transparency (is the organization's decision to use this system, and on what terms, publicly documented?). A system can be partially transparent at one level and opaque at another.

Dimension 3: Accountability Audit

What you are assessing: Whether clear lines of responsibility exist for the system's design, deployment, maintenance, and impacts.

Key questions:

  • Who is legally and organizationally responsible for this system's decisions?
  • Is responsibility clearly allocated between the AI developer, the deploying organization, and any intermediaries?
  • What happens when the system causes harm? Is there a defined process?
  • Who within the organization has authority to modify, suspend, or discontinue the system?
  • Has the organization been held accountable — through litigation, regulation, or public pressure — for this system's impacts?

Evidence to gather:

  • Terms of service and vendor contracts (portions may be public through procurement records)
  • Organizational governance documents: AI governance policies, risk committee charters
  • Litigation history and regulatory enforcement actions
  • News coverage of incidents attributed to the system
  • Interviews with organizational stakeholders, where accessible

Interpretation guidance: Diffuse accountability — "everyone is responsible" — functionally means no one is responsible. Look specifically for whether a named individual or committee has the authority and obligation to act when the system causes harm.

Dimension 4: Privacy Audit

What you are assessing: Whether the system's data practices respect the privacy rights and reasonable expectations of those whose data it uses.

Key questions:

  • What personal data does the system use, and is that use proportionate to the purpose?
  • Was meaningful, informed consent obtained for this use of data?
  • Is data retained longer than necessary?
  • Does the system create new personal data (inferences, scores, profiles) that may be as sensitive as the original inputs?
  • What data security measures protect this data?
  • Does the system's data use comply with applicable law: GDPR, CCPA, HIPAA, FERPA, or sector-specific requirements?

Evidence to gather:

  • Privacy policies and terms of service
  • Data protection impact assessments (required under GDPR; some organizations publish them)
  • Regulatory investigations and enforcement actions by data protection authorities
  • Academic research on data use practices
  • Public disclosures under California Consumer Privacy Act or equivalent

Interpretation guidance: Distinguish between legal compliance and ethical adequacy. A system may comply with applicable law while still using data in ways that are exploitative, disproportionate, or harmful to people who lack meaningful alternatives to consenting.

Dimension 5: Societal Impact Audit

What you are assessing: The system's effects beyond its immediate users and targets — on communities, institutions, norms, and power structures.

Key questions:

  • Who is affected by this system beyond those who directly interact with it?
  • Does the system entrench or amplify existing social inequalities?
  • Does it affect democratic processes, labor markets, or social trust?
  • What is the cumulative impact of this system deployed at scale, across many organizations?
  • Does the system create dependencies or power concentrations that are difficult to reverse?

Evidence to gather:

  • Academic research on systemic impacts of similar AI systems
  • Civil society reports and community testimony
  • Investigative journalism on aggregate effects
  • Economic analysis of labor market or market competition effects
  • Political science or sociology literature on institutional effects

Interpretation guidance: Societal impact analysis is inherently speculative about the future and contested in the present. Acknowledge uncertainty while still reaching defensible conclusions. Use scenarios if necessary.

Dimension 6: Governance Audit

What you are assessing: Whether adequate organizational structures exist to ensure ongoing ethical operation of the system.

Key questions:

  • Does the organization have an AI ethics policy, and does this system fall within its scope?
  • Is there an ethics review process for AI deployment decisions?
  • Is there ongoing monitoring of the system's performance and impacts?
  • Is there a mechanism for affected parties to raise concerns or seek redress?
  • Does the board or executive leadership receive reporting on AI risks?
  • Are there external audits or assessments of this system?

Evidence to gather:

  • Publicly available governance documents: AI policies, responsible AI frameworks
  • Corporate ESG reports and proxy statements mentioning AI governance
  • Regulatory filings
  • Interviews with governance stakeholders, where accessible
  • Industry association membership and commitments


4. Data Collection Methods

Effective audits require triangulating evidence from multiple sources. Relying on any single source — especially developer-provided documentation — creates blind spots.

Public Documentation Review. Begin with everything the organization has published: website content, white papers, press releases, model cards, transparency reports, and academic papers authored by company researchers. Treat this material as representing the organization's aspirations and claims, not necessarily its reality.

Regulatory and Legal Records. Regulatory enforcement actions, court filings, and administrative proceedings often contain detailed factual findings about system design and impacts that the organization would not voluntarily disclose. Search CFPB, EEOC, FTC, state attorney general, and relevant sector regulator databases. PACER provides federal court records.

Freedom of Information Act Requests. For government-deployed systems, FOIA requests can yield procurement documents, contracts, impact assessments, and internal communications. FOIA requests take time; initiate them early if you plan to use this method.

Academic and Civil Society Research. Search Google Scholar, SSRN, and specialized databases for peer-reviewed research on your system or systems like it. Organizations like the AI Now Institute, Algorithmic Justice League, Center on Privacy and Technology at Georgetown Law, and Upturn publish practitioner-oriented research.

Output Sampling and Testing. Where you can interact with the system directly, systematic testing of outputs can reveal disparities. Audit methods include: creating matched test profiles that differ only in a protected attribute, analyzing outputs across a sample of real cases, and comparing documented decision criteria against observed decisions.
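The matched-profile method described above can be sketched as follows. The scoring function here is a hypothetical stand-in for the system under audit (auditors would call the real system's interface), and the profile attributes and group labels are purely illustrative.

```python
import statistics

def matched_pair_audit(score_fn, base_profiles, attribute, values):
    """Score otherwise-identical profiles that differ only in one
    protected attribute; report the mean score per attribute value."""
    results = {v: [] for v in values}
    for base in base_profiles:
        for v in values:
            profile = dict(base, **{attribute: v})  # vary one attribute
            results[v].append(score_fn(profile))
    return {v: statistics.mean(scores) for v, scores in results.items()}

# Hypothetical stand-in for the audited system's scoring endpoint,
# with a deliberate disparity built in for illustration.
def toy_score(profile):
    score = profile["years_experience"] * 10
    if profile["group"] == "B":
        score -= 5
    return score

bases = [{"years_experience": y} for y in (2, 5, 8)]
means = matched_pair_audit(toy_score, bases, "group", ["A", "B"])
# A systematic gap between means["A"] and means["B"] across matched
# profiles is evidence of disparate treatment on that attribute.
```

In practice, matched-pair results should be reported alongside sample sizes and significance tests, since small gaps on small samples are weak evidence.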

Interviews. Where accessible, interviews with employees, former employees, affected individuals, advocates, and domain experts provide information unavailable in published sources. Prepare structured interview guides. Follow research ethics protocols for human subjects.

Investigative Journalism. Major technology investigations by ProPublica, The Markup, MIT Technology Review, Wired, and similar outlets often contain technical detail about system design and documented harms that supplement academic sources.


5. Deliverables

Required Outputs

Executive Summary (2 pages). A concise summary of your audit findings and top recommendations, written for a senior organizational decision-maker who will not read the full report. The summary must be self-contained, clearly state the most significant findings, and present recommendations with priority rankings.

Detailed Audit Report (20–30 pages). The full audit document, structured around the six-dimension framework. Each dimension should include: scope of assessment, evidence gathered, findings (including both areas of adequate practice and areas of concern), and dimension-specific recommendations. The report should include footnotes or endnotes for all factual claims.

Risk Register. A structured table identifying each ethical risk identified in the audit, with columns for: risk description, dimension, affected stakeholders, estimated severity (1–5), estimated likelihood (1–5), current controls, and recommended additional controls.
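One way to keep the register consistent and sortable is to treat each row as a structured record with a derived exposure score (severity × likelihood). The sketch below uses the column set listed above; the two entries are hypothetical illustrations, not findings.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    dimension: str
    affected_stakeholders: str
    severity: int        # 1-5, per the register columns above
    likelihood: int      # 1-5
    current_controls: str
    recommended_controls: str

    @property
    def exposure(self) -> int:
        """Simple severity x likelihood score for prioritization."""
        return self.severity * self.likelihood

# Hypothetical entries for illustration only.
register = [
    Risk("Unequal false positive rates across groups", "Bias and Fairness",
         "Applicants", 5, 4, "Annual vendor report",
         "Independent quarterly audit with group-wise metrics"),
    Risk("No redress mechanism for contested decisions", "Accountability",
         "Affected individuals", 4, 4, "None",
         "Appeals process with human review"),
]
# Sort highest-exposure risks to the top of the register.
register.sort(key=lambda r: r.exposure, reverse=True)
```

A derived score like this is a prioritization aid, not a substitute for judgment; two risks with equal exposure may still demand very different responses.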

Recommendations with Priority Rankings. A standalone recommendations document that extracts all recommendations from the audit report, organizes them by priority (immediate action required / address within 6 months / address within 18 months), and provides a brief rationale for each priority designation.

Presentation (10–15 slides). A professional slide deck suitable for delivery to an organizational audience. The presentation should tell a coherent story, not merely summarize the report. It should open with the most significant finding, build the case with evidence, and close with a clear call to action.


6. Evaluation Criteria

Your work will be scored against five weighted criteria:

Evidence Quality (25%)
  • Excellent: Findings are supported by specific, credible, triangulated evidence from multiple sources. Claims are attributed.
  • Adequate: Findings are generally supported but rely heavily on one source type or include unsubstantiated claims.
  • Inadequate: Findings rely on unsubstantiated assertions or single sources without triangulation.

Framework Application (20%)
  • Excellent: All six dimensions are rigorously applied. Audit goes beyond surface-level assessment to identify specific, concrete issues.
  • Adequate: Most dimensions are addressed with reasonable depth. Some areas are superficial.
  • Inadequate: Dimensions are addressed only superficially or some are omitted.

Analytical Rigor (20%)
  • Excellent: Analysis distinguishes findings from interpretations, acknowledges uncertainty, addresses alternative explanations, and applies appropriate ethical frameworks.
  • Adequate: Analysis is generally sound but occasionally overstates conclusions or misses alternative interpretations.
  • Inadequate: Analysis conflates findings with conclusions, ignores counter-evidence, or applies frameworks incorrectly.

Practical Value of Recommendations (20%)
  • Excellent: Recommendations are specific, actionable, prioritized, and directly address identified findings. Each recommendation explains what should change and who should do it.
  • Adequate: Recommendations are generally relevant but some are vague, unprioritized, or not clearly linked to findings.
  • Inadequate: Recommendations are generic, aspirational, or disconnected from audit findings.

Professional Quality (15%)
  • Excellent: Report and presentation are professional quality: clearly written, well-organized, properly cited, free of errors.
  • Adequate: Generally professional with minor issues.
  • Inadequate: Significant quality issues affecting credibility or usefulness.

7. Suggested Six-Week Timeline

Week 1: System Selection and Scoping. Select your system. Define the scope of your audit. Begin documentary research. Submit a one-page scoping memo for instructor feedback.

Week 2: Evidence Gathering — Documents and Research. Conduct comprehensive documentary research across all six dimensions. Identify evidence gaps. File FOIA requests if applicable. Identify interview candidates.

Week 3: Evidence Gathering — Primary Research. Conduct interviews. Perform output testing if applicable. Continue documentary research to address gaps identified in Week 2.

Week 4: Analysis. Analyze evidence across all six dimensions. Draft the risk register. Begin drafting the audit report.

Week 5: Writing. Complete the audit report. Draft the executive summary. Develop recommendations and priority rankings.

Week 6: Finalization and Presentation. Finalize all written deliverables. Prepare and rehearse the presentation. Submit all materials.


8. Resources

Audit Frameworks and Tools

  • AI Now Institute's Algorithmic Accountability Policy Toolkit
  • Partnership on AI's ABOUT ML project for documentation standards
  • NIST AI Risk Management Framework (AI RMF 1.0) — particularly the "Govern," "Map," "Measure," and "Manage" functions
  • ISO/IEC 42001 AI Management System standard
  • Fairlearn and IBM's AI Fairness 360 toolkit for quantitative bias assessment

Regulatory Guidance

  • EEOC Technical Assistance on AI and algorithmic decision-making in employment
  • CFPB guidance on use of AI in credit decisions
  • FTC guidance on using AI and algorithms
  • EU AI Act (Regulation 2024/1689) — particularly Annex III on high-risk AI systems

Academic and Practitioner Literature

  • Raji, I. D., & Buolamwini, J. (2019). Actionable Auditing. AIES 2019.
  • Metcalf, J., Moss, E., & boyd, d. (2019). Owning Ethics. Social Research.
  • Kearns, M., & Roth, A. (2019). The Ethical Algorithm. Oxford University Press.
  • Diakopoulos, N. (2019). Automating the News. Harvard University Press.

See also: Chapter 19 (Conducting Ethical Assessments), Appendix B (Fairness Metrics Reference), and Appendix C (Regulatory Reference Guide) of this textbook.


9. Worked Example: What a Strong Audit Finding Looks Like

The following example illustrates the difference between a weak finding and a strong audit finding, using COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) — a risk assessment instrument used in criminal sentencing in multiple U.S. states — as an illustration.

Weak finding (insufficient):

"The COMPAS system may have racial bias. ProPublica found that Black defendants were more likely to be incorrectly labeled high-risk. This raises concerns about fairness."

This finding is weak because it is vague, does not specify the precise disparity, does not engage with the developer's response, does not analyze which fairness criterion is violated or why, and does not produce actionable recommendations.

Strong finding (what to aim for):

Finding 2.1: COMPAS violates false positive rate parity across racial groups.

Evidence: ProPublica's 2016 analysis of 7,000 defendants in Broward County, Florida found that Black defendants who did not reoffend were labeled high-risk at a rate of 44.9%, compared to 23.5% for white defendants who did not reoffend — a difference of 21.4 percentage points. This false positive rate disparity was statistically significant and has been replicated in subsequent analyses.

Developer response: Northpointe (now Equivant) disputed ProPublica's framing, arguing that COMPAS satisfies calibration — that scores mean the same thing across racial groups. Subsequent academic analysis (Chouldechova, 2017; Kleinberg et al., 2016) demonstrated mathematically that when base rates of recidivism differ across groups, calibration and false positive rate parity cannot both be satisfied simultaneously. This is not a disputable empirical claim but a mathematical constraint.
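The mathematical constraint can be checked numerically. The sketch below constructs a two-bin score that is perfectly calibrated by definition (everyone's score equals their true reoffense probability) and shows that groups with different base rates necessarily end up with different false positive rates. The bin probabilities and group mixes are made up for illustration; they are not COMPAS parameters.

```python
def fpr_under_calibration(share_high, p_high=0.6, p_low=0.2):
    """False positive rate for a group under a perfectly calibrated
    two-bin score: each person receives score p_low or p_high, the
    score equals their true reoffense probability (calibration), and
    the high bin is labeled high-risk.

    share_high: fraction of the group receiving the high score.
    """
    fp = share_high * (1 - p_high)         # labeled high-risk, no reoffense
    tn = (1 - share_high) * (1 - p_low)    # labeled low-risk, no reoffense
    return fp / (fp + tn)

# Same calibrated score applied to two groups whose score mixes differ,
# which is exactly what differing base rates produce:
fpr_x = fpr_under_calibration(share_high=0.5)   # base rate 0.40
fpr_y = fpr_under_calibration(share_high=0.2)   # base rate 0.28
# fpr_x = 1/3 and fpr_y = 1/9: calibration holds in both groups, yet
# false positive rates diverge because the base rates differ.
```

Nothing in this toy model depends on the specific numbers chosen: whenever base rates differ, any calibrated score with imperfect accuracy produces unequal false positive rates, which is the Chouldechova/Kleinberg result in miniature.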

Ethical analysis: In the context of pretrial detention and sentencing, false positive errors — treating someone as high-risk when they would not reoffend — result in deprivation of liberty. This type of error is substantially more harmful than a false negative (underestimating risk). Jurisdictions using COMPAS that prioritize calibration over equal false positive rates are implicitly accepting higher rates of unjustified detention for Black defendants.

Recommendation: Jurisdictions using COMPAS or similar instruments should (a) receive training on the mathematical tradeoffs between fairness criteria, (b) make an explicit, deliberate, publicly accountable choice about which fairness criterion governs their use of the instrument, and (c) conduct regular local validation to assess whether instrument performance matches national benchmarks in their specific population. Priority: Immediate.

This finding is strong because it cites specific quantitative evidence, engages with the developer's counterargument, correctly characterizes the mathematical constraint at issue, applies the ethical framework appropriately to the specific decision context, and produces a specific, prioritized, actionable recommendation.

Your audit findings should aspire to this standard: specific, evidenced, analytically rigorous, and practically useful.


This capstone project synthesizes material from Parts 1 (Foundations), 2 (Bias and Fairness), 3 (Transparency and Explainability), 4 (Accountability and Governance), 5 (Privacy), and 6 (Societal Impact) of this textbook. Students are encouraged to review the relevant chapters as they address each audit dimension.