Chapter 19: Auditing AI Systems


Opening Hook

On January 1, 2023, New York City's Local Law 144 took effect. The law, passed in 2021, required employers who use automated employment decision tools (AEDTs) in New York City to conduct bias audits of those tools and to disclose the audit results publicly. It was, at the time, the first law in the United States to require that AI systems face the same kind of independent external review that financial statements have faced for decades.

The law was modest in scope. It covered only employment tools — not credit, healthcare, or criminal justice applications. It applied only in New York City — not statewide, let alone nationally. Its requirements for what an audit must cover were limited, and early implementation revealed significant compliance challenges. Critics noted that the law's definition of "bias audit" was narrow, that the specified metrics could be gamed, and that auditors had a financial relationship with the companies they audited that could compromise their independence.

But Local Law 144 was also genuinely significant. It established the principle that AI systems used in high-stakes decisions are not exempt from independent external review simply because they are complex technology. It created a legal obligation to examine what AI systems actually do to real people, not just what they were designed to do. It required that findings be disclosed publicly, so that job applicants, researchers, journalists, and regulators could evaluate them. And it prompted a burst of compliance activity, litigation, and regulatory discussion that is reshaping how AI auditing is practiced and understood.

The story of LL 144 — its passage, its implementation, its limitations, and its influence — is the story of AI auditing in miniature: a field being invented in real time, through litigation, legislation, and professional practice, against the backdrop of AI systems that are already deployed at scale affecting millions of people. This chapter examines what AI auditing is, how it works, where it falls short, and what building genuinely effective AI audit frameworks requires.


Learning Objectives

By the end of this chapter, students will be able to:

  1. Define AI auditing and distinguish between technical, process, impact, and compliance audits.
  2. Explain the primary technical, organizational, and legal challenges that make AI auditing difficult.
  3. Describe the elements of a pre-deployment algorithmic impact assessment and explain its role in the AI governance lifecycle.
  4. Apply technical auditing methods — including data auditing, model auditing, and output auditing — to evaluate an AI system's fairness, accuracy, and performance.
  5. Analyze the challenges of external AI auditing, including the access problem, the independence problem, and the credentialing gap.
  6. Summarize the key requirements of NYC Local Law 144, the EU AI Act's conformity assessment requirements, and other regulatory mandates for AI auditing.
  7. Discuss the specific challenges of auditing generative AI and foundation models, including red-teaming and capability evaluation.
  8. Design a basic organizational AI audit function, including scope, frequency, governance structure, and integration with AI development processes.

Section 19.1: What Is an AI Audit?

Defining AI Auditing

An AI audit is a systematic examination of an AI system's design, development, deployment, and outcomes, conducted for the purpose of assessing its compliance with relevant standards, its performance across affected populations, and its potential for harm. The concept draws on the long tradition of financial auditing — the independent external review of financial statements — but applies it to a fundamentally different subject matter with fundamentally different characteristics.

Financial auditing has a well-developed institutional infrastructure: a recognized body of standards (Generally Accepted Auditing Standards, or GAAS), professional credentialing (the Certified Public Accountant designation), mandatory disclosure requirements for public companies, and regulatory oversight of the audit profession by the Public Company Accounting Oversight Board (PCAOB). AI auditing has none of these: no universal standards, no recognized credential, no mandatory disclosure regime covering most AI applications, and no institutional oversight of the audit profession. It is a field being built in real time, and the standards being established now will shape what AI auditing means for decades.

The core purpose of an AI audit is to provide independent, credible information about an AI system's actual behavior and effects — information that the organization deploying the system cannot be expected to provide objectively about itself. Just as financial auditors provide credibility to financial statements that managers have an interest in presenting favorably, AI auditors provide credibility to claims about AI system performance that AI developers and deployers have an interest in presenting favorably.

Types of AI Audits

AI audits take several distinct forms, each with its own focus, methodology, and relationship to accountability:

Technical audits examine the AI system itself — the model architecture, training data, performance metrics, and behavior across different input types and demographic groups. A technical audit answers questions like: How accurate is the system? Does its accuracy vary across demographic groups? Is it robust to adversarial inputs? Is it fair by specified fairness metrics? Technical audits require access to the model and its training data, sophisticated technical expertise, and a clear specification of what metrics constitute adequate performance.

Process audits examine the development and deployment practices surrounding the AI system — the organization's procedures for training, testing, documenting, and maintaining AI systems. A process audit answers questions like: Did the organization follow its own development standards? Was adequate testing conducted? Was documentation completed? Were the right approvals obtained before deployment? Process audits are closer to traditional operational audits and require less specialized AI technical expertise, though they still require enough technical understanding to evaluate whether documented practices are meaningful.

Impact audits examine the real-world effects of an AI system on affected populations. Rather than asking whether the system performs well on test data, impact audits ask: What has actually happened to the people this system has affected? Have loan approval rates changed for certain demographic groups since the AI was deployed? Has recidivism actually decreased among people the risk algorithm predicted to be low-risk? Has the AI-based hiring tool increased or decreased demographic diversity among new hires? Impact audits require outcome data that may be difficult to obtain, and they must address the counterfactual question (what would have happened without the AI system?) that is inherently difficult to answer.

Compliance audits examine whether an AI system and its deployment comply with applicable legal and regulatory requirements. This is the most concrete form of AI audit, because compliance requirements are explicitly defined — the question is whether the system meets them, not whether the requirements themselves are the right ones. Compliance audits are most directly required by laws like NYC LL 144, and they are the type that organizations typically prioritize because the legal consequences of non-compliance are clear.

Internal vs. External Audits

AI audits can be conducted internally — by the organization's own audit function — or externally, by independent third parties. Each has distinct roles and limitations.

Internal audits have the advantage of access: internal auditors can examine the model, the training data, the development process, and the organizational culture with a completeness that external auditors rarely achieve. But internal audits have a fundamental conflict of interest: internal auditors report to management, and management has an interest in favorable audit findings. The audit profession has developed extensive independence requirements to address this problem in financial auditing; AI auditing is only beginning to develop analogous standards.

External audits provide independence that internal audits cannot. An external auditor with no financial relationship to the audited organization and no reporting relationship to its management can provide credible assurance that the internal auditors cannot. But external auditors face the access problem: organizations are reluctant to give external parties access to proprietary models, training data, and development practices. External auditors are often limited to reviewing documentation, conducting interviews, and testing system behavior through the deployment interface — rather than examining the system from the inside.

Effective AI governance typically requires both: internal audits to monitor ongoing performance and compliance throughout the development and deployment lifecycle, and periodic external audits to provide credible independent assurance to regulators, investors, customers, and the public.

Continuous vs. Point-in-Time Auditing

AI systems are not static: they change through retraining, parameter updates, and changes in the data and environment in which they operate. This creates a fundamental challenge for point-in-time auditing — an audit conducted at a single moment in time may not represent the system's behavior over time, and the system audited may not be the system that is deployed six months later.

The most sophisticated AI governance frameworks incorporate continuous monitoring — ongoing tracking of performance metrics, demographic disparities, and anomalies — alongside periodic comprehensive audits. The continuous monitoring provides early warning of performance degradation or emerging fairness issues; the periodic audit provides the comprehensive assessment and independent verification that continuous monitoring cannot.

Vocabulary Builder

  • AI audit: A systematic examination of an AI system's design, development, deployment, and outcomes.
  • Bias audit: An audit focused specifically on whether an AI system produces systematically different outcomes for different demographic groups.
  • Algorithmic accountability: The obligation of AI developers and deployers to explain and justify algorithmic decisions and their effects.
  • Third-party audit: An audit conducted by an independent external organization with no financial relationship to the audited party.
  • Model card: A standardized documentation template for AI models, describing their intended use, performance metrics, limitations, and training data.
  • Datasheets for Datasets: A standardized documentation template for AI training datasets, analogous to datasheets for electronic components.
  • Impact assessment: A systematic evaluation of an AI system's potential effects on affected populations, conducted before deployment.

Section 19.2: Why AI Auditing Is Hard

AI auditing faces challenges that have no direct analog in financial auditing. These challenges fall into three categories: technical, organizational, and legal.

Technical Challenges

The access problem is the most fundamental technical challenge: to conduct a meaningful technical audit, auditors need access to the model, its training data, and the deployment context. Model access means being able to examine the architecture, query the model systematically, and test its behavior across a wide range of inputs. Data access means being able to examine the training data for representativeness, quality, and potential sources of bias. Deployment context access means understanding how the model is integrated into organizational processes and what decisions it influences. Organizations treat models and training data as highly proprietary; providing external access to this information creates intellectual property risks and competitive exposure. As a result, external auditors often have to work with much less information than a technically complete audit would require.

The opacity problem is specific to modern machine learning systems. Deep neural networks — the class of models that underpins most contemporary high-performing AI — are not designed to be interpretable. They consist of millions or billions of numerical parameters, adjusted through training processes that are not transparent to human inspection, producing outputs through computational processes that do not yield human-interpretable explanations. Auditing a system that cannot explain its own decisions is fundamentally different from auditing a rule-based system whose logic can be specified and inspected. Explainability techniques — LIME, SHAP, saliency maps, attention visualization — can provide partial insight, but they are themselves imperfect and contested, and their outputs are not always reliable guides to actual model behavior.

The dynamic systems problem refers to the fact that AI systems change over time. Retraining on new data, parameter updates, and changes in the deployment environment all alter system behavior. An audit conducted on a system at one point in time may not accurately describe the same system six months later. Continuous monitoring is required, but continuous monitoring is expensive and technically demanding.

The context dependence problem refers to the fact that AI system behavior can differ dramatically across deployment contexts. A model that performs well for the population it was tested on may perform poorly for the population it is actually deployed to serve. A model whose behavior is appropriate when used by trained professionals may be misused when deployed in a consumer-facing context. Auditing the system in the lab does not guarantee that the audit results generalize to the deployment context.

Organizational Challenges

Resistance to audit access. Many organizations treat their AI systems as proprietary intellectual property — the source of competitive advantage — and are reluctant to provide the access that meaningful external auditing requires. This resistance is compounded by legal concerns: if the external audit finds problems, those findings could become evidence in litigation. The combination of IP concern and litigation risk creates powerful organizational incentives to limit external access.

The expertise gap. AI auditing requires a rare combination of skills: technical understanding of machine learning sufficient to evaluate model behavior; domain expertise in the context in which the AI is deployed (healthcare, finance, criminal justice, employment); legal knowledge sufficient to assess regulatory compliance; and organizational process expertise sufficient to evaluate governance practices. Very few individuals have all of these competencies, and the audit profession is only beginning to develop the training and certification pathways that would produce qualified AI auditors at scale.

Conflicts of interest. Internal auditors report to management. Management has an interest in favorable audit findings. This structural conflict of interest is well-recognized in financial auditing, where the audit profession has developed extensive independence requirements: auditors cannot audit clients for whom they have performed other services, cannot hold financial interests in audited companies, and must maintain independence from management in both fact and appearance. AI auditing has not yet developed equivalent standards, and many of the organizations providing AI auditing services also provide consulting, development, and compliance services — creating financial relationships that compromise independence.

Scope definition. What does an AI audit cover? The model? The data? The deployment process? The outcomes for affected populations? The organizational governance practices? Different answers to this question produce audits of dramatically different scope, depth, and value. Without agreed standards for what an AI audit must cover, "audit" can mean anything from a superficial documentation review to a comprehensive technical and organizational examination.

Legal Challenges

Trade secret protection. AI models and training data are typically protected as trade secrets — proprietary information that provides competitive value because it is not publicly known. Trade secret law protects this information from disclosure, including in legal proceedings (with some limits). This protection creates legal obstacles to external AI auditing: organizations can argue that providing auditors with access to models and data would constitute trade secret disclosure that the law does not require. Resolving the tension between trade secret protection and audit access is one of the central unsolved legal problems in AI governance.

Liability risk from audit findings. An audit that finds material problems with an AI system creates documented evidence that could be used against the organization in litigation. Organizations that have been audited and found problematic may be worse off in litigation than organizations that have never been audited — because the plaintiff can point to the audit findings as evidence that the organization knew about the problem and failed to fix it. This perverse incentive — where auditing increases litigation risk — must be addressed through some form of legal protection for audit findings if mandatory auditing is to be effective. Analogies exist: aviation incident reporting is protected from use in litigation precisely to encourage honest reporting.

The absence of universal standards. Without agreed standards for what constitutes an adequate AI audit, it is difficult to evaluate whether a particular audit was conducted competently. This makes both regulatory enforcement and private litigation challenging: what standard of care does an AI auditor owe? What would a "deficient" audit look like? These questions are beginning to be answered through emerging professional standards documents, but the field is far from the settled standards that govern financial audit.


Section 19.3: Pre-Deployment Auditing — Algorithmic Impact Assessments

The Concept

The most important moment in the AI governance lifecycle is before deployment. Once an AI system is deployed at scale, it becomes entrenched — in organizational processes, vendor contracts, user behavior, and technical infrastructure — in ways that make correction costly and politically difficult. This is the Collingridge dilemma in practice. Pre-deployment assessment is the most effective point at which to identify and address potential harms.

An algorithmic impact assessment (AIA) is a systematic, documented process for evaluating an AI system's potential effects on affected populations before deployment. The concept draws on two established models from other domains: environmental impact assessment (required for major federal projects under the National Environmental Policy Act) and privacy impact assessment (required for federal information systems under the E-Government Act and for certain processing activities under GDPR). Both models share the core logic of the AIA: before taking consequential action, document its potential harms, assess their severity and likelihood, and plan for mitigation.

Elements of a Comprehensive AIA

A thorough algorithmic impact assessment should contain the following elements:

System description. A clear description of the AI system: what it does, how it works, what data it uses, what outputs it produces, and how those outputs are used in decision-making. This should be sufficiently detailed that a technically competent reviewer can understand the system's design, but sufficiently accessible that non-technical stakeholders can engage with it.

Intended use and use cases. A specification of the contexts in which the system is intended to be used, the decision-making processes it is intended to support, and the populations it is intended to serve. This is important not just for understanding the system, but for identifying cases of potential misuse or out-of-context deployment.

Affected populations. An identification of all populations who will be directly or indirectly affected by the system's deployment — not just the intended users, but everyone whose interests are affected by the decisions the system influences. In an AI hiring context, the affected population includes job applicants (who are directly affected), existing employees (whose advancement may be affected), and the organization as a whole (whose workforce composition is affected).

Potential harms: systematic identification. A systematic enumeration of potential harms, organized by harm type (physical, economic, dignitary, privacy, psychological, social), by affected population, and by likelihood and severity. This is the most intellectually demanding part of an AIA: identifying harms requires imagination and adversarial thinking, not just technical testing. It should include scenarios of unintended use and misuse, not just intended use.

Fairness analysis. A demographic performance analysis that examines how the system performs across relevant demographic groups — typically race, gender, age, disability status, and whatever other characteristics are protected under applicable law or of legitimate public concern. This analysis should include multiple fairness metrics (demographic parity, equalized odds, predictive parity — discussed in Chapter 9), because different metrics capture different aspects of fairness and can give different verdicts on the same system.

Mitigation plan. A description of how identified risks will be addressed: through design changes, deployment constraints, human oversight requirements, monitoring mechanisms, or user communication. The mitigation plan should be specific and realistic, not aspirational. It should identify who is responsible for each mitigation measure and what the timeline for implementation is.

Monitoring plan. A description of how the system will be monitored after deployment for ongoing performance, demographic disparities, and emerging harms. Who will conduct the monitoring? What data will be collected? At what frequency? What will trigger a formal review or a decision to discontinue the system?

Government Requirements for AIAs

Several jurisdictions have already moved beyond voluntary AIAs to mandatory requirements:

Canada's Directive on Automated Decision-Making (2019, revised 2023) requires federal government departments to complete an Algorithmic Impact Assessment before deploying automated decision systems. The depth of the required assessment is scaled to the stakes of the decision: systems that make decisions with significant individual impact require more comprehensive assessment than systems that provide low-stakes recommendations. The directive also requires peer review of high-impact systems and public disclosure of assessment results.

The UK Government's AI Assurance Techniques guidance document (2022) describes a range of assessment approaches for AI systems used in government contexts, though it stops short of mandatory requirements for all uses.

Proposed U.S. legislation. The Algorithmic Accountability Act, first introduced in 2019 and reintroduced multiple times, would require companies above a size threshold to conduct impact assessments for consequential automated decision systems and to report findings to the FTC. As of this writing, the bill has not been enacted, but it has shaped regulatory and policy debate.

Internal AIA Processes: Corporate Practice

Several major technology companies have developed internal impact assessment frameworks:

Microsoft's Responsible AI Impact Assessment framework provides structured templates for assessing AI systems for fairness, reliability, privacy, security, and transparency. Microsoft publishes the framework publicly and has integrated it into its development processes for Azure AI services.

IBM's AI Fairness 360 and associated governance frameworks include impact assessment components, with particular emphasis on quantitative fairness testing.

Google's Responsible AI practices include internal review processes for products with significant AI components, including structured harm assessment.

These voluntary corporate frameworks are meaningful — they institutionalize the practice of pre-deployment harm assessment — but they have significant limitations. They are self-assessed, creating the conflicts of interest that independent review is designed to address. They are not publicly disclosed in most cases, limiting external accountability. And they may be applied inconsistently across an organization or abandoned when they conflict with deployment timelines.

The Limitations of Impact Assessments

Impact assessments are only as good as the imagination of those conducting them. A harm that no one on the assessment team thinks to consider will not appear in the assessment. A harm that affects a population not represented on the assessment team may be systematically overlooked. And an organization under competitive pressure to deploy may conduct assessments that are technically compliant but not genuinely searching.

These limitations suggest that effective AIAs require diverse assessment teams that include perspectives from affected communities, independence from the development team to avoid groupthink, and external review for high-risk systems. They also require regulatory backing: voluntary AIAs are meaningful governance artifacts but insufficient accountability mechanisms.


Section 19.4: Technical Auditing — What to Test and How

Data Auditing

The performance of any AI system is fundamentally constrained by the quality and representativeness of the data on which it was trained. Data auditing is the examination of training data to identify potential sources of bias, quality problems, and ethical concerns.

Representativeness analysis examines whether the training data adequately represents the population to which the model will be deployed. A facial recognition system trained on images that underrepresent darker-skinned faces will perform worse on darker-skinned individuals — a finding documented across commercial face recognition systems by Buolamwini and Gebru (2018) in the Gender Shades study. A credit-scoring model trained on a historical dataset that excluded certain populations will have poor predictive validity for those populations. Representativeness analysis requires knowing the characteristics of the training data population and comparing them to the characteristics of the deployment population.
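As a concrete sketch of what this comparison involves, the snippet below computes, for each group, the gap between its share of a training sample and its assumed share of the deployment population. All group names and numbers are illustrative.

```python
# Hypothetical representativeness check: compare the demographic
# composition of a training sample against the deployment population.
from collections import Counter

def representation_gaps(train_groups, deploy_shares):
    """Return each group's training share minus its deployment share.

    train_groups: list of group labels, one per training example.
    deploy_shares: dict mapping group label to its share of the
    deployment population (shares should sum to 1).
    """
    counts = Counter(train_groups)
    n = len(train_groups)
    return {g: counts.get(g, 0) / n - share
            for g, share in deploy_shares.items()}

train = ["A"] * 80 + ["B"] * 20    # group B is 20% of training data
deployment = {"A": 0.5, "B": 0.5}  # but 50% of the deployment population

gaps = representation_gaps(train, deployment)
# gaps["B"] == -0.3: group B is underrepresented by 30 percentage points
```

A real audit would also test whether such gaps are statistically significant and whether they track performance differences across groups.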

Quality assessment examines error rates, missing data patterns, labeling consistency, and other markers of data quality. Systematic labeling errors — for example, if a dataset was labeled by annotators who hold implicit biases — will produce a model that encodes those biases. Missing data that is non-randomly distributed (if data is missing more often for certain demographic groups) can introduce bias even when the training data otherwise appears representative.
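A minimal sketch of a missingness check by group, with illustrative records and field names:

```python
# Sketch: missing-data rates by demographic group. Non-random missingness
# (higher rates for some groups) can introduce bias even when the data
# looks representative overall.
def missing_rate_by_group(records, group_key, field):
    rates = {}
    for g in {r[group_key] for r in records}:
        rows = [r for r in records if r[group_key] == g]
        missing = sum(1 for r in rows if r.get(field) is None)
        rates[g] = missing / len(rows)
    return rates

records = [
    {"group": "A", "income": 50_000},
    {"group": "A", "income": 62_000},
    {"group": "B", "income": None},    # income missing more often for B
    {"group": "B", "income": 41_000},
]
rates = missing_rate_by_group(records, "group", "income")
# rates["B"] is 0.5 vs 0.0 for group A: a red flag worth investigating
```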

Provenance examination investigates where the training data came from and whether its collection was ethical. Did individuals consent to their data being used for AI training? Were collection practices compliant with applicable privacy laws? Were the data collection agreements accurate about how data would be used? Datasheets for Datasets (Gebru et al., 2018) provide a standardized documentation format for recording data provenance and characteristics — analogous, as the name suggests, to the datasheets that accompany electronic components.

Model Auditing

Model auditing examines the AI system itself: its performance, its fairness properties, and its vulnerabilities.

Accuracy and performance metrics across demographic groups. This is the most commonly required component of bias auditing under laws like NYC LL 144. The auditor computes performance metrics — accuracy, precision, recall, false positive rate, false negative rate — separately for different demographic groups and examines whether they differ significantly. For a hiring screening tool, this means computing selection rates for different racial, gender, and age groups. For a recidivism prediction tool, this means computing false positive rates (incorrectly labeling someone as high-risk) separately by race.
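The core computation is straightforward; the snippet below sketches it on illustrative toy labels, where group B suffers much higher error rates than group A:

```python
# Minimal sketch of a per-group error-rate breakdown. Labels, predictions,
# and group names are illustrative toy data, not a real audit sample.
def rates_by_group(y_true, y_pred, groups):
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        neg = [i for i in idx if y_true[i] == 0]
        pos = [i for i in idx if y_true[i] == 1]
        out[g] = {
            # share of true negatives incorrectly flagged positive
            "false_positive_rate": sum(y_pred[i] for i in neg) / len(neg),
            # share of true positives incorrectly cleared
            "false_negative_rate": sum(1 - y_pred[i] for i in pos) / len(pos),
        }
    return out

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

by_group = rates_by_group(y_true, y_pred, groups)
# Group A: FPR 0.0, FNR 0.0; group B: FPR 1.0, FNR 0.5, a stark disparity
```

In practice the auditor would compute these rates on large samples and test whether the differences are statistically and practically significant.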

Fairness metric calculation. Multiple quantitative fairness metrics can be applied to an AI system, and they can give different verdicts on the same system (a point developed at length in Chapter 9). Demographic parity asks whether the system produces the same positive outcome rate across groups. Equalized odds asks whether the system produces the same true positive rate and false positive rate across groups. Predictive parity asks whether the system's positive predictions are correct at the same rate across groups. Because these metrics cannot all be satisfied simultaneously (when base rates differ across groups), auditors must specify which metrics are most appropriate for the specific context and why.
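The following sketch, on illustrative toy data, shows how the three metrics can disagree about the same predictions: selection rates are equal across groups (demographic parity holds), but true positive rates and precision are not (equalized odds and predictive parity fail):

```python
# Sketch: three fairness metrics computed on the same toy predictions.
def metric_by_group(y_true, y_pred, groups, metric):
    out = {}
    for g in sorted(set(groups)):
        t = [y for y, grp in zip(y_true, groups) if grp == g]
        p = [y for y, grp in zip(y_pred, groups) if grp == g]
        out[g] = metric(t, p)
    return out

def selection_rate(t, p):   # demographic parity compares these
    return sum(p) / len(p)

def tpr(t, p):              # equalized odds compares these (and FPR)
    pos = [pi for ti, pi in zip(t, p) if ti == 1]
    return sum(pos) / len(pos)

def precision(t, p):        # predictive parity compares these
    hits = [ti for ti, pi in zip(t, p) if pi == 1]
    return sum(hits) / len(hits)

groups = ["A"] * 4 + ["B"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

sel = metric_by_group(y_true, y_pred, groups, selection_rate)
tprs = metric_by_group(y_true, y_pred, groups, tpr)
ppv = metric_by_group(y_true, y_pred, groups, precision)
# sel:  {'A': 0.5, 'B': 0.5}  -> demographic parity satisfied
# tprs: {'A': 1.0, 'B': 0.5}  -> equalized odds violated
# ppv:  {'A': 1.0, 'B': 0.5}  -> predictive parity violated
```

This is why an audit report must state which metric it applied and why; a system can pass one metric while failing another on identical data.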

Feature importance analysis. Machine learning models use input features to make predictions, and the features that drive predictions can be examined through feature importance methods. Auditors should examine whether proxy variables — features that are correlated with protected characteristics like race or gender — are among the most important predictors. If ZIP code is a top predictor of credit scores, and ZIP code is correlated with race due to historical housing segregation, then the model is effectively using race as a predictor, even if race is not an explicit input.
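One simple screen for proxies, sketched below, is the correlation between a candidate feature and a protected-attribute indicator; the ZIP-code risk tiers and group labels here are invented for illustration:

```python
# Sketch: a point-biserial-style correlation between a candidate feature
# and a 0/1 protected-attribute indicator. A strong correlation means the
# model can use the feature as a stand-in for the attribute even if the
# attribute itself is never an input.
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative: ZIP-code risk tier vs. a protected-group indicator.
zip_tier = [1, 1, 2, 2, 3, 3, 4, 4]
protected = [0, 0, 0, 0, 1, 1, 1, 1]
r = pearson(zip_tier, protected)  # close to 1: zip tier acts as a proxy
```

Correlation alone does not prove the model relies on the feature; a full audit pairs this screen with feature importance methods run on the model itself.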

Adversarial testing. Auditors should test how the model behaves when inputs are systematically manipulated. This includes testing with adversarial examples designed to fool the model, testing for model behavior on inputs from the tails of the distribution (edge cases), and testing for the model's sensitivity to small input changes (which should not produce dramatically different outputs for similar cases).
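A minimal sketch of a sensitivity probe, using a toy linear scorer as a stand-in for a real model:

```python
# Sketch: perturbation sensitivity. `score` stands in for any model's
# scoring function; here it is a toy linear scorer for illustration.
def score(features):
    weights = [0.4, 0.3, 0.3]
    return sum(w * f for w, f in zip(weights, features))

def max_sensitivity(features, eps=0.01):
    """Largest score change from nudging any single feature by eps."""
    base = score(features)
    deltas = []
    for i in range(len(features)):
        bumped = list(features)
        bumped[i] += eps
        deltas.append(abs(score(bumped) - base))
    return max(deltas)

sens = max_sensitivity([0.5, 0.2, 0.9])
# For this linear scorer, sens equals the largest weight times eps;
# a scorer where tiny input changes swing the score sharply would fail
# the "similar cases get similar outputs" expectation.
```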

Model cards (Mitchell et al., 2019) are structured documentation templates for AI models, covering intended use, performance metrics across demographic groups, evaluation data, training data, and known limitations. They serve an accountability function: requiring model cards forces developers to document what they know about their model's behavior and limitations.

Output Auditing

Output auditing examines the real-world effects of an AI system's decisions on affected populations.

Disparate impact analysis examines whether the AI system's outputs produce systematically different outcomes for different demographic groups. For a hiring system, this means comparing selection rates by race, gender, and age. For a lending system, this means comparing loan approval and denial rates across racial groups, controlling for creditworthiness. For a content recommendation system, this means examining whether certain types of content are disproportionately recommended to or suppressed from certain demographic groups.
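A standard computation here is the impact ratio: each group's selection rate divided by the highest group's rate, with ratios below 0.8 commonly treated as a red flag under the EEOC's four-fifths rule of thumb. NYC LL 144 bias audits require a similar calculation. A minimal sketch, with invented counts:

```python
# Sketch of the impact-ratio computation used in disparate impact
# analysis: each group's selection rate relative to the best-off group.
def impact_ratios(selected_by_group, total_by_group):
    rates = {g: selected_by_group[g] / total_by_group[g]
             for g in total_by_group}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}

ratios = impact_ratios({"A": 50, "B": 25}, {"A": 100, "B": 100})
# ratios == {"A": 1.0, "B": 0.5}; group B falls well below the 0.8 line
```

The four-fifths threshold is a screening heuristic, not a legal verdict; a ratio below 0.8 triggers further analysis rather than settling the question.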

Longitudinal tracking monitors performance over time to detect drift — the degradation of AI system performance as the deployment environment changes. A model that was fair and accurate at deployment may become less fair or accurate as the world changes. Continuous monitoring is required to detect drift before it has caused significant harm.
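One widely used drift screen is the Population Stability Index (PSI), which compares a score distribution at launch against a later window; the thresholds of roughly 0.1 (watch) and 0.25 (investigate) are conventions rather than standards. A sketch with illustrative bin proportions:

```python
# Sketch: Population Stability Index, a common drift screen comparing a
# binned score distribution at deployment time against a later window.
import math

def psi(expected, actual):
    """expected/actual: lists of bin proportions that each sum to 1."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

baseline = [0.25, 0.25, 0.25, 0.25]  # score distribution at launch
current = [0.10, 0.20, 0.30, 0.40]   # distribution six months later
drift = psi(baseline, current)       # above 0.1 suggests meaningful drift
```

A monitoring plan would run this (or a similar divergence measure) on a schedule, for the population overall and for each demographic group separately, since drift can be group-specific.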

Outcome data analysis is the most demanding form of output auditing: it examines actual outcomes for individuals affected by AI decisions, not just the AI's outputs. Did people who received higher credit scores from the AI system actually have lower default rates? Did employees who were flagged as high-flight-risk by the AI system actually leave? Did patients who were identified as high-risk by the AI diagnostic system actually have the relevant condition? Outcome data analysis requires data that is often not collected, and it requires matching AI outputs to subsequent real-world outcomes — a technically challenging and often organizationally impractical requirement.

Auditing Tools

Several open-source tools support technical AI auditing:

AI Fairness 360 (IBM) is a comprehensive Python toolkit that implements dozens of fairness metrics and bias mitigation algorithms. It is widely used in research and practice and supports pre-processing (addressing bias in training data), in-processing (building fairness constraints into the model), and post-processing (adjusting model outputs) approaches.

Fairlearn (Microsoft) is an open-source Python toolkit focused on fairness assessment and mitigation, with particular strength in interactive visualization of fairness-accuracy tradeoffs.

Aequitas (University of Chicago Center for Data Science and Public Policy) is a bias and fairness audit toolkit designed specifically for public policy AI applications, with particular emphasis on the criminal justice context.

What-If Tool (Google) is an interactive visualization tool for examining model behavior across demographic groups, supporting both technical and non-technical users.

Audit AI (pymetrics) is a Python package designed specifically for auditing AI hiring tools, focusing on adverse impact analysis.


Section 19.5: External Auditing — The Independence Problem

Why External Auditing Matters

The case for external auditing rests on the fundamental insight of the audit function: organizations have conflicts of interest when evaluating their own practices. A company that has invested in building and deploying an AI system has powerful incentives to conclude that the system performs adequately and does not cause harm. These incentives — financial, reputational, organizational — are not inherently corrupt, but they are structural, and they are sufficient to make internal assessments unreliable guides to external stakeholders, regulators, and the public.

External auditing addresses this by interposing an independent professional between the audited organization and the audience for the audit results. The external auditor owes a professional duty to the public (or the regulatory body), not to the organization being audited, has no financial interest in favorable findings, and faces professional and legal consequences for deficient work. The credibility that external auditing provides is not merely reputational — it is structural, arising from the independence of the external auditor's institutional position.

What External Auditing Requires

Effective external AI auditing requires three things: access, expertise, and independence.

Access means the ability to examine the AI system, its training data, its development documentation, and the deployment context with enough depth to form a meaningful opinion. This is the central challenge: organizations resist giving external parties access to proprietary models and data. Where external access is not legally required, organizations typically provide auditors with documentation and limited testing access through deployment interfaces — which may be sufficient for a compliance audit but inadequate for a comprehensive technical audit.

Expertise means possessing the technical, domain, and legal knowledge necessary to understand and evaluate the AI system in its deployment context. An AI auditor who does not understand the statistical concepts underlying fairness metrics cannot adequately evaluate whether the system is fair. An auditor who does not understand the domain in which the system is deployed cannot adequately assess whether the system's outputs are appropriate. And an auditor who does not understand the legal requirements applicable to the system cannot adequately assess compliance.

Independence means freedom from financial and organizational relationships that could compromise the auditor's objectivity. Financial independence requires that the auditor not have a material financial interest in the audited organization and that audit fees be structured in ways that do not create incentives for favorable findings. Organizational independence requires that the auditor not report to or be overseen by the management of the audited organization.

The Access Problem in Practice

The access problem is the most pressing practical obstacle to effective external AI auditing. Organizations have three categories of concern about providing access:

First, intellectual property: the model architecture, training data, and development practices represent significant investment and competitive advantage. Sharing them with external auditors — even under confidentiality agreements — creates IP risk.

Second, litigation risk: an audit that finds problems creates documented evidence of those problems. If the problems are not fully remediated, the audit findings become plaintiff's evidence in subsequent litigation.

Third, strategic risk: if audit findings must be disclosed publicly (as required by NYC LL 144), unfavorable findings could damage the organization's reputation and competitive position.

These concerns are legitimate. Addressing them requires legal structures that protect audit confidentiality while still providing for meaningful external review and some form of public disclosure.

Emerging External AI Audit Firms

A small but growing number of firms specialize in external AI auditing:

BABL AI (Bias, Accountability, Benchmarks, and Leadership) conducts bias audits of employment AI tools, focusing on NYC LL 144 compliance. It was among the first firms to conduct public AI bias audits and has published audit reports for multiple employers.

O'Neil Risk Consulting and Algorithmic Auditing (ORCAA), founded by mathematician Cathy O'Neil (author of Weapons of Math Destruction), conducts external audits of algorithmic systems in multiple domains.

Parity AI conducts bias and fairness audits of AI systems, particularly in the employment context.

KPMG, Deloitte, PwC, and EY — the major financial audit firms — have all developed AI audit practices, recognizing that AI governance is becoming a significant component of corporate governance and regulatory compliance.

The emergence of these firms represents genuine progress, but the field is still nascent. Without professional credentialing requirements, mandatory disclosure obligations covering most AI systems, or liability consequences for deficient audits, the AI audit market lacks the institutional infrastructure that makes financial auditing effective.

The Financial Auditing Analogy

Financial auditing provides a useful — if imperfect — model for understanding what mature AI auditing could look like. Financial auditing did not always have the institutional structure it has today. The requirement that public companies have independent external audits of their financial statements was first enacted in the Securities Exchange Act of 1934, in response to the accounting fraud that contributed to the stock market crash of 1929. The specific standards governing audit practice, and the institutional oversight of auditors by the PCAOB, were not established until the Sarbanes-Oxley Act of 2002, in response to the Enron and WorldCom accounting scandals.

The lesson of financial auditing history is that the institutional infrastructure for effective external auditing — standards, credentialing, liability, regulatory oversight — does not emerge spontaneously from market forces. It requires legislative and regulatory action, typically prompted by high-profile failures that demonstrate the consequences of inadequate auditing. AI auditing is at an earlier stage than financial auditing was in 1934. The failures that will prompt more comprehensive regulatory action are likely still occurring.

The "Audit Without Access" Approach

When model access is denied, auditors can still draw conclusions from output data alone. ProPublica's COMPAS investigation (discussed in Case Study 19-2) is the paradigmatic example: using court records to track criminal defendants' actual recidivism, matched to their COMPAS scores, ProPublica constructed an analysis of the system's racial disparities without ever accessing the COMPAS model or its training data.

This approach has significant advantages: it does not require the cooperation of the organization whose system is being audited, and it examines actual outcomes rather than model behavior in controlled testing conditions. But it also has limitations: it cannot identify the source of disparities (whether they arise from the model, the training data, or deployment practices), it requires access to outcome data that may itself be difficult to obtain, and it is subject to the critique that observed disparities reflect real-world differences rather than model bias.


Section 19.6: Regulatory Requirements for AI Auditing

NYC Local Law 144

New York City's Local Law 144, effective January 1, 2023, represents the first mandatory AI bias audit requirement in the United States. The law applies to employers in New York City who use "automated employment decision tools" — defined as computational processes used to screen resumes, score candidates, or make hiring recommendations — in employment decisions affecting New York City jobs.

The law's core requirements are:

Bias audit obligation. Before using an AEDT, and annually thereafter, employers must obtain an independent bias audit of the tool, conducted by a "qualified auditor" who is not employed by the tool's vendor or the employer.

Disclosure obligation. Employers must post the results of the bias audit on their website, including the selection rates and impact ratios calculated by the audit.

Notice obligation. Before using an AEDT to assess a candidate, employers must notify candidates that an automated tool is being used and describe the tool's data collection.

Candidate accommodation. Employers must accommodate candidates who request an alternative selection process that does not use the AEDT.

NYC LL 144 has been influential despite its limitations. The New York City Department of Consumer and Worker Protection published implementing rules in 2023 that defined what a bias audit must include: calculation of selection rates (or, for scoring tools, scoring rates) for each race/ethnicity, gender, and intersectional category; impact ratios comparing each group's rate to that of the most favorably rated group; and disclosure of these calculations on the employer's website.
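For tools that score candidates rather than pass or fail them, the implementing rules define a "scoring rate" (the share of a category scoring above the median of the full sample) and compute impact ratios over those rates. The sketch below simplifies the rule considerably; the data is hypothetical, and tie handling and intersectional categories are omitted.

```python
from statistics import median

def scoring_rates(scores_by_group):
    """Share of each category scoring above the median of the full sample,
    a simplified version of the 'scoring rate' in the DCWP implementing
    rules (tie handling and intersectional categories omitted)."""
    all_scores = [s for group in scores_by_group.values() for s in group]
    cutoff = median(all_scores)
    return {g: sum(1 for s in group if s > cutoff) / len(group)
            for g, group in scores_by_group.items()}

def impact_ratios(rates):
    """Each category's rate divided by the highest category's rate."""
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

# Hypothetical AEDT scores by demographic category
scores = {"group_1": [55, 80, 90, 85], "group_2": [40, 60, 70, 65]}
rates = scoring_rates(scores)   # group_1: 0.75, group_2: 0.25
ratios = impact_ratios(rates)   # group_1: 1.0,  group_2: ~0.33
```

The simplicity of this calculation is part of the critics' point: an impact ratio of this form is easy to compute, easy to disclose, and easy to game without addressing the underlying sources of disparity.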

Early implementation has been uneven. Many employers initially failed to comply, and enforcement was slow. Critics noted that the law's definition of bias audit was narrow (focused on selection rate disparities) and did not require examination of whether the disparities resulted from genuine performance differences or from proxy discrimination. Some auditors certified compliance on the basis of analyses that did not meaningfully test for bias.

EU AI Act: Conformity Assessments

The EU AI Act (2024) creates a comprehensive framework for AI auditing through the mechanism of "conformity assessments." High-risk AI systems — including AI used in employment, education, credit decisions, law enforcement, and several other categories — must undergo conformity assessments before deployment.

For most high-risk AI systems, the conformity assessment can be conducted internally — through the organization's own quality management system — provided that the organization has implemented the AI Act's requirements for technical documentation, risk management, data governance, and human oversight. For AI systems in the most sensitive categories (AI systems used in biometric identification, law enforcement predictive policing, and critical infrastructure), third-party conformity assessment by a notified body (an EU-accredited certification organization) is required.

The EU AI Act's conformity assessment framework represents a more comprehensive approach than NYC LL 144: it covers a wider range of AI applications, it requires assessment against a broader set of standards (not just bias metrics), and it integrates assessment into a mandatory registration system. But critics note that allowing self-assessment for most high-risk AI systems limits the independence that makes external auditing valuable.

Financial Services: SR 11-7

Long before AI auditing was a recognized practice, the Federal Reserve's Supervisory Guidance on Model Risk Management (SR 11-7), issued in 2011, established mandatory model validation requirements for financial institutions using quantitative models in risk management. SR 11-7 requires that models be independently validated — reviewed by a function independent of the development team — before deployment and on an ongoing basis.

SR 11-7 applies to credit scoring models, risk models, and increasingly to AI-based models used in financial services. It is the most established regulatory framework for AI auditing in the United States and provides a template for what mandatory audit requirements can look like in practice: clear scope, independence requirements, minimum content standards, and enforcement through the bank examination process.

EEOC UGESP and AI Hiring Tools

The Equal Employment Opportunity Commission's Uniform Guidelines on Employee Selection Procedures (UGESP), in effect since 1978, require employers to maintain records of the adverse impact of their selection procedures by race, gender, and national origin — and to conduct validity studies justifying any procedure that produces adverse impact. These requirements apply to AI hiring tools: an employer who uses an AI screening tool must track its adverse impact by demographic group and must be able to justify the tool's validity if adverse impact is found.

The EEOC has made clear that UGESP applies to AI hiring tools and that employers are responsible for validation regardless of whether the tool was built by a third party. This creates a de facto continuous monitoring and periodic validation requirement for AI hiring tools used by covered employers.


Section 19.7: Auditing Generative AI and Foundation Models

The New Challenge

The emergence of large language models and other generative AI systems has created an auditing challenge that the frameworks developed for earlier AI systems are not well-equipped to handle. Traditional AI auditing was designed for systems with well-defined tasks and bounded outputs: a credit-scoring model takes defined inputs and produces a numerical score; a hiring screening tool takes resume text and produces a ranking. These systems can be audited by testing their behavior systematically across a defined test set and measuring performance against specified metrics.

Generative AI systems — large language models capable of producing open-ended text, image generators capable of producing images in arbitrary styles — have emergent behaviors that are not fully specified at design time and cannot be completely enumerated in advance. A large language model can produce outputs across an essentially unbounded space, making comprehensive auditing in the traditional sense impossible. This requires different auditing methodologies.

Red-Teaming

Red-teaming — borrowed from military and cybersecurity practice — is the most widely used methodology for auditing generative AI. In AI red-teaming, a team of researchers is given access to the AI system with the task of finding harmful behaviors: producing content that promotes violence, generates disinformation, reveals personal information about real individuals, assists with chemical or biological weapons development, or violates other safety guardrails. The red team probes the system systematically, trying to find inputs that elicit harmful outputs.

Red-teaming is valuable because it applies human creativity and adversarial thinking to the problem of finding failure modes that systematic testing would miss. It is limited because it is inherently incomplete: no red team can probe all possible inputs, and adversarial actors with motivation to find harmful behaviors may be more persistent and creative than red team members. Red-teaming results are also sensitive to the composition of the red team — teams that do not include diverse perspectives will be less effective at finding harms that disproportionately affect underrepresented communities.

Capability Evaluations

Capability evaluations assess what a model can and cannot do — with particular attention to high-risk capabilities that could enable harm at scale. The most important capability evaluations focus on: the ability to assist with weapons of mass destruction development; the ability to conduct cyberattacks; the ability to engage in sophisticated social manipulation at scale; and the ability to operate autonomously in ways that could circumvent human oversight.

Capability evaluations are becoming a component of regulatory frameworks for advanced AI. The UK's AI Safety Institute (now AI Security Institute) and the U.S. AI Safety Institute have both developed evaluation frameworks for advanced AI models. The EU AI Act requires capability evaluations for general-purpose AI models above a defined capability threshold, measured in training compute (10^25 floating-point operations).

The "Eval" Ecosystem

Several organizations have developed standardized evaluation frameworks for AI models:

OpenAI Evals is an open-source framework for evaluating large language models, with a registry of evaluation tasks contributed by the community. It supports both automated evaluation (using standard benchmark datasets) and human evaluation (rating model outputs on specified criteria).

Anthropic's Constitutional AI framework evaluates model outputs against an explicit written set of principles (a "constitution"), using both human feedback and AI-assisted evaluation.

METR (Model Evaluation and Threat Research, formerly ARC Evals) develops evaluations for dangerous and concerning capabilities of advanced AI models, with a focus on long-horizon autonomous task completion.

The standardization represented by these frameworks is genuine progress. But standardized benchmarks can be gamed: organizations can optimize models to perform well on specific benchmarks without improving their actual safety properties. This "benchmark overfitting" problem requires that evaluation frameworks be regularly updated and that benchmark details remain partially confidential.


Section 19.8: Building an Organizational AI Audit Function

Where Audit Fits in the Governance Structure

The organizational design of an AI audit function requires careful attention to independence: an audit function that reports to the same management that oversees AI development has conflicts of interest that undermine its value. Effective audit independence requires that the AI audit function report to a party who is not directly responsible for AI development outcomes — ideally to the board's audit committee or to the Chief Risk Officer, with clear authority to escalate findings without management filtering.

In larger organizations, the AI audit function may be part of a broader internal audit department, with AI-specific expertise developed within the audit team. In smaller organizations, AI auditing may be conducted by a combination of internal technical experts and external auditors, with the internal function focused on continuous monitoring and the external function focused on periodic comprehensive assessment.

The audit committee of the board of directors should have oversight responsibility for AI audit findings. This board-level oversight is essential for high-risk AI deployments: if AI audit findings can be filtered by management without reaching the board, the governance value of the audit function is severely limited.

Audit Scope and Frequency

Not all AI systems require the same depth or frequency of audit. A risk-based approach to audit scope and frequency allocates audit resources in proportion to the stakes of the AI system being audited:

High-risk AI systems (those making or materially influencing high-stakes decisions affecting individuals' rights, economic opportunities, or physical safety) should be audited comprehensively before deployment, audited externally at least annually, and monitored continuously. This category includes AI systems used in employment, credit, healthcare, housing, and criminal justice.

Medium-risk AI systems (those making operational decisions with significant but more limited impact on individuals) should be audited before deployment and reviewed internally on an annual or semi-annual basis, with external audit every two to three years.

Low-risk AI systems (those making recommendations or automating tasks with limited individual impact) may be assessed through internal monitoring with periodic review, without mandatory external audit.
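The tiering above can be encoded as a simple triage table. The domain list and audit intervals restate the chapter's recommendations (30 months sits within the two-to-three-year range for medium-risk systems); the function and field names, and the reduction of risk assessment to two flags, are illustrative simplifications.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditSchedule:
    pre_deployment_audit: bool
    external_audit_interval_months: Optional[int]  # None = not mandatory
    continuous_monitoring: bool

# Illustrative mapping of the chapter's tiers to an audit cadence.
SCHEDULES = {
    "high":   AuditSchedule(True, 12, True),
    "medium": AuditSchedule(True, 30, False),
    "low":    AuditSchedule(False, None, False),
}

HIGH_RISK_DOMAINS = {"employment", "credit", "healthcare",
                     "housing", "criminal_justice"}

def risk_tier(domain: str, affects_rights_or_safety: bool,
              significant_operational_impact: bool = False) -> str:
    """Deliberately simplified triage rule: real risk assessment weighs
    many more factors than these two flags capture."""
    if domain in HIGH_RISK_DOMAINS or affects_rights_or_safety:
        return "high"
    return "medium" if significant_operational_impact else "low"
```

A triage table like this is where the Map function of the NIST AI RMF, discussed below, feeds directly into audit planning.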

Documentation Requirements

The AI audit function must establish and maintain documentation that enables after-the-fact accountability: the ability to reconstruct what an AI system was doing, how it was performing, and what decisions were made about it at any given time. Documentation requirements include: model cards for all deployed AI systems; records of training data provenance and quality assessment; records of pre-deployment testing and impact assessment; records of ongoing monitoring results; records of any incidents, complaints, or identified problems; and records of decisions to continue, modify, or discontinue AI systems.
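A minimal data structure for this documentation trail might look like the sketch below. The field names are illustrative, not drawn from any standard; in practice each field would reference versioned documents in a records system rather than in-memory lists.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AuditRecord:
    """Illustrative per-system documentation trail supporting
    after-the-fact accountability (field names are hypothetical)."""
    system_name: str
    model_card_uri: str               # link to the system's model card
    training_data_provenance: str     # source and quality-assessment notes
    pre_deployment_tests: List[str] = field(default_factory=list)
    monitoring_results: List[str] = field(default_factory=list)
    incidents: List[str] = field(default_factory=list)
    lifecycle_decisions: List[str] = field(default_factory=list)  # continue/modify/retire

record = AuditRecord(
    system_name="resume-screener-v2",
    model_card_uri="records/resume-screener-v2/model-card.md",
    training_data_provenance="vendor-supplied; provenance review 2024-Q1",
)
record.lifecycle_decisions.append("2024-03: continue, with quarterly bias review")
```

The test of such a structure is whether, for any past date, it lets an auditor reconstruct what the system was doing, how it was performing, and who decided what about it.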

Vendor Audit Clauses

Organizations that use third-party AI systems should negotiate contractual rights to audit those systems as a component of procurement due diligence. Vendor audit clauses should specify: the right to conduct (or commission) technical audits of the AI system; the right to access documentation of the system's training data, validation methodology, and known limitations; notification obligations when the vendor makes material changes to the system; and the vendor's obligation to cooperate with regulatory audits.

Without contractual audit rights, deploying organizations are dependent on vendor representations about system performance — a dependency that the EU AI Act's transparency requirements for AI providers are designed to reduce, but that remains a significant gap in U.S. AI governance.

The NIST AI RMF

The National Institute of Standards and Technology's AI Risk Management Framework (AI RMF), published in 2023, provides a comprehensive voluntary framework for organizational AI risk management that integrates auditing into a broader governance lifecycle. The AI RMF's four core functions — Govern, Map, Measure, and Manage — each have implications for organizational AI auditing:

Govern establishes the organizational structures, policies, and accountability mechanisms for AI risk management — including the audit function's place in the governance structure.

Map identifies and categorizes AI systems and their associated risks — the foundation for risk-based audit scope and frequency decisions.

Measure implements quantitative and qualitative assessments of AI system performance, risks, and impacts — the technical audit function.

Manage responds to identified risks through mitigation, monitoring, and corrective action — the post-audit governance function.

Integrating the AI audit function with the NIST AI RMF provides a comprehensive framework for organizational AI governance that connects auditing to the broader accountability lifecycle.


Discussion Questions

  1. NYC Local Law 144 requires bias audits only for employment AI tools, measured only through selection rate disparities. Critics argue this is too narrow: it doesn't cover outcome quality (are selected candidates actually more successful?), it doesn't cover other domains, and it focuses on a single metric that can be gamed. Proponents argue that even narrow requirements represent meaningful progress. Who is right, and what should a more comprehensive approach look like?

  2. The access problem — organizations' reluctance to give external auditors access to proprietary models and training data — is the central practical obstacle to effective external AI auditing. Propose three specific mechanisms (legal, contractual, or technical) that could provide external auditors with meaningful access while adequately protecting legitimate IP and confidentiality interests.

  3. The chapter compares AI auditing to financial auditing, noting that financial audit's institutional infrastructure took decades of crises and legislation to develop. Is this historical trajectory inevitable for AI auditing, or are there ways to accelerate the development of effective AI audit standards and institutions?

  4. Red-teaming is widely used to identify harmful capabilities in generative AI models. What are its strengths and limitations as an auditing methodology? How would you design a more comprehensive audit framework for a large language model deployed in a consumer context?

  5. The chapter notes that auditing creates a perverse incentive: organizations that audit and find problems are worse off in litigation than organizations that never audit. How would you design a legal framework that preserves the incentive to audit while protecting organizations from being penalized for good-faith auditing?

  6. The NIST AI RMF is voluntary. EU AI Act conformity assessments are mandatory. What specific features of the EU approach make it more likely to produce genuine accountability? What costs does mandatory assessment impose that voluntary frameworks avoid?

  7. An AI audit requires diverse perspectives to identify harms that affect underrepresented communities. But diverse audit teams may lack the technical expertise of homogeneous technical teams. How would you design an audit team and process that achieves both diversity and technical rigor?


Cross-references: Chapter 3 (accountability frameworks); Chapter 9 (COMPAS, fairness metrics); Chapter 18 (accountability — who bears responsibility for audit failures); Chapter 20 (liability — legal consequences of audit findings); Chapter 33 (EU AI Act — regulatory context for audit requirements).