Chapter 19: Key Takeaways — Auditing AI Systems

Core Concepts

  1. AI Auditing Is a New and Rapidly Developing Field. Unlike financial auditing, which has a century of standards, professional credentialing, and regulatory infrastructure, AI auditing has no universal standards, no recognized credential, and no mandatory disclosure regime covering most AI applications. The practices, institutions, and standards being established now will shape AI auditing for decades.

  2. Four Types of Audit Address Different Questions. Technical audits assess model performance and fairness across demographic groups. Process audits assess development and deployment practices. Impact audits assess real-world outcomes for affected populations. Compliance audits assess adherence to legal requirements. An effective AI governance framework requires all four, integrated across the AI system's lifecycle.

  3. The Access, Expertise, and Independence Problems Are the Central Challenges. External auditors need access to models and training data (which organizations protect as IP), technical expertise to evaluate complex systems (which is rare and expensive), and independence from the organizations they audit (which is compromised by financial relationships). Without all three, external AI auditing provides limited assurance.

  4. Pre-Deployment Assessment Is the Most Important Moment. Impact assessments conducted before deployment allow identification and mitigation of potential harms before they occur at scale. Post-deployment auditing is necessary but reactive; pre-deployment assessment is proactive. The Collingridge dilemma makes pre-deployment action especially valuable, because problems are more tractable before systems are entrenched.

  5. Data Auditing Is as Important as Model Auditing. The performance of any AI system is constrained by its training data. Data auditing — for representativeness, quality, and ethical provenance — is the foundation of meaningful technical auditing. Documentation standards like Datasheets for Datasets and model cards provide the information necessary for effective auditing.

  6. Fairness Metric Selection Is a Normative, Not Technical, Judgment. Different fairness metrics — demographic parity, equalized odds, predictive parity — can give contradictory verdicts on the same system, and when base rates differ across groups, it is mathematically impossible to satisfy all of them simultaneously. Choosing which metric matters is a value judgment about whose errors count most, not a purely technical question.
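A short sketch makes the incompatibility concrete. The per-group confusion matrices below are hypothetical numbers chosen so the two groups have different base rates (0.50 vs 0.20); with that setup, equalized odds can hold exactly while demographic parity and predictive parity both fail:

```python
# Three fairness metrics computed from per-group confusion matrices.
# The counts are synthetic, chosen to give the groups different base
# rates -- the condition under which the metrics cannot all hold.

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "positive_rate": (tp + fp) / total,  # demographic parity
        "tpr": tp / (tp + fn),               # equalized odds (true positives)
        "fpr": fp / (fp + tn),               # equalized odds (false positives)
        "ppv": tp / (tp + fp),               # predictive parity (precision)
    }

group_a = metrics(tp=40, fp=10, fn=10, tn=40)  # base rate 0.50
group_b = metrics(tp=16, fp=16, fn=4, tn=64)   # base rate 0.20

# Equalized odds holds: identical error rates for both groups...
assert group_a["tpr"] == group_b["tpr"] == 0.8
assert group_a["fpr"] == group_b["fpr"] == 0.2

# ...yet demographic parity fails (0.5 vs 0.32) and predictive
# parity fails (precision 0.8 vs 0.5). Declaring this system "fair"
# or "unfair" depends entirely on which metric the auditor selects.
print(group_a["positive_rate"], group_b["positive_rate"])
print(group_a["ppv"], group_b["ppv"])
```

Which disparity matters most is exactly the normative judgment the takeaway describes: equal selection rates, equal error rates, or equal precision cannot all be had at once when base rates differ.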

  7. External Auditing Is Possible from Output Data Alone. The ProPublica COMPAS investigation showed that meaningful external assessment of an AI system's fairness properties is possible using publicly available outcome data, without access to the model itself. The approach has limitations, but it demonstrates that denying model access does not place a system beyond accountability.
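The output-only approach can be sketched in a few lines. Given only (group, prediction, outcome) records, an external auditor can compare per-group false positive rates, in the spirit of the ProPublica analysis. The records below are synthetic, for illustration only:

```python
# Output-only bias audit sketch: no model access, just observed
# predictions and outcomes. Records are synthetic illustrations.
from collections import defaultdict

records = [
    # (group, predicted_high_risk, reoffended)
    ("A", True, False), ("A", True, True), ("A", False, False),
    ("A", True, False), ("B", True, True), ("B", False, False),
    ("B", False, False), ("B", True, True), ("B", False, True),
]

fp = defaultdict(int)   # labeled high risk but did not reoffend
neg = defaultdict(int)  # all who did not reoffend

for group, pred, outcome in records:
    if not outcome:
        neg[group] += 1
        if pred:
            fp[group] += 1

# False positive rate per group: the share of non-reoffenders
# wrongly flagged as high risk.
for group in sorted(neg):
    print(group, round(fp[group] / neg[group], 2))
```

With these toy records, group A's non-reoffenders are flagged at a far higher rate than group B's, the kind of disparity an auditor can surface from public outcome data alone, though without model access the cause of the disparity remains out of reach.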

  8. NYC Local Law 144 Established a Principle, Not a Complete Solution. LL 144 established the important principle that AI employment tools must be independently audited and publicly disclosed. But its narrow scope, limited audit standards, absence of validity requirements, and weak enforcement have produced uneven compliance and variable audit quality. More comprehensive audit requirements are needed.

  9. Auditing Generative AI Requires Different Methods. Traditional performance and fairness auditing is designed for AI systems with defined tasks and bounded outputs. Generative AI systems require red-teaming, capability evaluation, and horizon scanning — methods borrowed from cybersecurity and applied to the distinct challenge of assessing systems with emergent and potentially harmful behaviors.

  10. Internal Audit Functions Require Genuine Independence to Provide Value. An internal AI audit function that reports to the management overseeing AI development has conflicts of interest that limit its value. Effective internal audit requires reporting to the board audit committee or an independent risk function, with authority to escalate findings without management filtering.

Key Terms to Know

  • AI audit: systematic examination of AI system design, development, deployment, and outcomes
  • Bias audit: audit focused on demographic disparities in AI system outputs
  • Algorithmic impact assessment (AIA): pre-deployment assessment of potential harms to affected populations
  • Technical audit: examination of model performance, accuracy, and fairness metrics
  • Impact audit: assessment of real-world outcomes for affected populations
  • Conformity assessment: EU AI Act mechanism for evaluating high-risk AI compliance
  • Red-teaming: structured adversarial testing to find harmful AI behaviors
  • Model card: standardized documentation template for AI model performance and limitations
  • Datasheets for Datasets: standardized documentation template for AI training data
  • Demographic parity: fairness metric requiring equal positive outcome rates across groups
  • Equalized odds: fairness metric requiring equal true and false positive rates across groups
  • Calibration: fairness metric requiring that a given risk score correspond to the same actual outcome rate across groups
  • SR 11-7: Federal Reserve guidance requiring model validation for financial institutions

Connections to Other Chapters

  • Chapter 9: Fairness metrics and their mathematical incompatibility — essential technical foundation for audit metrics
  • Chapter 18: Accountability — who bears responsibility when audits fail or are not conducted
  • Chapter 20: Liability — legal consequences of audit findings and audit failures
  • Chapter 30: COMPAS — audit of criminal justice AI
  • Chapter 33: EU AI Act — comprehensive regulatory framework for AI auditing