Case Study: Google's Model Cards in Practice

"The purpose of a model card is not to prove a model is good. It is to make visible the conditions under which a model might fail." -- Margaret Mitchell, lead author of "Model Cards for Model Reporting"

Overview

In 2019, Margaret Mitchell and colleagues at Google published "Model Cards for Model Reporting," a paper proposing a structured documentation framework for machine learning models. The paper was both a scholarly contribution and a practical tool: it argued that every trained model should be accompanied by a "model card" documenting its intended use, performance characteristics, limitations, and ethical considerations. Google subsequently implemented model cards for some of its public-facing AI systems, making it one of the first major technology companies to adopt the framework at scale. This case study examines the model card framework's design, Google's implementation, the framework's adoption beyond Google, and the unresolved tensions between documentation and accountability.

Skills Applied:

  • Evaluating the model card framework against the responsible AI principles from Section 29.2
  • Analyzing the gap between documentation and operational accountability
  • Assessing disaggregated performance reporting as a fairness tool
  • Connecting documentation practices to the broader responsible AI ecosystem


The Framework

The Original Paper

Mitchell et al. (2019) proposed model cards as "short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups and intersectional groups." The paper was motivated by two observations:

  1. Models are deployed without adequate documentation. Most ML models are shared, published, or deployed with only aggregate performance metrics. Users have no way to evaluate whether the model will work in their specific context.

  2. Aggregate metrics hide disparate performance. A model with 95% overall accuracy may perform at 99% for one demographic group and 82% for another. Without disaggregated reporting, these disparities are invisible.

The proposed model card format included:

  • Model details: name, version, type, author, date, license
  • Intended use: primary intended use cases, primary users, out-of-scope uses
  • Factors: relevant factors for model performance (demographic groups, environments, instrumentation)
  • Metrics: performance metrics chosen and why, along with decision thresholds
  • Evaluation data: datasets used for evaluation (with datasheets or descriptions)
  • Training data: description of training data (ideally with a datasheet)
  • Quantitative analyses: disaggregated performance across factors
  • Ethical considerations: known ethical issues, risks, and impacts
  • Caveats and recommendations: additional concerns and recommended actions
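These sections map naturally onto a structured record. A minimal Python sketch, with illustrative field names and example values (this is not the exact ModelCard dataclass from Section 29.4):

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    """Illustrative container for the Mitchell et al. (2019) sections."""
    model_details: dict          # name, version, type, author, date, license
    intended_use: str            # use cases, primary users, out-of-scope uses
    factors: list                # demographic groups, environments, instrumentation
    metrics: dict                # chosen metrics and decision thresholds
    evaluation_data: str         # datasets used for evaluation
    training_data: str           # description of training data
    quantitative_analyses: dict  # disaggregated performance across factors
    ethical_considerations: str  # known issues, risks, and impacts
    caveats: str                 # additional concerns and recommendations

# Hypothetical card for a face detection model (values are made up).
card = ModelCard(
    model_details={"name": "face-detector", "version": "1.0", "license": "Apache-2.0"},
    intended_use="Detect faces in consumer photos; surveillance is out of scope.",
    factors=["skin tone", "age group", "head pose"],
    metrics={"detection_rate": 0.95, "threshold": 0.5},
    evaluation_data="Internal benchmark annotated with Fitzpatrick skin types.",
    training_data="Licensed photo corpus (see accompanying datasheet).",
    quantitative_analyses={"skin tone I-II": 0.97, "skin tone V-VI": 0.89},
    ethical_considerations="Lower detection rate for very dark skin tones.",
    caveats="Re-evaluate before deployment under new camera conditions.",
)
```

The point of the structure is that every section is a required constructor argument: a card cannot be instantiated with the ethical considerations simply left out.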

The Key Innovation: Disaggregated Reporting

The model card's most significant contribution was normalizing disaggregated performance reporting. Rather than reporting a single accuracy number, model cards present performance broken down by relevant demographic and contextual factors.

For facial recognition, this meant reporting accuracy separately by gender, skin tone, and intersectional categories (e.g., dark-skinned women vs. light-skinned men). For natural language processing, this meant reporting performance separately by dialect, language variety, or topic domain. For medical AI, this meant reporting separately by patient demographics, clinical settings, and disease subtypes.

Disaggregated reporting does not solve bias -- it makes bias visible. A model card that shows 99% accuracy for light-skinned males and 82% accuracy for dark-skinned females does not fix the disparity. But it makes the disparity impossible to ignore, creating pressure for improvement and enabling informed deployment decisions.
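The kind of gap described above takes only a few lines of code to surface. A minimal sketch (the function name and toy evaluation records are illustrative) that computes per-group accuracy from (group, prediction, label) triples:

```python
from collections import defaultdict

def disaggregated_accuracy(records):
    """Accuracy per group from (group, prediction, label) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, pred, label in records:
        total[group] += 1
        if pred == label:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Toy evaluation set: the aggregate number hides a 17-point per-group gap.
records = (
    [("light-skinned men", 1, 1)] * 99 + [("light-skinned men", 0, 1)] * 1 +
    [("dark-skinned women", 1, 1)] * 82 + [("dark-skinned women", 0, 1)] * 18
)
by_group = disaggregated_accuracy(records)
# by_group -> {"light-skinned men": 0.99, "dark-skinned women": 0.82}
```

Reporting `by_group` rather than a single pooled accuracy is the entire mechanism: the disparity becomes a number in the card instead of noise averaged away.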


Google's Implementation

Published Model Cards

Google published model cards for several of its public-facing ML systems, including:

Face detection model (2020). Google published a model card for its face detection model (used in Google Photos and other products) that included disaggregated performance across skin tone categories (using the Fitzpatrick skin type scale), age groups, and head pose variations. The card documented that the model's detection rate was lower for very dark skin tones and for extreme head angles -- limitations that would be invisible in an aggregate accuracy report.

Toxicity detection (Perspective API). Google's Jigsaw team published a model card for the Perspective API, which detects toxic comments in online discussions. The card documented that the model had higher false positive rates for comments containing identity terms (e.g., "gay," "Muslim," "Black") because the training data contained toxic comments directed at these groups. The model had learned to associate the identity terms themselves with toxicity -- a form of bias that the model card made visible.

Speech recognition. Model cards for speech recognition systems documented performance differences across accents, languages, and background noise conditions. Systems trained primarily on standard American English performed less accurately for speakers with non-American accents -- a disparity documented in the model card.

Adoption and Integration

Google integrated model cards into several of its AI development processes:

  • TensorFlow Model Garden: Google's repository of pre-trained models includes model cards for many published models.
  • Model Cards Toolkit: Google released an open-source toolkit for generating model cards programmatically, available through TensorFlow.
  • Internal use: Some Google product teams adopted model cards for internal AI systems, though the extent of internal adoption is not fully public.

The Broader Adoption

Beyond Google

The model card framework has been adopted and adapted by organizations across the AI ecosystem:

Hugging Face. The AI model hub Hugging Face adopted model cards as a standard documentation format for the thousands of models hosted on its platform. Every model uploaded to Hugging Face Hub is expected to include a model card. In practice, the quality varies enormously -- some model cards are comprehensive, while many are perfunctory or missing entirely.

Microsoft. Microsoft incorporated model card concepts into its Responsible AI Dashboard and documentation standards.

Government agencies. The Canadian government's Algorithmic Impact Assessment framework includes model documentation requirements inspired by the model card format.

Academic research. The model card format has become increasingly common in academic ML publications, with some conferences and journals encouraging or requiring model cards for submitted models.

The Quality Problem

Widespread adoption has revealed a quality problem: having a model card is not the same as having a good model card. Many published model cards are:

  • Incomplete: Missing critical sections, particularly ethical considerations and disaggregated metrics
  • Perfunctory: Filling in sections with minimal, generic content rather than thoughtful analysis
  • Stale: Created at publication and never updated as the model evolves or new issues emerge
  • Disconnected: Existing as standalone documents with no connection to monitoring, audit, or governance processes

A 2024 study by Liang et al., analyzing more than 32,000 model cards on Hugging Face, found that the majority lacked disaggregated performance data, ethical considerations, or meaningful limitations documentation. The model card format had been adopted, but its substance was frequently absent.
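A first line of defense against perfunctory cards is automated linting. A sketch under simple assumptions (the section names follow the Mitchell et al. format; the word-count heuristic and its threshold are illustrative, not a published standard):

```python
REQUIRED_SECTIONS = [
    "intended_use", "metrics", "quantitative_analyses",
    "ethical_considerations", "caveats",
]

def lint_model_card(card: dict) -> list:
    """Return quality warnings for a model card given as a dict.

    A section is flagged if it is missing or empty, or as perfunctory
    if it is shorter than a (crude, illustrative) word threshold.
    """
    warnings = []
    for section in REQUIRED_SECTIONS:
        text = str(card.get(section, "") or "")
        if not text.strip():
            warnings.append(f"missing section: {section}")
        elif len(text.split()) < 5:
            warnings.append(f"perfunctory section: {section}")
    return warnings

# A typical low-effort card: two token entries, three sections absent.
sparse_card = {"intended_use": "Classify images.", "metrics": "accuracy"}
issues = lint_model_card(sparse_card)
# issues flags two perfunctory sections and three missing ones
```

A linter like this catches only structural emptiness, not hollow boilerplate that clears the threshold; substantive review still requires a human reader.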


The Unresolved Tensions

Documentation vs. Accountability

The fundamental tension in the model card framework is the gap between documentation and accountability. A model card can honestly document that a model performs poorly for certain demographic groups -- but it cannot compel the developers to fix the problem. The Perspective API model card documented the identity-term bias, but the system was deployed with the bias present (and partially mitigated through post-processing).

Is documentation sufficient, or must it be connected to governance mechanisms that require action when problems are identified? The original paper did not address this question directly. The governance frameworks of Chapters 26-28 suggest that documentation without accountability is necessary but insufficient.

Voluntary vs. Mandatory

Model cards are currently voluntary. No regulation requires them (though the EU AI Act's documentation requirements for high-risk systems are similar in spirit). The voluntary nature means that the organizations most likely to produce good model cards are the organizations least likely to need them -- those already committed to responsible AI. Organizations with problematic models and no accountability culture simply do not publish cards.

The Audience Problem

Model cards face a fundamental audience challenge. A card written for ML engineers may be incomprehensible to affected communities. A card written for the general public may lack the technical detail needed for meaningful review. The original paper acknowledged this tension but did not resolve it. In practice, most model cards are written for technical audiences, leaving affected communities without accessible documentation.

The Author's Departure

The tension between model documentation and corporate practice became acute when Margaret Mitchell, the lead author of the model cards paper, was fired by Google in February 2021. Mitchell had co-led Google's Ethical AI team alongside Timnit Gebru (who was terminated in December 2020). The model cards framework was designed to make AI systems transparent and accountable -- values that, Mitchell and Gebru argued, Google's own practices did not consistently embody.

The irony is sharp: the person who created one of the most influential tools for responsible AI documentation was dismissed by the company where she developed it, in part over disputes about the company's commitment to the principles the tool was designed to serve.


Analysis Through Chapter Frameworks

Connection to the Documentation Pipeline

The model card is the final stage in the documentation pipeline described in Section 29.4.1: datasheet (data) -> quality audit (integrity) -> lineage tracker (provenance) -> model card (model). Google's implementation partially realized this pipeline -- model cards reference training data descriptions and evaluation datasets. But the full pipeline, including lineage tracking and quality auditing, is not standard practice even at Google.

Connection to Ethics Governance

The model card framework works best when connected to organizational governance. VitraMed's model card (Section 29.4.2) includes a review_status field documenting ethics committee approval. This connection -- between documentation and governance -- is what gives the model card operational significance. Without governance, a model card is a report that nobody acts on.
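One way to make that connection operational is to treat the review status as a deployment gate. A hypothetical sketch (the GovernedModelCard type and can_deploy check are illustrative, not VitraMed's actual implementation from Section 29.4.2):

```python
from dataclasses import dataclass

@dataclass
class GovernedModelCard:
    """Card stub linking documentation to a governance decision."""
    model_name: str
    review_status: str  # e.g. "pending", "approved", "rejected"

def can_deploy(card: GovernedModelCard) -> bool:
    """Deployment gate: documentation alone is not enough --
    the ethics review recorded on the card must have passed."""
    return card.review_status == "approved"

draft = GovernedModelCard("triage-model", review_status="pending")
approved = GovernedModelCard("triage-model", review_status="approved")
# can_deploy(draft) -> False; can_deploy(approved) -> True
```

The design choice is that the pipeline, not the author, enforces the field: a card whose review never concluded simply cannot ship.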


Discussion Questions

  1. The quality problem. How can the model card framework be designed to encourage substantive rather than perfunctory documentation? Should there be minimum quality standards? Who would enforce them?

  2. The accountability gap. Is it sufficient to document that a model has disparate performance, or should documentation be connected to requirements for remediation? What would a "model card with teeth" look like?

  3. The audience problem. Propose a multi-layered model card format that serves technical reviewers, regulators, and affected communities simultaneously. What information does each audience need, and how should it be presented?

  4. The Mitchell/Gebru departures. What does it mean for responsible AI when the researchers who create accountability tools are dismissed by the organizations that deploy them? How does this affect the credibility of corporate responsible AI commitments?

  5. Mandatory model cards. Should model cards be legally required for all AI systems, for high-risk systems only, or remain voluntary? Make the case for your preferred approach.


Your Turn: Mini-Project

Option A: Model Card Review. Find three published model cards (Hugging Face, TensorFlow Model Garden, or company publications). Evaluate each against the Mitchell et al. (2019) framework. Which sections are well-completed? Which are missing or perfunctory? Write a comparative assessment.

Option B: Create a Model Card. Select an ML model you have built or interacted with. Create a comprehensive model card using the ModelCard dataclass from Section 29.4. Include all sections, with particular attention to ethical considerations, limitations, and disaggregated metrics.

Option C: Accessible Model Card. Take a technical model card (from Hugging Face or Google) and rewrite it for a non-technical audience. Your version should convey the same information in language accessible to someone affected by the model's decisions but without ML training. Reflect on what was easy and hard to translate.


References

  • Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2019, 220-229.

  • Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daume III, and Kate Crawford. "Datasheets for Datasets." Communications of the ACM 64, no. 12 (2021): 86-92.

  • Liang, Weixin, et al. "What's Documented in AI? Systematic Analysis of 32K AI Model Cards." arXiv preprint arXiv:2402.05160, 2024.

  • Google. "Model Cards." Google AI, 2020. ai.google/responsibilities/model-cards.

  • Jigsaw. "Perspective API Model Card." Jigsaw/Google, 2020.

  • Hugging Face. "Model Cards Documentation." huggingface.co/docs/hub/model-cards.

  • Simonite, Tom. "What Really Happened When Google Ousted Timnit Gebru." Wired, June 8, 2021.

  • Raji, Inioluwa Deborah, et al. "Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2020, 33-44.