Case Study 7.2: Racial Bias in Facial Recognition

The NIST Findings and Their Implications


Overview

System: Multiple commercial facial recognition systems evaluated by NIST
Evaluation: Face Recognition Vendor Test (FRVT) program, ongoing since 2000; most relevant report published December 2019
Key finding: False positive rates 10 to 100 times higher for Black and Asian faces than for white faces in one-to-one verification tasks
Real-world consequences: Documented wrongful arrests of Black individuals based on facial recognition matches
Status: Ongoing use by law enforcement despite documented disparities; some municipal bans; voluntary sales pauses by major vendors


1. Why NIST Tested Facial Recognition Systems

The National Institute of Standards and Technology occupies a distinctive role in the American technology ecosystem. NIST's mission is to develop measurement standards and evaluate technologies — it does not regulate or prohibit, but its evaluations carry significant authority because they are conducted with methodological rigor and without commercial interests.

NIST began systematic evaluation of facial recognition systems through its Face Recognition Vendor Test (FRVT) program in the early 2000s, motivated by two parallel forces. First, the rapid commercial expansion of facial recognition technology, driven by advances in deep learning from approximately 2012 onward, had produced systems with dramatically improved performance claims that had not been independently verified. Second, growing law enforcement interest in using these systems for identification tasks — matching a probe face against a database of known individuals — created urgent public interest in understanding how well these systems actually worked.

By 2019, NIST had obtained access to more than 100 facial recognition algorithms from vendors including the largest global players in the market. The 2019 evaluation, titled "Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects," was specifically designed to examine whether these systems performed equally well across demographic groups. The answer was an unambiguous no — and the magnitude of the disparities was larger than most observers had anticipated.


2. What the FRVT Study Found: Unequal False Positive Rates

The NIST evaluation examined two primary types of facial recognition tasks:

One-to-one verification involves comparing a probe face image to a single stored reference image to determine whether they depict the same person. This is the task performed by smartphone face-unlock systems, airport biometric gates, and identity verification systems. The key error metric is the false positive rate: how often does the system incorrectly identify two different people as the same person?

One-to-many identification involves comparing a probe face to a database of many stored faces to find a match. This is the task performed in law enforcement contexts — comparing a surveillance image to a database of known individuals. The false positive implications are even more severe here, because a false positive means falsely identifying an innocent person as matching someone in the database.
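To make the verification metric concrete, here is a minimal sketch of how a false positive rate can be computed from "impostor" comparisons — pairs of images known to depict different people. The similarity scores and accept threshold are made-up illustrative values, not drawn from any real system:

```python
# Minimal sketch of the one-to-one verification error metric.
# Scores and threshold are illustrative, not from any real system.

def false_positive_rate(impostor_scores: list[float], threshold: float) -> float:
    """Fraction of different-person (impostor) comparisons whose
    similarity score meets or exceeds the accept threshold."""
    accepted = sum(1 for score in impostor_scores if score >= threshold)
    return accepted / len(impostor_scores)

# Eight impostor comparisons; two score at or above the 0.5 accept threshold.
impostor_scores = [0.12, 0.30, 0.55, 0.08, 0.61, 0.27, 0.40, 0.19]
print(false_positive_rate(impostor_scores, threshold=0.5))  # 0.25
```

Disaggregating this computation — maintaining a separate impostor set per demographic group — is what allows per-group false positive rates to be compared at all.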

For one-to-one verification, NIST found that the false positive rates for Black and Asian faces were 10 to 100 times higher than for white faces, depending on the algorithm and the demographic group. The effect was consistent across the vast majority of the more than 100 algorithms tested. It was not a flaw in a small number of systems; it was a systematic characteristic of the technology across the industry.

Asian faces showed false positive rates that were, in many algorithms, even higher than those for Black faces in one-to-one verification tasks. Native American faces showed some of the highest false positive rates of any group in the NIST data. The disparities for women were generally smaller than those for racial and ethnic minorities, but still present.

African American females showed particularly high false positive rates, consistent with the intersectionality findings from the Gender Shades research. The most adverse performance was concentrated at the intersection of race and gender, precisely as Buolamwini's framework predicted.

NIST also examined false negative rates — how often the system fails to match two images of the same person. Disparities existed here as well, but they followed a different pattern. The false positive disparities were the more consequential finding for most use cases, because false positives are the mechanism through which innocent people can be wrongly implicated.


3. The One-to-Many Matching Problem: Higher Disparities, Higher Stakes

The disparities in one-to-many identification tasks — the law enforcement use case — were even more severe than those in one-to-one verification, and the stakes were proportionally higher.

In one-to-many identification, a system is searching a database of potentially millions of images to find a match for a probe face. The probability of a false positive increases with database size: the more faces in the database, the more opportunities for a spurious match between the probe and some other person in the database. For a system with a false positive rate that is already 10 to 100 times higher for Black faces than for white faces, database-scale search amplifies this disparity dramatically.
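Under the simplifying assumption that each gallery comparison is an independent trial, this amplification can be sketched numerically. The per-comparison rates below are illustrative assumptions, not NIST's measured values:

```python
# Sketch: how a per-comparison false positive rate compounds in a
# one-to-many search, assuming independent comparisons.
# All rates here are illustrative, not NIST's measured values.

def gallery_false_positive_prob(per_comparison_fpr: float, gallery_size: int) -> float:
    """Probability that a probe spuriously matches at least one face
    in a gallery of the given size."""
    return 1.0 - (1.0 - per_comparison_fpr) ** gallery_size

baseline_fpr = 1e-7        # assumed rate for the best-served group
disparity_factor = 100     # upper end of the 10-100x range NIST reported

for gallery_size in (10_000, 1_000_000):
    low = gallery_false_positive_prob(baseline_fpr, gallery_size)
    high = gallery_false_positive_prob(baseline_fpr * disparity_factor, gallery_size)
    print(f"gallery={gallery_size:>9,}  baseline group: {low:.4f}  disadvantaged group: {high:.4f}")
```

With these assumed numbers, a million-image gallery produces a spurious match for the baseline group in roughly one search in ten, while the group facing the hundredfold higher rate is spuriously matched in essentially every search.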

The practical implication is this: a Black suspect in a law enforcement facial recognition search is substantially more likely to be falsely matched to someone in the database than a white suspect, simply because of differential system accuracy. This disparity is not a product of any difference in the actual evidence; it is an artifact of the technology's differential performance.

Law enforcement agencies using these systems were, by and large, not accounting for this disparity in their use protocols. A facial recognition "hit" was often treated as significant investigative evidence regardless of the technology's known accuracy disparities across demographic groups — creating conditions for systematic false accusation of Black individuals at rates higher than white individuals.


4. Response from Facial Recognition Companies

The NIST findings, combined with the political moment of the summer of 2020 following the killing of George Floyd and the subsequent national conversation about policing and racial justice, prompted several major technology companies to announce voluntary pauses or changes to their facial recognition practices.

IBM announced in June 2020 that it would no longer offer general-purpose facial recognition products and would no longer research or develop facial recognition technology. IBM's chief executive wrote directly to Congress, calling for a national conversation about whether and how facial recognition should be used by law enforcement.

Microsoft announced a pause on sales of facial recognition technology to police departments and called for federal regulation before resuming such sales. Microsoft stated that it did not believe it was appropriate to sell technology with known accuracy disparities to law enforcement for use in high-stakes identification tasks without a legal framework governing its use.

Amazon announced a one-year moratorium on police use of its Rekognition facial recognition system in June 2020, subsequently extended. Amazon had previously defended Rekognition against criticism, commissioning research that disputed earlier accuracy analyses, and continued to sell the technology to private sector customers throughout the moratorium period.

These voluntary actions were significant and widely praised by advocates. They were also limited: they were temporary, restricted to specific use cases, and unilateral decisions by individual companies rather than the product of regulation or binding accountability mechanisms. Several other major facial recognition vendors — including Clearview AI, which had scraped billions of images from social media platforms to build a law enforcement database — continued operations without similar pauses.


5. The Case of Robert Williams: Wrongful Arrest in Detroit, 2020

The statistical disparities documented by NIST became human reality in the case of Robert Williams, a Black man from Farmington Hills, Michigan. On January 9, 2020, Williams was standing in his driveway when Detroit police officers arrived and arrested him in front of his wife and daughters. He was handcuffed and taken to a detention facility, where he was held overnight.

The basis for the arrest was a facial recognition match. A Detroit police detective had run an image from a store security camera — showing a Black man who had allegedly stolen watches — through a facial recognition database. The system had returned a match to Williams's driver's license photo. Williams was arrested on that basis.

At the police station, a detective showed Williams a photo array that included the security camera image and his driver's license photo. Williams compared the two images and pointed out obvious differences. "This is me," he said of his license photo, then of the security camera image: "And this is not me." The detective ended the interview, and Williams was held overnight before being released. The charges were eventually dropped after prosecutors reviewed the case and found the facial recognition match unsupported.

Williams's case attracted national attention when the American Civil Liberties Union publicized it in June 2020. It became the first documented case in the United States of a wrongful arrest based on a facial recognition match — though advocates and attorneys representing affected individuals noted that it was almost certainly not the only such case, simply the first to receive public attention.

The mechanism of the wrongful arrest is significant. The facial recognition system produced a false positive — a match between two different people. This is precisely the error that NIST's evaluation had found occurred at dramatically higher rates for Black faces than for white faces. Williams's arrest was not an anomaly or a system malfunction; it was an example of exactly the differential failure mode that the research predicted.


6. The Case of Nijeer Parks: Wrongful Arrest in New Jersey, 2019

A second documented case of wrongful arrest based on facial recognition came to light in 2020. Nijeer Parks, a Black man from Paterson, New Jersey, was arrested in January 2019 (the case became public in 2020) on charges that he had tried to pay for items at a Hampton Inn in Woodbridge with a fake ID and had fled from police, nearly striking an officer with a car. He was arrested based on a facial recognition match between security camera footage and a photo from a database of state IDs.

Parks spent ten days in jail before being released on bail. He paid approximately $5,000 for legal representation. Prosecutors eventually reviewed the case and dropped all charges. Parks sued the Woodbridge police department, which had used facial recognition in the investigation.

As with the Williams case, the match was a false positive. Parks had not been in Woodbridge on the day of the incident; he was more than thirty miles away, a fact that investigators failed to verify before acting on the facial recognition match.

The Parks case underscored several systemic concerns beyond the accuracy disparity itself. Investigators in both cases appeared to treat the facial recognition match as more reliable than the tool's known limitations warranted. Standard investigative corroboration — establishing that the suspect could have been at the location, checking alibi evidence, comparing physical characteristics beyond a single still image — was either skipped or given insufficient weight.

The combination of overreliance on facial recognition as investigative evidence and the technology's dramatically higher false positive rate for Black faces creates a predictable and documented pathway to wrongful arrest and prosecution of Black individuals. Both Williams and Parks described the experience as traumatic. Parks lost his job during the period he was facing criminal charges.


7. The Moratorium Movement: Cities Banning Municipal Use

The combination of documented accuracy disparities, high-profile wrongful arrests, and the heightened political attention to racial justice in 2020 drove a wave of municipal legislation restricting or banning facial recognition.

San Francisco became the first US city to ban government use of facial recognition technology, in May 2019 — before the Williams and Parks cases became public. Oakland, Berkeley, and Boston followed. Portland, Oregon, enacted one of the most comprehensive bans, prohibiting both government and private sector use of facial recognition in public spaces. Detroit — the city where Robert Williams had been wrongfully arrested — placed new restrictions on its use of facial recognition, requiring additional corroboration before any arrest could be based in part on a facial recognition match.

These moratoriums varied considerably in scope and permanence. Some banned all government uses; others banned law enforcement uses while permitting other municipal applications. Some were framed as permanent prohibitions; others as temporary moratoriums pending federal regulation or independent evaluation. Several have been revised, narrowed, or allowed to lapse as political attention has shifted.

The moratorium movement reflects an important ethical position: in the absence of regulation ensuring minimum accuracy and equal performance across demographic groups, the precautionary principle argues for non-deployment of high-stakes technology in contexts where false positives can result in arrest, detention, or other deprivation of liberty. This is a legitimate regulatory choice, and the documented wrongful arrests provide concrete evidence for it.


8. The FBI's Use Despite Documented Disparities

The federal law enforcement context presents a different picture. The FBI has operated a facial recognition system called Next Generation Identification–Interstate Photo System (NGI-IPS) since 2011 and has gradually expanded its use. A 2019 Government Accountability Office report examined federal use of facial recognition and found that federal agencies were using the technology despite significant questions about accuracy, including the NIST findings of demographic disparities. The GAO noted that the FBI did not consistently vet the accuracy of the facial recognition algorithms it used and had not fully assessed the civil liberties implications of its programs.

A 2021 GAO report found that law enforcement agencies within the Department of Homeland Security and Department of Justice, including the FBI and DEA, used facial recognition with limited internal controls, inconsistent documentation, and without always verifying that the algorithms used had been evaluated for demographic accuracy.

The federal situation illustrates a fundamental accountability gap: while some cities were enacting bans, the most powerful law enforcement agencies in the country were expanding use of the same technology without consistent standards for accuracy, demographic equity, or oversight. The absence of federal regulation has meant that the most consequential uses of facial recognition — by agencies with national jurisdiction and substantial investigative power — are also among the least regulated.


9. What Accurate Facial Recognition Would Require

The NIST findings do not demonstrate that facial recognition systems cannot be accurate across demographic groups in principle — they demonstrate that the systems evaluated in 2019 were not. What would accurate, equitable facial recognition require?

First and most fundamentally, it would require training data that adequately represents all demographic groups. The accuracy disparities documented by NIST and Buolamwini are in significant part a consequence of training datasets that overrepresented lighter-skinned, male, Western faces and underrepresented other demographic groups. Models trained predominantly on one population will perform less well on others. Collecting and curating more representative training datasets is technically feasible but requires deliberate investment.

Second, it would require evaluation protocols that disaggregate performance by demographic group and establish minimum performance thresholds across all groups. A system that achieves 99 percent accuracy overall while performing at 85 percent accuracy for Black women should not be approved for deployment in high-stakes contexts. The standard for deployment must be set at the group level, not just the aggregate level.
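One way to encode such a group-level threshold is a deployment gate that inspects every group's accuracy rather than the aggregate. The group labels and accuracy figures below are hypothetical, chosen to mirror the example above:

```python
# Sketch of a group-level deployment gate. Group names and accuracy
# figures are hypothetical, mirroring the example in the text.

def deployment_gate(per_group_accuracy: dict[str, float], floor: float) -> bool:
    """Approve deployment only if every demographic group meets the floor."""
    return all(accuracy >= floor for accuracy in per_group_accuracy.values())

per_group_accuracy = {
    "group_a": 0.99,
    "group_b": 0.98,
    "group_c": 0.85,   # one underserved group, as in the text's example
}

aggregate = sum(per_group_accuracy.values()) / len(per_group_accuracy)
print(f"aggregate accuracy: {aggregate:.3f}")                        # 0.940 -- looks fine
print("approved:", deployment_gate(per_group_accuracy, floor=0.95))  # approved: False
```

The gate deliberately ignores the aggregate: a system can look excellent on average while failing the group for whom its errors are most consequential.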

Third, it would require ongoing monitoring of deployed systems across demographic groups, with established protocols for pausing or withdrawing systems that show performance degradations or differential performance in real-world conditions that were not apparent in pre-deployment testing.

Fourth, it would require robust use policies that limit the weight given to facial recognition matches and require corroborating evidence before any consequential action — particularly arrest — is taken. Even a highly accurate system can produce false positives, and those false positives have higher costs in some contexts than others. Policy constraints on use are as important as technical constraints on accuracy.

None of these requirements is technically exotic. They are all, in principle, achievable. Their absence in the systems and practices documented by NIST and the GAO reflects a combination of commercial pressure to deploy quickly, regulatory absence, and insufficient attention to the differential consequences of errors across demographic groups.


10. The Privacy vs. Bias Trade-Off: Even Accurate Recognition Raises Other Concerns

The focus of this case study is on accuracy disparities across demographic groups, but it is important to note that the concerns about facial recognition technology extend beyond bias. Even a perfectly accurate, perfectly equitable facial recognition system would raise profound questions about privacy, civil liberties, and the appropriate scope of government surveillance.

Facial recognition enables large-scale, automated identification of individuals in public spaces without their knowledge or consent. The ability to identify individuals from surveillance camera footage, to track their movements through public spaces, and to correlate those movements with other databases creates surveillance capabilities that are qualitatively different from historical police practices. These capabilities raise concerns about chilling effects on free expression and free assembly — if people know that their attendance at a protest or a political meeting can be identified and recorded, some will choose not to attend.

These concerns are not directly a function of accuracy; they apply even to perfectly accurate systems. The moratorium movement has sometimes conflated the accuracy argument with the privacy argument, making it important for policy analysis to keep them distinct. The accuracy argument — that systems with documented demographic disparities should not be used in high-stakes contexts — is strong and well-supported by the evidence. The privacy argument — that mass identification technology raises concerns independent of its accuracy — is also strong and applies even to systems that achieve the accuracy improvements the accuracy argument demands.

Business professionals considering facial recognition deployment should grapple with both dimensions. The bias argument sets a floor: do not deploy systems with documented demographic disparities in high-stakes applications. The privacy argument sets a ceiling: even above that floor, evaluate whether the surveillance capabilities enabled by the technology are consistent with the organization's values and the legal and social norms of the contexts in which it operates.


11. Discussion Questions

  1. NIST's findings showed that false positive rates for Black faces were 10 to 100 times higher than for white faces in commercial facial recognition systems. Given this documented disparity, what standard of corroboration should be required before law enforcement takes any action — including arrest — based in part on a facial recognition match? Should different standards apply for different demographic groups?

  2. Robert Williams and Nijeer Parks were wrongfully arrested based on facial recognition false positives. What legal remedies should be available to them? Who should be liable — the police department that used the technology, the city that authorized the program, or the facial recognition vendor whose system produced the false match?

  3. Several major cities banned government use of facial recognition, while federal law enforcement agencies continued to expand their programs. What does this disparity in regulatory response tell us about the relationships between local, state, and federal authority over surveillance technology? Is municipal regulation sufficient?

  4. IBM and Microsoft voluntarily paused sales of facial recognition to law enforcement without being required to do so by law. Evaluate this decision. Was it driven primarily by ethics or by reputational and commercial considerations? Does the distinction matter for the policy outcome?


This case study is referenced in Chapter 9 (fairness metrics), Chapter 11 (law enforcement AI), and Chapter 16 (surveillance technology). The NIST findings are a primary empirical anchor for discussions of accuracy disparities throughout this textbook.