Chapter 8: Assessment Quiz

Total Questions: 20
Sections: Multiple Choice (8) | True/False (5) | Short Answer (4) | Applied Scenario (3)
Suggested Time: 60–75 minutes


Section 1: Multiple Choice

Select the single best answer for each question.


Question 1

According to the Suresh and Guttag (2019) taxonomy, which type of bias arises when a model is applied to a population that is substantially different from the population it was trained on?

A) Representation bias
B) Aggregation bias
C) Deployment bias
D) Measurement bias


Question 2

A credit scoring model uses "number of bank accounts held for more than 10 years" as a feature. Black and Latino applicants, who were historically excluded from mainstream banking through redlining and discriminatory lending, are less likely to have this account history. What primary bias type does this feature introduce?

A) Representation bias, because the training data undersamples Black and Latino applicants
B) Historical bias, because the feature reflects patterns created by historical discrimination
C) Aggregation bias, because a single threshold is applied across heterogeneous groups
D) Deployment bias, because the model is being used in a different context than it was designed for


Question 3

A voice recognition system trained on a dataset of speakers aged 18–45 is deployed in a customer service application that serves callers across all age groups. Which bias types are MOST directly implicated?

A) Historical bias and aggregation bias
B) Representation bias and deployment bias
C) Measurement bias and evaluation bias
D) Aggregation bias and deployment bias


Question 4

Researchers find that a commercial face analysis system has an error rate of 1.2% for lighter-skinned men and 34.7% for darker-skinned women. This disparity was not discovered until after the system had been commercially released for three years, because the system was evaluated on a benchmark dataset that overrepresented lighter-skinned men. Which bias type is primarily responsible for the failure to discover the disparity earlier?

A) Historical bias
B) Aggregation bias
C) Evaluation bias
D) Deployment bias


Question 5

The Bolukbasi et al. (2016) study of word embeddings found that embeddings trained on Google News text associated the word "programmer" with the male vector and "homemaker" with the female vector. For a resume screening model that uses these embeddings as input features, this finding is most directly relevant to which bias type?

A) Historical bias, because the associations reflect historical occupational segregation
B) Measurement bias, because the embeddings measure language differently for different groups
C) Both historical bias and measurement bias — the embeddings encode historical stereotypes and will measure applicant qualifications differently by gender
D) Evaluation bias, because the Google News training corpus is not a representative benchmark


Question 6

A clinical AI model for diabetic patient management uses a single HbA1c threshold of 6.5% to identify patients requiring intensified treatment. Research shows that Black patients have higher average HbA1c levels than white patients with equivalent blood glucose control, due to differences in red blood cell turnover. A single-threshold model will therefore have different sensitivity and specificity for Black and white patients. This is a textbook example of:

A) Deployment bias
B) Historical bias
C) Aggregation bias
D) Representation bias


Question 7

Which of the following is the MOST accurate description of why removing protected attributes (race, gender, national origin) from a training dataset is insufficient to prevent discriminatory model outputs?

A) Machine learning algorithms can infer protected attributes from model architecture even without seeing them in data
B) Protected attribute information is distributed across correlated variables (proxy variables) that remain in the model
C) Removing protected attributes causes evaluation bias, making it harder to measure discrimination after removal
D) Regulators require that protected attributes be present in training data to enable auditing


Question 8

An enterprise company licenses an LLM API to power a customer service chatbot. An internal audit reveals the chatbot generates more helpful responses for customers who write in Standard American English than for customers who use African American Vernacular English (AAVE). The enterprise company's legal and compliance team argues that this is the model developer's responsibility because the company did not train the model. Which of the following best describes the enterprise company's actual responsibility?

A) The enterprise company bears no responsibility because it did not develop the model
B) The enterprise company shares responsibility with the model developer because it deployed the system and is liable for its discriminatory outputs in its deployment context
C) Responsibility lies solely with the customers, who chose to use AAVE in a commercial context
D) The enterprise company is responsible only if it was aware of the bias before deployment


Section 2: True / False

Write True or False and provide a 2–3 sentence justification for your answer.


Question 9

A machine learning model trained and evaluated on a benchmark dataset with 95% overall accuracy can be confidently described as a high-performing model for all user populations.


Question 10

The primary cause of anti-Muslim bias documented in GPT-3 by Abid et al. (2021) was a deliberate design choice by OpenAI engineers to associate Muslims with violence.


Question 11

Reinforcement Learning from Human Feedback (RLHF) eliminates the cultural biases encoded in a large language model's base weights, so that RLHF-trained models are safe to deploy in any context without further bias monitoring.


Question 12

The Sjoding et al. (2020, NEJM) study found that pulse oximeters underestimated oxygen saturation in Black patients (i.e., showed lower readings than true values), leading to unnecessary hospitalization.


Question 13

Datasheets for Datasets, as proposed by Gebru et al. (2018), provide a standardized format for documenting training datasets in ways that support bias auditing and informed deployment decisions.


Section 3: Short Answer

Answer each question in 150–250 words.


Question 14

Explain the concept of "proxy whack-a-mole" in the context of algorithmic fairness. Why does removing a single identified proxy variable typically fail to resolve the discrimination problem it was intended to address? Use a specific example from financial services or hiring to illustrate your answer.


Question 15

The chapter introduces the distinction between "ethics washing" and "genuine ethics" in organizational AI practices. Describe three specific markers that distinguish genuine organizational commitment to bias prevention from ethics washing. For each marker, explain what the marker looks like in practice and why its presence or absence is meaningful.


Question 16

What is aggregation bias, and why does it represent a particular challenge for AI systems deployed across demographically diverse populations? Describe both the technical mechanism and the organizational challenge, and identify one domain where aggregation bias has had documented clinical or social consequences.


Question 17

Explain why the performance gap between demographic groups in AI systems — where systems perform better for majority groups than minority groups — is not a random outcome but a predictable consequence of how AI systems are typically developed and evaluated. Identify at least three specific points in the development pipeline where this performance gap is widened.


Section 4: Applied Scenario

Each scenario question requires an extended response of 300–500 words. Your response should demonstrate specific knowledge from the chapter rather than general reasoning.


Question 18: The Hiring Algorithm

A technology company has developed an AI resume screening tool to help process the high volume of applications it receives for engineering roles. The tool was trained on ten years of historical hiring data. The company's workforce is currently 82% male. The tool achieves 91% accuracy in predicting which applicants the company's hiring managers would have selected, based on a held-out validation set drawn from the same historical period.

Identify and explain the specific bias types present in this scenario. For each bias type: (a) explain the mechanism by which it enters the system, (b) describe the likely discriminatory outcome, and (c) propose a specific mitigation measure. Conclude with a recommendation about whether this tool should be deployed as-is, modified and then deployed, or not deployed.


Question 19: The International Deployment

A US-based healthcare AI company develops a symptom triage tool using a training dataset drawn from electronic health records at 12 US academic medical centers. The dataset contains 2.4 million patient records. The company reports overall diagnostic accuracy of 88% and has published disaggregated performance metrics showing acceptable performance across Black, white, and Hispanic patient populations in the US.

The company is now proposing to license the tool to hospitals in Nigeria, Ghana, and South Africa. A local NGO raises concerns about deploying the tool in these markets. Explain the specific bias risks this international deployment creates, drawing on at least three distinct concepts from the chapter. What should the company be required to do before deploying the tool in these markets?


Question 20: The LLM Content Moderation Tool

A major social media platform is deploying an AI content moderation system powered by a fine-tuned large language model. The system is designed to detect and remove hate speech. Internal testing shows that the system removes posts written in African American Vernacular English (AAVE) at twice the rate of posts written in Standard American English, even when the AAVE posts do not violate the platform's hate speech standards. A civil rights organization has published a report documenting this pattern.

Using concepts from Chapter 8, explain: (a) the technical sources of the observed disparity, (b) why the RLHF fine-tuning on hate speech data may have worsened rather than improved the disparity for AAVE speakers, (c) what the platform's ethical and legal obligations are in response to the documented disparity, and (d) what remediation measures are technically feasible and organizationally realistic.


Answer Key


Multiple Choice Answers

Question 1: C. Deployment bias occurs when a system is applied to a population or context different from its design context. Population shift is a core form of deployment bias.

Question 2: B. The absence of long-term banking history for Black and Latino applicants is directly caused by historical discriminatory exclusion (redlining, discriminatory lending). The feature accurately reflects historical patterns, but those patterns were themselves the product of discrimination.

Question 3: B. The training dataset underrepresents older speakers (representation bias). The system is then applied to a population including all age groups, a deployment context broader than the training context (deployment bias).

Question 4: C. The failure to discover the disparity was caused by evaluating the system on a biased benchmark that overrepresented lighter-skinned men. If the evaluation benchmark had been representative, the disparity would have been discovered earlier.

Question 5: C. The associations reflect historical occupational segregation (historical bias), and the embeddings will produce different-quality representations of male and female applicants' professional qualifications (measurement bias). A complete answer recognizes both mechanisms.

Question 6: C. Aggregation bias: the model applies a single threshold (a single model) to a heterogeneous population with meaningfully different underlying distributions of the measured variable.

Question 7: B. The information carried by protected attributes is distributed across correlated proxy variables. Removing the protected attribute itself does not remove the predictive signal it carried, because that signal persists in correlated features.

Question 8: B. Enterprise deployers bear shared responsibility for the discriminatory outputs of systems they deploy, regardless of who developed the underlying model. Applicable anti-discrimination law does not provide a "third-party developer" exemption for disparate impact.
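
The proxy-variable mechanism behind the answer to Question 7 can be illustrated with a toy simulation. Everything here is a teaching sketch, not material from the chapter: the variable names, the 5-year group gap, and the approval threshold are all illustrative assumptions.

```python
import random

random.seed(0)

# Toy population: a binary protected attribute and one correlated proxy
# (e.g., years of banking history). All numbers are illustrative.
people = []
for _ in range(10_000):
    group = random.randint(0, 1)              # protected attribute
    years = random.gauss(15 - 5 * group, 3)   # group 1 averages ~5 fewer years
    people.append((group, years))

# "Fairness through unawareness": the protected attribute is dropped, and a
# threshold model mimicking historical approvals scores the proxy alone.
THRESHOLD = 12
rates = {}
for g in (0, 1):
    preds = [1 if years > THRESHOLD else 0
             for group, years in people if group == g]
    rates[g] = sum(preds) / len(preds)
    print(f"group {g}: approval rate = {rates[g]:.2f}")
```

Even though the model never sees `group`, the approval rates diverge sharply, because the proxy carries the group signal. Removing the proxy and retraining typically surfaces the next correlated feature, which is the "proxy whack-a-mole" dynamic of Question 14.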

True / False Answers

Question 9: FALSE. Aggregate accuracy does not indicate performance for all subgroups. A model with 95% overall accuracy may have very poor accuracy for minority groups if those groups are underrepresented in the test set. Disaggregated performance metrics are required to assess performance across demographic groups. The statement conflates aggregate performance with equitable performance.
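
The arithmetic behind this answer is easy to verify: aggregate accuracy is a sample-weighted average, so a well-represented majority can mask severe failure on a small subgroup. A minimal sketch (the per-group counts below are hypothetical, chosen only to produce the 95% headline figure):

```python
# Per-group (correct, total) counts for a hypothetical test set.
results = {
    "majority": (891, 900),   # 99.0% accurate for the well-represented group
    "minority": (59, 100),    # 59.0% accurate for the underrepresented group
}

correct = sum(c for c, _ in results.values())
total = sum(n for _, n in results.values())
print(f"aggregate accuracy: {correct / total:.1%}")   # prints 95.0%

for group, (c, n) in results.items():
    print(f"{group}: {c / n:.1%}")
```

The headline "95% accurate" is true and, for the minority group, almost meaningless, which is why disaggregated reporting is the baseline requirement.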

Question 10: FALSE. The anti-Muslim bias in GPT-3 was not a deliberate design choice; it emerged from the model learning associations present in its training data. English-language internet and media text disproportionately associates Muslims with terrorism and violence due to patterns in media coverage. The model learned these statistical associations from the training corpus. There is no documented evidence of intentional bias by OpenAI engineers.

Question 11: FALSE. RLHF fine-tuning reduces the frequency of overtly biased outputs in contexts where the RLHF training provides a strong signal, but it does not remove the underlying bias from the base model's weights. Biased associations remain in the model and can be elicited through prompts that differ from the RLHF training distribution. Ongoing monitoring remains necessary.

Question 12: FALSE. The Sjoding et al. (2020) study found that pulse oximeters OVERESTIMATED oxygen saturation in Black patients — showed readings HIGHER than true values. This is the clinically dangerous direction: patients with dangerously low actual oxygen saturation appeared to have normal readings, leading to under-treatment and delayed intervention, not unnecessary hospitalization.
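
The clinical danger of the overestimation direction can be shown with a simple threshold check. The 92% alarm threshold and the +4-point device offset below are illustrative assumptions for teaching purposes, not figures from the Sjoding et al. study:

```python
ALARM_THRESHOLD = 92  # hypothetical SpO2 reading below which staff intervene

def triggers_alarm(true_spo2: float, device_offset: float) -> bool:
    """Return True if the displayed reading falls below the alarm threshold."""
    displayed = true_spo2 + device_offset
    return displayed < ALARM_THRESHOLD

# Patient with genuinely low oxygen saturation (true SpO2 = 89%).
print(triggers_alarm(89, 0))   # unbiased device: True  -> intervention
print(triggers_alarm(89, 4))   # reads 4 points high:   False -> missed case
```

An underestimating device would fail in the opposite, safer direction (false alarms); an overestimating one silently hides hypoxemia, which is exactly the occult-hypoxemia pattern the study documented.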

Question 13: TRUE. Datasheets for Datasets (Gebru et al., 2018) is precisely this framework. It provides a standardized template for documenting dataset motivation, composition, collection process, preprocessing, appropriate uses, distribution terms, and maintenance plans — all information directly relevant to bias auditing and deployment decision-making.


Short Answer Scoring Rubric

Each short answer is worth up to 20 points.

  • 18–20 points: Accurately defines the concept, provides a specific and appropriate example, demonstrates understanding of practical implications; response is well-organized and within the word count.
  • 14–17 points: Accurately defines the concept with minor imprecision; provides an example but it may be partially correct or incompletely applied; practical implications are acknowledged but not fully developed.
  • 9–13 points: Definition is partially accurate; example is present but may be generic or misapplied; practical implications are minimal.
  • 0–8 points: Significant conceptual inaccuracies; no specific example or example is clearly incorrect; response does not demonstrate understanding of the chapter material.

Applied Scenario Scoring Rubric

Each applied scenario is worth up to 30 points.

  • 27–30 points: Accurately identifies all relevant bias types with correct mechanisms; proposed mitigations are specific and technically grounded; conclusion demonstrates integration of multiple chapter concepts; response directly engages the details of the scenario rather than making generic statements.
  • 21–26 points: Identifies most relevant bias types; mechanisms are mostly accurate with minor errors; mitigations are reasonable if somewhat general; conclusion is logical.
  • 14–20 points: Identifies some relevant bias types; mechanisms show partial understanding; mitigations are present but may be impractical or insufficiently specific; limited integration of chapter concepts.
  • 0–13 points: Fails to correctly identify primary bias types; mechanisms are significantly mischaracterized; mitigations are absent or clearly inappropriate; minimal engagement with chapter material.

End of Chapter 8 Quiz