Quiz: Bias in Data, Bias in Machines

Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.


Section 1: Multiple Choice (1 point each)

1. According to the chapter's definition (Section 14.1.1), algorithmic bias is:

  • A) Any error a machine learning system makes when classifying individuals.
  • B) Systematic outcomes that disadvantage certain groups in unjustified ways.
  • C) Intentional discrimination programmed into code by developers.
  • D) A statistical phenomenon that occurs only in small datasets.
Answer **B)** Systematic outcomes that disadvantage certain groups in unjustified ways. *Explanation:* Section 14.1.1 defines algorithmic bias with three essential elements: *systematic* (patterned, not random), *disadvantage* (differential harm to certain groups), and *unjustified* (driven by irrelevant or discriminatory factors). Option A describes general error, not bias. Option C requires intent, but the chapter emphasizes that the most dangerous biases are structural and unintentional. Option D is incorrect because bias can occur in datasets of any size.

2. A hospital builds a patient risk model using data that underrepresents elderly patients because fewer elderly patients were enrolled in the clinical trials from which the training data was drawn. This is an example of:

  • A) Historical bias
  • B) Representation bias
  • C) Deployment bias
  • D) Evaluation bias
Answer **B)** Representation bias. *Explanation:* Section 14.2.1 defines representation bias as occurring when the training data does not adequately represent the population the model will serve. Elderly patients being underrepresented in clinical trials means the model has insufficient examples of this population, leading to worse performance for them. Historical bias (A) exists in the world before data is collected. Deployment bias (C) occurs when a model is used in a different context than intended. Evaluation bias (D) occurs when benchmark data is unrepresentative.

3. Using arrest records as a proxy for criminal behavior introduces measurement bias because:

  • A) Arrest records are always inaccurate.
  • B) Arrests reflect policing patterns as much as actual crime — communities that are more heavily policed have more arrests, not necessarily more crime.
  • C) Arrest records are too small a dataset for machine learning.
  • D) Criminal behavior cannot be predicted by any statistical method.
Answer **B)** Arrests reflect policing patterns as much as actual crime — communities that are more heavily policed have more arrests, not necessarily more crime. *Explanation:* Section 14.2.1 explains that measurement bias arises when the features or labels used in a model are poor proxies for the underlying concept. Arrest records measure *policing activity*, not *criminal behavior*. In communities subject to intensive policing, more crimes are detected and recorded, creating the false appearance of higher crime rates. The proxy is contaminated by the very social structure the model is supposed to assess neutrally.

4. The bias pipeline presented in Section 14.3 identifies bias entry points at how many stages of machine learning development?

  • A) Two: data collection and model training
  • B) Four: data, training, evaluation, and deployment
  • C) Six: problem formulation, data collection, feature engineering, model training, evaluation, and deployment
  • D) One: the training data is the sole source of bias
Answer **C)** Six: problem formulation, data collection, feature engineering, model training, evaluation, and deployment. *Explanation:* Section 14.3.1 presents the full bias pipeline with six stages, emphasizing that bias can enter at *every* stage. This is a critical insight because it means that addressing bias at only one stage (e.g., "cleaning" the training data) is insufficient. Each stage introduces distinct types of bias and requires distinct interventions.

5. ProPublica's 2016 COMPAS investigation found that:

  • A) The algorithm was equally accurate for all racial groups and no bias existed.
  • B) Black defendants who did not reoffend were nearly twice as likely to be falsely flagged as high-risk compared to white defendants.
  • C) The algorithm used race as an explicit input feature, which caused the disparity.
  • D) White defendants received systematically higher risk scores than Black defendants.
Answer **B)** Black defendants who did not reoffend were nearly twice as likely to be falsely flagged as high-risk compared to white defendants. *Explanation:* Section 14.4.2 reports ProPublica's key finding: the false positive rate for Black defendants was 44.9% vs. 23.5% for white defendants. Conversely, white defendants who did reoffend were more likely to be falsely classified as low-risk. The algorithm did not use race explicitly (C is incorrect), and overall accuracy was similar for both groups — but the *types* of errors were asymmetrically distributed.
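
The "nearly twice" figure follows directly from the reported rates. The sketch below uses hypothetical confusion-matrix counts chosen only to reproduce ProPublica's published false positive rates; the counts themselves are illustrative, not the study's actual data:

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Share of actual non-reoffenders incorrectly flagged as
    high-risk: FP / (FP + TN)."""
    return fp / (fp + tn)

# Hypothetical counts per 1,000 non-reoffending defendants, chosen to
# match the rates reported in Section 14.4.2 (44.9% vs. 23.5%).
fpr_black = false_positive_rate(fp=449, tn=551)
fpr_white = false_positive_rate(fp=235, tn=765)

print(f"FPR (Black defendants): {fpr_black:.1%}")   # 44.9%
print(f"FPR (white defendants): {fpr_white:.1%}")   # 23.5%
print(f"Ratio: {fpr_black / fpr_white:.2f}x")       # 1.91x, i.e. "nearly twice"
```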

6. Northpointe (now Equivant) responded to ProPublica's findings by arguing that COMPAS was:

  • A) Never intended for use in criminal justice decisions.
  • B) Calibrated — defendants who received the same score had similar recidivism rates regardless of race.
  • C) Perfectly accurate and therefore fair by any standard.
  • D) Biased only in a statistical sense with no real-world consequences.
Answer **B)** Calibrated — defendants who received the same score had similar recidivism rates regardless of race. *Explanation:* Section 14.4.3 explains Northpointe's defense: the system was calibrated, meaning that a score of "7" predicted approximately the same recidivism rate for Black and white defendants. Both ProPublica's claim (disparate false positive rates) and Northpointe's claim (calibration) are mathematically correct — and, as Chapter 15 will show, they are mutually incompatible when base rates differ.

7. The Obermeyer et al. (2019) healthcare study found that the algorithm's primary source of bias was:

  • A) Using race as an explicit input variable.
  • B) Using healthcare costs as a proxy for health need, which systematically underestimated the needs of Black patients.
  • C) Having too little data to produce accurate predictions.
  • D) Being trained on data from a single hospital rather than a national dataset.
Answer **B)** Using healthcare costs as a proxy for health need, which systematically underestimated the needs of Black patients. *Explanation:* Section 14.5 explains that the algorithm used healthcare spending as a proxy for health need. Because Black patients historically have less access to healthcare — due to insurance gaps, geographic barriers, implicit clinical bias, and socioeconomic constraints — they generate lower healthcare costs at the same level of illness. The algorithm interpreted lower spending as lower need, systematically directing resources away from the patients who needed them most.

8. Amazon's experimental hiring algorithm penalized resumes containing the word "women's" because:

  • A) Amazon engineers deliberately programmed the system to discriminate.
  • B) The system was trained on 10 years of hiring data from a male-dominated workforce, and learned that male-associated patterns predicted hiring success.
  • C) The word "women's" appeared on resumes with lower GPAs.
  • D) A bug in the natural language processing module incorrectly parsed the word.
Answer **B)** The system was trained on 10 years of hiring data from a male-dominated workforce, and learned that male-associated patterns predicted hiring success. *Explanation:* Section 14.6 explains that Amazon's algorithm was trained on historical hiring data that reflected the company's predominantly male engineering workforce. The algorithm learned that features associated with maleness (including the *absence* of "women's" on a resume) predicted successful hiring. When Amazon removed gender as a feature, the algorithm used proxy features — women's colleges, women's organizations, certain verbs — to reconstruct the gender signal. This is a textbook case of historical bias amplified through redundant encoding.

9. The four-fifths rule states that a selection process may have disparate impact if:

  • A) More than 80% of one group is rejected.
  • B) The selection rate for any group is less than four-fifths (80%) of the selection rate for the group with the highest rate.
  • C) The overall accuracy of the system is below 80%.
  • D) The false positive rate exceeds 20% for any group.
Answer **B)** The selection rate for any group is less than four-fifths (80%) of the selection rate for the group with the highest rate. *Explanation:* Section 14.7.1 defines the four-fifths rule: the disparate impact ratio = (selection rate of disadvantaged group) / (selection rate of advantaged group). If this ratio falls below 0.8, the process may have disparate impact and warrants investigation. The rule is a screening tool, not a definitive test of discrimination.
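
The screening computation is simple enough to sketch in code. A minimal illustration (function names and example rates are hypothetical, not from the chapter):

```python
def disparate_impact_ratio(rate_group: float, rate_reference: float) -> float:
    """Selection rate of a group divided by the highest group's rate."""
    return rate_group / rate_reference

def four_fifths_flag(selection_rates: dict[str, float]) -> dict[str, bool]:
    """Flag any group whose ratio to the best-selected group falls
    below 0.8. A flag is a signal to investigate, not proof of
    discrimination."""
    reference = max(selection_rates.values())
    return {
        group: disparate_impact_ratio(rate, reference) < 0.8
        for group, rate in selection_rates.items()
    }

# Illustrative selection rates (hypothetical numbers):
flags = four_fifths_flag({"A": 0.30, "B": 0.21, "C": 0.28})
print(flags)  # {'A': False, 'B': True, 'C': False}
```

Only group B falls below four-fifths of the top rate (0.21 / 0.30 = 0.70), so only B is flagged for investigation.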

10. The chapter argues that removing a protected attribute (like race) from a model's input features is usually insufficient to prevent bias because:

  • A) Legal regulations require that race always be included in algorithmic models.
  • B) Other features in the model may serve as proxy variables that are correlated with the protected attribute due to structural inequality.
  • C) Machine learning algorithms require at least one demographic variable to function.
  • D) Models without demographic features always have lower accuracy.
Answer **B)** Other features in the model may serve as proxy variables that are correlated with the protected attribute due to structural inequality. *Explanation:* Section 14.3.1 explains that proxy variables — features correlated with protected characteristics due to structural factors — allow models to reconstruct protected characteristics even when they are explicitly excluded. Zip code proxies for race (due to residential segregation), healthcare spending proxies for race (due to access disparities), and resume formatting may proxy for socioeconomic status. This phenomenon, called redundant encoding, means that "blinding" a model is a naive and typically ineffective intervention.

Section 2: True/False with Justification (1 point each)

For each statement, determine whether it is true or false and provide a brief justification.

11. "Algorithmic bias requires intentional discrimination by the system's developers."

Answer **False.** *Explanation:* Section 14.1.1 emphasizes that "the most dangerous biases are structural — embedded in the data, the features, and the optimization objectives by the same social forces that produce inequality in the first place." Nobody at Amazon told the hiring algorithm to penalize women; the algorithm learned it from biased data. Nobody at VitraMed told the risk model to underserve Black patients. Intent is not required for bias to occur; structural conditions are sufficient.

12. "The Obermeyer healthcare study demonstrated that using a different proxy variable — such as number of doctor visits instead of healthcare spending — would have eliminated the racial bias."

Answer **False.** *Explanation:* Section 14.5.3 explicitly cautions that *any* proxy can be biased if it correlates with social structures that differ across groups. Using "number of doctor visits" as a proxy for health need would introduce similar bias, because fewer visits could reflect less access rather than less need. The lesson is not to avoid a specific proxy but to evaluate whether any proxy's relationship to the target concept is consistent across the groups being served.

13. "If a machine learning model achieves 95% overall accuracy, it can be considered fair."

Answer **False.** *Explanation:* Section 14.3.1 (Stage 5: Evaluation) explains that overall accuracy can mask significant subgroup disparities. A model with 95% overall accuracy might have 98% accuracy for Group A and 78% accuracy for Group B — but if Group B is a small proportion of the dataset, the aggregate metric conceals the failure. Fairness requires disaggregated analysis, not just overall performance.
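
The masking is pure arithmetic. The sketch below uses hypothetical group sizes chosen so that 98% accuracy for a large Group A and 78% for a small Group B average to exactly the 95% headline figure:

```python
def overall_accuracy(groups: dict) -> float:
    """Weighted average of per-group accuracies:
    sum(n_i * acc_i) / sum(n_i)."""
    total = sum(n for n, _ in groups.values())
    return sum(n * acc for n, acc in groups.values()) / total

# (group size, accuracy) pairs — hypothetical numbers matching the example
groups = {"A": (8500, 0.98), "B": (1500, 0.78)}

print(f"Overall accuracy: {overall_accuracy(groups):.1%}")  # 95.0%
print(f"Group B accuracy: {groups['B'][1]:.0%}")            # 78%
```

The aggregate metric looks strong while more than one in five predictions for Group B is wrong, which is why disaggregated evaluation is required.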

14. "Feedback loops in biased algorithmic systems can transform one-time biases into self-reinforcing cycles of discrimination."

Answer **True.** *Explanation:* Section 14.8 describes how feedback loops work: a biased prediction leads to biased action (e.g., more policing in a neighborhood), which generates biased data (more detected crimes in that neighborhood), which confirms the original biased prediction. Over time, the bias compounds — the system becomes increasingly confident in a pattern it is partially creating. This self-reinforcing dynamic is one of the most dangerous properties of deployed algorithmic systems.
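
The compounding dynamic can be seen in a deliberately simple toy model (the numbers and the allocation rule are my own illustration, not from the chapter): two neighborhoods have identical true crime rates, but patrols go wherever recorded crime is highest, and only patrolled areas generate new records.

```python
def simulate_feedback(recorded, true_rate, patrols=10, steps=5):
    """Each step, send all patrols to the area with the most recorded
    crime; detections (patrols * true rate) are added to that area's
    record. Unpatrolled areas generate no new records at all."""
    recorded = list(recorded)
    for _ in range(steps):
        target = max(range(len(recorded)), key=lambda i: recorded[i])
        recorded[target] += patrols * true_rate[target]
    return recorded

# Identical true crime rates; only a tiny initial skew in the records.
result = simulate_feedback(recorded=[11, 10], true_rate=[0.5, 0.5])
print(result)  # [36.0, 10]: the gap grows with no underlying difference
```

After five steps the record gap has more than tripled even though the neighborhoods are identical: the system is confirming a pattern it is itself creating.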

15. "The COMPAS debate cannot be resolved by choosing better algorithms, because ProPublica and Northpointe were using different definitions of fairness that are mathematically incompatible."

Answer **True.** *Explanation:* Section 14.4.3 explains that ProPublica measured fairness by error-rate parity (equal false positive rates) while Northpointe measured fairness by calibration (equal predictive values at each score level). Both are mathematically correct, and — as Chapter 15 will demonstrate — they cannot be simultaneously satisfied when base rates differ across groups. The disagreement is definitional, not computational.

Section 3: Short Answer (2 points each)

16. Explain the concept of the "bias pipeline" (Section 14.3). Why is it important to understand that bias can enter at every stage of ML development, not just during data collection? Provide one example of bias entering at a stage other than data collection.

Sample Answer

The bias pipeline identifies six stages where bias can enter: problem formulation, data collection, feature engineering, model training, evaluation, and deployment. Understanding that bias can enter at every stage is important because it prevents the common misconception that bias is solely a data problem. If bias is treated as only a data issue, interventions focus narrowly on "better data" — but this ignores the biases introduced by how the problem is framed, which features are selected, how the model is evaluated, and how it is deployed.

Example at the problem formulation stage: A hospital decides to build a model that identifies patients who will "benefit most from care coordination." If "benefit" is defined as "cost savings," the model optimizes for reducing spending — which systematically disadvantages populations who historically spend less on healthcare due to access barriers. The bias enters before any data is collected, at the moment the problem is defined.

*Key points for full credit:*
  • Identifies all six pipeline stages
  • Explains why the multi-stage view matters
  • Provides a specific example at a non-data-collection stage

17. The chapter describes a "proxy trap" in the context of the Obermeyer healthcare study. In your own words, explain what a proxy trap is. Why did healthcare spending seem like a reasonable proxy for health need? What made it a trap?

Sample Answer

A proxy trap occurs when a readily available, seemingly reasonable variable is used as a stand-in for a concept that is harder to measure — but the proxy's relationship to the target concept is contaminated by structural inequality, producing biased results.

Healthcare spending seemed reasonable because it was readily available in claims databases, continuously updated, predictive of future costs, and operationally useful. Patients who spent more in the past tended to need more care in the future. The logic appeared sound.

It was a trap because healthcare spending measures healthcare *consumption*, not healthcare *need*. Black patients, at the same level of illness, generate lower healthcare costs than white patients because they have historically had less access to care — insurance gaps, geographic barriers, implicit physician bias, and socioeconomic constraints all reduce spending independent of health status. The algorithm interpreted their lower spending as lower need, when it actually reflected lower access. The proxy was statistically valid but socially biased.

*Key points for full credit:*
  • Defines what a proxy trap is
  • Explains why spending seemed reasonable
  • Explains the mechanism of the bias (spending measures access, not need)

18. Section 14.9 discusses intersectional bias. Explain the concept of intersectionality as applied to algorithmic systems. Why does the chapter argue that single-axis analysis (looking at race alone or gender alone) is "insufficient"?

Sample Answer

Intersectionality, a concept developed by Kimberlé Crenshaw, holds that individuals who belong to multiple marginalized groups may experience forms of discrimination that are qualitatively distinct from the discrimination faced by any single-axis group. A Black woman's experience is not simply "being Black" plus "being female" — it is a unique position that can involve specific forms of harm invisible when race and gender are analyzed separately.

In algorithmic systems, intersectional bias means that a system might perform adequately for Black men and for white women, but fail significantly for Black women. Buolamwini and Gebru's "Gender Shades" study demonstrated this: commercial facial recognition systems had error rates of up to 34.7% for dark-skinned women, compared to less than 1% for light-skinned men — a disparity that single-axis analysis by race alone or gender alone would have partially masked. If you only checked "accuracy by race," you might see a moderate gap; if you only checked "accuracy by gender," you might see a different moderate gap. Only the intersectional analysis reveals the full magnitude of the failure.

*Key points for full credit:*
  • Defines intersectionality and credits Crenshaw
  • Explains why single-axis analysis can mask the worst disparities
  • Provides a specific example (Gender Shades or similar)
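
The masking effect can be made concrete with per-subgroup error counts. The numbers below are hypothetical, loosely modeled on the Gender Shades pattern rather than taken from the study:

```python
# (errors, total) per intersectional subgroup — hypothetical counts
cells = {
    ("dark", "female"):  (35, 100),
    ("dark", "male"):    (12, 100),
    ("light", "female"): (7, 100),
    ("light", "male"):   (1, 100),
}

def error_rate(keys):
    """Pooled error rate across the listed subgroups."""
    errors = sum(cells[k][0] for k in keys)
    total = sum(cells[k][1] for k in keys)
    return errors / total

by_skin = error_rate([("dark", "female"), ("dark", "male")])       # ~0.235
by_gender = error_rate([("dark", "female"), ("light", "female")])  # ~0.21
intersection = error_rate([("dark", "female")])                    # 0.35

# Each single-axis slice shows only a moderate gap; the worst-case
# failure is visible only in the intersectional cell.
print(by_skin, by_gender, intersection)
```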

19. The chapter states: "A perfectly calibrated model that reflects a discriminatory reality is a perfectly precise instrument of discrimination" (Section 14.1.2). Explain what this means. How can a model be both accurate and harmful?

Sample Answer

A model can be both accurate and harmful when the data it is trained on reflects an unjust reality. If historical discrimination has produced patterns — fewer women in engineering, higher re-arrest rates in over-policed neighborhoods, lower healthcare spending by marginalized communities — then a model trained on this data will accurately reproduce these patterns. It will correctly predict, based on historical data, that women are less likely to be hired for engineering roles. The prediction is accurate. But acting on that prediction perpetuates the discrimination that produced the data.

The statement means that technical precision is not the same as social justice. A thermometer that accurately measures a fever does not cause the fever — but an algorithm that accurately predicts who has been historically disadvantaged and then allocates fewer resources to them is *using* historical injustice as a basis for present decisions. Mathematical accuracy does not confer moral legitimacy. The model is precise about the wrong thing — it is precisely encoding inequality and presenting it as neutral prediction.

*Key points for full credit:*
  • Explains how accuracy and harm can coexist
  • Connects to the concept of historical bias
  • Distinguishes between statistical validity and social fairness

Section 4: Applied Scenario (5 points)

20. Read the following scenario and answer all parts.

Scenario: TalentMatch AI

A large staffing agency deploys "TalentMatch AI," an algorithm that matches job seekers to open positions. The system was trained on five years of the agency's placement data — records of which candidates were placed in which positions and whether those placements were rated as "successful" (the candidate stayed at least 90 days and received a positive manager review).

TalentMatch AI uses the following features: years of experience, education level, skills listed on resume, employment gaps (number of months without employment in the past 10 years), zip code, commute distance to job location, and a "communication score" generated by an NLP analysis of the candidate's written application responses.

After six months, an internal audit reveals the following selection rates for the agency's most-placed job category (administrative assistant positions):

| Group | Applicants | Placed | Selection Rate |
|---|---|---|---|
| White applicants | 2,000 | 640 | 32.0% |
| Black applicants | 1,500 | 330 | 22.0% |
| Hispanic applicants | 1,200 | 264 | 22.0% |
| Asian applicants | 800 | 248 | 31.0% |

(a) Calculate the disparate impact ratio for each non-white group relative to the highest-selected group. Which groups, if any, fall below the four-fifths threshold? Show your work. (1 point)

(b) Identify at least three features in TalentMatch AI's design that could function as proxy variables for race or socioeconomic status. For each, explain the mechanism by which bias enters. (1 point)

(c) Trace this system through the bias pipeline (Section 14.3). Identify at least two specific stages where bias likely entered and explain how. (1 point)

(d) The staffing agency responds to the audit by saying: "The algorithm treats everyone equally — it uses the same features and the same model for every applicant. The disparate outcomes reflect differences in applicant qualifications, not algorithmic bias." Evaluate this defense using at least two concepts from Chapter 14. (1 point)

(e) Propose three specific interventions to reduce bias in TalentMatch AI. For each, identify which type of bias (from the taxonomy in Section 14.2) it addresses and what trade-offs, if any, it involves. (1 point)

Sample Answer

**(a)** The highest selection rate is White applicants at 32.0%.

  • Black applicants: 22.0 / 32.0 = **0.6875** — below the 0.8 threshold. **Flagged.**
  • Hispanic applicants: 22.0 / 32.0 = **0.6875** — below the 0.8 threshold. **Flagged.**
  • Asian applicants: 31.0 / 32.0 = **0.9688** — above the 0.8 threshold. Not flagged.

Both Black and Hispanic applicants have disparate impact ratios well below the four-fifths threshold, indicating potential disparate impact that warrants investigation.

**(b)** Three proxy features:

  • **Zip code:** Due to residential segregation, zip code is strongly correlated with race. Using zip code as a feature allows the model to effectively learn racial patterns through geographic proxies, potentially disadvantaging applicants from predominantly Black or Hispanic neighborhoods.
  • **Employment gaps:** Employment gaps may disproportionately affect women (due to maternity leave), formerly incarcerated individuals (disproportionately Black and Hispanic due to criminal justice disparities), and caregivers in communities with less access to childcare. The feature penalizes structural disadvantage rather than measuring job capability.
  • **Communication score (NLP):** NLP-generated communication scores can encode linguistic bias. If the training data disproportionately labeled candidates who used Standard American English as "successful," the model may penalize candidates who use African American Vernacular English, Spanish-influenced English, or other linguistic varieties. The "communication" being measured is cultural conformity, not communication competence.

**(c)** Two pipeline stages:

  • **Data collection (Stage 2):** The training data reflects five years of the agency's own placement decisions. If those decisions were influenced by human bias — conscious or unconscious preferences for certain candidates — the data encodes that bias. The algorithm then learns to reproduce the agency's historical discrimination as "optimal matching."
  • **Feature engineering (Stage 3):** Including zip code and employment gaps as features introduces proxy variables that correlate with race and socioeconomic status. These features may be statistically predictive of the agency's historical placement decisions precisely *because* those decisions were biased — creating a circular justification.

**(d)** The defense is flawed on two grounds. First, the concept of **disparate impact** (Section 14.7.1) holds that treating everyone "the same" can still produce discriminatory outcomes if the features used have differential impacts across groups. Formal equality of treatment does not guarantee substantive fairness. Second, the concept of **historical bias** (Section 14.2.1) means that "applicant qualifications" are not neutral facts — they are shaped by structural inequality. If employment gaps are more common among Black and Hispanic applicants due to systemic factors (incarceration disparities, caregiving burdens), then penalizing gaps perpetuates that inequality rather than measuring job capability. Treating structural disadvantage as an individual qualification deficit is the mechanism by which historical bias becomes algorithmic bias.

**(e)** Three interventions:

  1. **Remove zip code and commute distance** (addresses measurement bias and proxy variables). These features are proxies for race and socioeconomic status. Trade-off: the model may lose some predictive accuracy for placement success, since proximity to the workplace may genuinely affect retention. Consider replacing with a binary "within commuting range" feature.
  2. **Retrain the communication score on diverse, bias-audited text samples** (addresses representation bias and measurement bias). Ensure the NLP model is trained on a linguistically diverse corpus and validated across dialects. Trade-off: requires investment in building a diverse training set and ongoing monitoring.
  3. **Add a bias-aware evaluation stage** (addresses evaluation bias). Evaluate model performance separately for each demographic group, not just in aggregate. Set explicit thresholds for group-level performance and do not deploy if disparate impact ratios fall below 0.8. Trade-off: may reduce overall accuracy if the model must be adjusted to equalize performance across groups.
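
Part (a) of the sample answer can be verified mechanically. This sketch recomputes the selection rates and disparate impact ratios directly from the audit counts in the scenario:

```python
# (applicants, placed) from the TalentMatch AI audit table
applicants = {
    "White": (2000, 640),
    "Black": (1500, 330),
    "Hispanic": (1200, 264),
    "Asian": (800, 248),
}

# Selection rate per group, then ratio to the highest-selected group
rates = {g: placed / n for g, (n, placed) in applicants.items()}
reference = max(rates.values())  # 0.32 (White applicants)
ratios = {g: r / reference for g, r in rates.items()}
flagged = sorted(g for g, r in ratios.items() if r < 0.8)

print({g: round(r, 4) for g, r in ratios.items()})
# {'White': 1.0, 'Black': 0.6875, 'Hispanic': 0.6875, 'Asian': 0.9688}
print(flagged)  # ['Black', 'Hispanic']
```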

Scoring & Review Recommendations

| Score Range | Assessment | Next Steps |
|---|---|---|
| Below 50% (< 15 pts) | Needs review | Re-read Sections 14.1-14.4 carefully, redo Part A exercises |
| 50-69% (15-20 pts) | Partial understanding | Review specific weak areas, focus on the COMPAS and Obermeyer case studies |
| 70-85% (21-25 pts) | Solid understanding | Ready to proceed to Chapter 15; review any missed topics |
| Above 85% (> 25 pts) | Strong mastery | Proceed to Chapter 15: Fairness — Definitions, Tensions, and Trade-offs |

| Section | Points Available |
|---|---|
| Section 1: Multiple Choice | 10 points (10 questions × 1 pt) |
| Section 2: True/False with Justification | 5 points (5 questions × 1 pt) |
| Section 3: Short Answer | 8 points (4 questions × 2 pts) |
| Section 4: Applied Scenario | 5 points (5 parts × 1 pt) |
| **Total** | **28 points** |