Appendix H: Answers to Selected Exercises
This appendix provides answers, model responses, and rubric criteria for selected exercises from each chapter. For conceptual questions, we provide a model answer. For open-ended questions, we provide rubric criteria indicating what a strong response should include. Exercises are identified by their original numbering (e.g., "A3" means Part A, question 3).
Chapter 1: What Is Artificial Intelligence?
A3. Explain the difference between narrow AI and general AI.
Model answer: Narrow AI (also called weak AI) is a system designed to perform a specific task or small set of related tasks, often at superhuman levels within that domain. All existing AI systems are narrow AI. Examples: a chess engine, a spam filter, a language translation tool. General AI (also called strong AI or AGI) would be a system with the breadth and flexibility of human cognition — able to learn new tasks without specific training, transfer knowledge across domains, and reason about novel situations. General AI does not exist as of 2026. Example: the computer HAL 9000 from 2001: A Space Odyssey is a fictional depiction of what general AI might look like.
A5. Explain the AI effect.
Model answer: The AI effect is the tendency for people to redefine "intelligence" to exclude whatever machines can already do. Each time AI achieves a milestone once considered a hallmark of intelligence (playing chess, understanding speech, generating text), people dismiss it as "not really AI" and move the goalposts. The effect matters because it distorts public understanding: it causes people to simultaneously underestimate what current AI can do and overestimate what "real" AI would look like.
B3. Using the FACTS Framework, write three specific questions you would want answered before accepting the claim "AI Outperforms Doctors in Diagnosing Skin Cancer."
Rubric criteria: A strong response identifies specific, probing questions. Examples:
- (F) What specific diagnostic task was tested? All skin cancers or one type? From images alone or with patient history?
- (A) How was accuracy measured? On what dataset? Were the doctors given the same conditions as the AI (time pressure, image quality)? Were results broken down by skin tone?
- (T) What images was the AI trained on? Were all skin tones, ages, and lesion types represented?
- (C) If this system were deployed, who would benefit and who might be harmed? What happens to patients whose skin tones were underrepresented in training?
- (S) If the AI misses a malignant lesion, who is liable — the AI developer, the hospital, or the doctor who relied on it?
D2. Identify one concern common to all four anchor systems.
Rubric criteria: A strong answer identifies a genuine cross-cutting concern (not a vague one) and explains why it appears across domains. Examples of valid answers: (1) All four systems make decisions about people based on historical data, and historical data encodes historical inequities. (2) All four involve automation of judgment that was previously made by humans, raising questions about accountability when errors occur. (3) All four affect people who had no say in the system's design or deployment — the affected populations are not the customers. A weak answer names a concern without explaining why it is systemic.
Chapter 2: A Brief History of AI
A1. What is the Turing Test?
Model answer: The Turing Test, proposed by Alan Turing in 1950, is a behavioral test for machine intelligence. In the test, a human evaluator communicates via text with both a machine and a human, without knowing which is which. If the evaluator cannot reliably distinguish the machine from the human, the machine is said to have "passed" the test. Turing proposed a behavioral test rather than defining "thinking" directly because he recognized that defining thinking is philosophically intractable. By focusing on observable behavior, he sidestepped the question of whether the machine is "really" thinking.
A3. What is an AI winter?
Model answer: An AI winter is a period of reduced funding, interest, and progress in AI research, typically following a cycle of overhyped promises and disappointing results. The sequence: (1) Researchers and advocates make ambitious claims about AI's potential. (2) Funding agencies and the public develop high expectations. (3) The technology fails to deliver on those expectations within the promised timeframe. (4) Disillusionment sets in. (5) Funding is cut, researchers leave the field or rebrand their work, and progress slows. Two major AI winters occurred: roughly 1974-1980 and 1987-1993.
B3. Three follow-up questions about "AI Now Outperforms Doctors at Diagnosing Skin Cancer."
Rubric criteria: Drawing on Pattern 4 ("Demonstration Is Not Deployment"), a strong response asks questions about the gap between controlled performance and real-world use. Examples: (1) Was this demonstrated in laboratory conditions on curated images, or in a real clinical setting with real patients? (2) How does the system perform on patient populations it was not trained on — different skin tones, rare conditions, images taken with different equipment? (3) Has any hospital actually deployed this system, and if so, what were the outcomes?
D4. Was the Dartmouth proposal vindicated?
Rubric criteria: A strong response avoids a simple yes/no and engages with specific evidence. Arguments for vindication: LLMs can write essays, pass exams, code software — demonstrating simulation of many "features of intelligence." Arguments against: no system has the flexibility, common sense, or transfer ability the proposal envisioned. A nuanced response notes that the proposal is partially vindicated (specific aspects of intelligence have been simulated) but the word "every" in "every aspect of learning" remains unachieved.
Chapter 3: How Machines Learn
A1. Define supervised, unsupervised, and reinforcement learning.
Model answer: Supervised learning: The model learns from labeled examples — input-output pairs where the correct answer is provided. Like learning from a teacher who shows you problems with solutions. Example: email spam classification, where each email is labeled spam or not-spam. Unsupervised learning: The model finds patterns and structures in data without labeled examples. Like sorting a jar of mixed buttons by discovering groupings based on color, size, and shape — without being told what the categories are. Example: customer segmentation from purchasing data. Reinforcement learning: An agent learns by taking actions in an environment and receiving rewards or penalties. Like learning to ride a bicycle — you get feedback (staying upright = good, falling = bad) through trial and error. Example: training a game-playing AI.
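The supervised case can be made concrete with a toy sketch: a one-nearest-neighbor classifier that "learns" only by storing labeled examples and predicts whichever label the closest stored example has. The features (link count, ALL-CAPS word count) and the data are invented for illustration, not taken from any real spam filter.

```python
# Toy supervised learning: 1-nearest-neighbor on labeled (features, label) pairs.
# Features per email: (number_of_links, number_of_ALL_CAPS_words) -- invented.

labeled_examples = [
    ((9, 7), "spam"),
    ((8, 5), "spam"),
    ((1, 0), "not-spam"),
    ((2, 1), "not-spam"),
]

def predict(features):
    """Return the label of the nearest labeled training example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(labeled_examples, key=lambda ex: squared_distance(ex[0], features))
    return nearest[1]

print(predict((7, 6)))  # near the spam examples
print(predict((0, 1)))  # near the not-spam examples
```

The point of the sketch is the shape of the problem, not the algorithm: supervised learning always starts from input-output pairs where the correct answer is supplied.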
A4. Explain overfitting using a non-technical analogy.
Rubric criteria: The analogy must make clear two things: (1) excellent performance on familiar material and (2) poor performance on anything new. Example model answer: Overfitting is like a student who memorizes every question and answer from past exams instead of understanding the underlying concepts. On an exam that repeats past questions, they score perfectly. On an exam with new questions on the same material, they fail — because they learned the specific answers, not the underlying patterns.
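The memorizing-student analogy can be mirrored in a few lines of code. The "memorizer" below stores the training answers verbatim, while the "generalizer" has learned the underlying rule (here, even vs. odd); the task is invented purely to make the contrast visible.

```python
# Toy sketch of overfitting: memorization vs. learning the pattern.
# Task (invented): classify whole numbers as "even" or "odd".

training_data = {2: "even", 4: "even", 7: "odd", 9: "odd"}

def memorizer(x):
    """Perfect on examples it has seen, clueless on anything new."""
    return training_data.get(x, "no idea")

def generalizer(x):
    """Learned the underlying rule, so it handles unseen inputs."""
    return "even" if x % 2 == 0 else "odd"

print(memorizer(4), generalizer(4))    # both correct on training data
print(memorizer(10), generalizer(10))  # only the generalizer handles new input
```

The memorizer scores 100% on the "past exam" and fails the "new exam", which is exactly the two-part behavior the rubric requires.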
B1. Identify the appropriate learning type for each task.
Model answers: (a) Unsupervised learning — no pre-defined categories means you need the model to discover structure. (b) Reinforcement learning — the robot needs to learn from trial and error in its environment, receiving feedback on what works. (c) Supervised learning — historical data with known outcomes (defaulted or not) provides labels. (d) Unsupervised learning (specifically anomaly detection) — you're looking for unusual patterns without labels for what "attack" looks like. (e) Reinforcement learning — the program learns from wins and losses.
Chapter 4: Data
A1. Define in your own words.
Model answers: (a) Structured data: Data organized in a fixed, predictable format — like a spreadsheet with rows and columns. Each entry has a defined type and location. Example: a hospital's patient database with columns for name, age, blood pressure, and diagnosis. (b) Unstructured data: Data without a predefined format or organization — like free-form text, images, audio, or video. A machine learning system must extract meaningful features from it. Example: a collection of customer email complaints. (c) Labeled data: Data that has been annotated with the correct answer or category for each example, used for supervised learning. Example: a dataset of X-ray images where each image is tagged "fracture" or "no fracture" by a radiologist. (d) Ground truth: The correct, verified real-world information against which an AI system's predictions are measured. Example: a biopsy result confirming whether a tumor is malignant, against which a diagnostic AI's prediction is compared.
B3. Proxy variable problem — bank loan AI.
Model answers: (a) Variables that could serve as proxies for race: zip code (due to residential segregation), educational institution attended (due to historical access disparities), and first name (naming conventions vary across racial groups). (b) Removing race is insufficient because the model can learn to use these proxy variables to reconstruct racial information indirectly. The patterns of discrimination are encoded in many correlated features, not just the explicit variable. (c) Alternative approaches: test the model's outcomes for disparate impact across racial groups regardless of whether race is an input; use fairness-aware algorithms that constrain the model to produce equitable outcomes; conduct regular bias audits on the model's decisions.
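The outcome-testing approach in (c) can be sketched in code: compare approval rates across groups after the fact, regardless of what inputs the model used. The groups, decisions, and the 80% comparison threshold (the "four-fifths rule" used in US employment law as a rough screen) are illustrative, not a complete audit methodology.

```python
# Toy disparate-impact check: compare approval rates across groups,
# independent of whether group membership was a model input.
# Data is invented for illustration.

decisions = [  # (group, approved)
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

def approval_rate(group):
    outcomes = [approved for g, approved in decisions if g == group]
    return sum(outcomes) / len(outcomes)

ratio = approval_rate("group_b") / approval_rate("group_a")
print(f"approval ratio: {ratio:.2f}")  # a ratio below 0.80 is a common red flag
```

A real audit would use far more data, confidence intervals, and multiple fairness metrics; the sketch only shows that outcome disparities are measurable even when the sensitive variable is absent from the model.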
D1. The consent dilemma — argue the position you disagree with.
Rubric criteria: The key skill being tested is the ability to articulate the strongest version of an opposing argument. A strong response: (1) presents the opposing position charitably and accurately, (2) identifies its strongest supporting reasons, (3) does not construct a straw man, and (4) reflects honestly on what was learned from the exercise. The reflection should identify at least one genuine insight gained from engaging with the opposing view.
Chapter 5: Large Language Models
5.1. Explain next-token prediction to someone who has never heard of AI.
Model answer: Imagine you're playing a game where someone starts a sentence and you have to guess the next word. "The cat sat on the ___." You'd probably guess "mat" or "chair" because you've read millions of sentences and know what words typically follow others. A large language model does exactly this — it predicts the most likely next word, and then uses that prediction to predict the word after that, building a response one word at a time. (Strictly speaking, the model predicts tokens, which are often fragments of words rather than whole words, but the principle is the same.)
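The guessing game above can be sketched as a tiny count-based predictor. Real LLMs use neural networks over tokens rather than a count table, but the predict-the-next-word objective is the same; the miniature corpus below is invented.

```python
# Toy next-word predictor built from bigram counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat sat on the chair .".split()

# Count which word follows which word in the corpus.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the word most frequently seen after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("cat"))  # the only word ever seen after "cat" is "sat"
print(predict_next("the"))  # "cat" follows "the" more often than "mat" or "chair"
```

Generating text is then just this step in a loop: predict a word, append it, and predict again from the extended sequence.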
5.2. Temperature for legal contract vs. creative story.
Model answer: Legal contract: low temperature. You want the model to choose the most probable, conventional words and phrases. Legal language should be precise and predictable — creative variation is a liability. Creative story: high temperature. You want the model to make less predictable word choices, producing more surprising and original text. Higher temperature introduces more randomness into the selection process, which can produce more creative (though also less reliable) outputs.
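The effect of temperature can be shown numerically with the standard softmax-with-temperature formula. The raw scores (logits) below are made up, but the pattern is general: dividing logits by a small temperature sharpens the distribution, and dividing by a large one flattens it.

```python
# How temperature reshapes a next-word probability distribution.
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # invented scores for, say, "mat", "chair", "moon"

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Low temperature: the top word dominates (predictable legal prose).
# High temperature: probability spreads out (more surprising story choices).
```

This is why the same model can be conservative for a contract and adventurous for fiction: the scores are unchanged, only the sampling distribution built from them differs.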
5.7. RLHF evaluators from homogeneous backgrounds.
Rubric criteria: A strong response identifies at least two specific examples. Examples: (1) Cultural humor: A joke that is funny in one culture but offensive in another might be ranked differently. Western-centric evaluators might rate a model response as "good" even when it reflects cultural assumptions that would be inappropriate or inaccurate in other contexts. (2) Directness vs. indirectness: Some cultures value direct, assertive responses while others prefer indirect, contextual communication. Evaluators from direct-communication cultures might rank blunt answers higher, training the model to communicate in a style that feels rude in other cultural contexts.
Chapter 6: Computer Vision
A1. Pixel basics.
Model answers: (a) 100 x 100 = 10,000 pixel values. (b) 100 x 100 x 3 = 30,000 values. (c) A human seeing an image instantly recognizes objects, context, meaning, and emotion. A computer sees only a grid of numbers — it has no concept of "cat," "chair," or "danger." The numbers must be processed through multiple layers of computation before the computer can extract any meaningful patterns, and even then it is recognizing statistical regularities, not understanding what the image depicts.
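The arithmetic in (a) and (b) can be verified with plain nested lists standing in for an image array:

```python
# Counting the raw numbers behind a 100x100 image.
width, height = 100, 100

# (a) Grayscale: one brightness value (0-255) per pixel.
grayscale = [[0] * width for _ in range(height)]
grayscale_values = sum(len(row) for row in grayscale)
print(grayscale_values)  # 10000

# (b) Color: three values (red, green, blue) per pixel.
color = [[(0, 0, 0)] * width for _ in range(height)]
color_values = sum(3 * len(row) for row in color)
print(color_values)  # 30000
```

As point (c) emphasizes, this grid of numbers is all the computer ever receives; everything else must be computed from it.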
B1. What CNN layers detect.
Model answer: (a) First layer: simple edges and lines — horizontal, vertical, and diagonal. (b) Middle layers: combinations of edges forming textures and shapes — curves, corners, patterns like fur or spots. (c) Final layers: complete object parts and whole objects — ears, snout, tail, then "dog" as a holistic concept. (d) Hierarchical detection is more effective because it builds complexity gradually, allowing the network to recognize the same object regardless of position, size, or orientation — much as human vision works.
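The first-layer behavior in (a) can be illustrated with a hand-coded filter. The sketch below slides a vertical-edge kernel over an invented 5x5 image; a trained CNN learns kernels like this from data rather than having them written by hand.

```python
# A hand-coded vertical-edge detector, of the kind a CNN's first layer learns.

image = [  # 5x5 grayscale image: dark left half, bright right half (invented)
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]

kernel = [  # responds strongly where brightness changes left-to-right
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, ker):
    """Slide the 3x3 kernel over the image (no padding)."""
    out = []
    for i in range(len(img) - 2):
        row = []
        for j in range(len(img[0]) - 2):
            total = sum(ker[a][b] * img[i + a][j + b]
                        for a in range(3) for b in range(3))
            row.append(total)
        out.append(row)
    return out

for row in convolve(image, kernel):
    print(row)  # large values mark the vertical edge between dark and bright
```

Middle and final layers then combine the outputs of many such filters, which is the hierarchy described in (b) through (d).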
Chapter 7: AI Decision-Making
7.1. Mode identification.
Model answers: (a) Recommendation — suggesting content the user might enjoy. (b) Classification — sorting luggage into "flagged" or "cleared" categories. (c) Prediction — forecasting a future value. (d) Classification — sorting posts into "violates" or "does not violate." (e) Prediction — estimating a future probability. (f) Recommendation — ranking items for a user. (g) Prediction — estimating a future event (patient deterioration).
7.2. False positive vs. false negative consequences.
Model answer (example — credit card fraud): (a) False positive: A legitimate transaction is flagged as fraud — the customer's card is temporarily blocked, causing inconvenience and potentially a missed purchase. False negative: A fraudulent transaction goes undetected — the customer or bank loses money and must deal with the aftermath. (b) In this context, false negatives are generally more harmful because they result in direct financial loss and the erosion of trust in the system. However, excessive false positives can also be harmful if they disproportionately affect certain populations (e.g., international travelers or people with unusual purchasing patterns).
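The two error types can be counted mechanically once each decision is recorded as a (truth, prediction) pair. The transactions below are invented for illustration.

```python
# Counting false positives and false negatives for a toy fraud classifier.

results = [  # (actually_fraud, flagged_as_fraud) per transaction -- invented
    (False, False),  # legitimate, cleared: true negative
    (False, True),   # legitimate, blocked: FALSE POSITIVE (annoyed customer)
    (True, True),    # fraud, caught: true positive
    (True, False),   # fraud, missed: FALSE NEGATIVE (money lost)
    (False, False),
    (True, True),
]

false_positives = sum(1 for actual, flagged in results if flagged and not actual)
false_negatives = sum(1 for actual, flagged in results if actual and not flagged)
print(false_positives, false_negatives)
```

Deciding which count matters more is the value judgment the exercise asks about; the counting itself is the easy part.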
Chapter 8: When AI Gets It Wrong
8.1. Failure type classification.
Model answers: (a) Type 2: Hallucination — the AI generates plausible but fabricated information (a treaty that never existed). (b) Type 3: Distributional shift — the model was trained in clear Phoenix conditions and encounters fog, a condition outside its training distribution. (c) Type 1: Wrong answer — a straightforward error in matching due to changed appearance. (d) Type 4: Adversarial — the input was deliberately modified to fool the classifier. (e) Type 5: Cascading failure — an error in one system (transcription) propagates through another system (prescribing AI), compounding the harm. (f) Type 3: Distributional shift — language patterns changed between training period and deployment.
8.2. Confidence score scenarios.
Model answers: (a) Appropriate — the user treats the probability as a guide for preparation, not a certainty. (b) Inappropriate — a single confidence score should not be the sole basis for an academic dishonesty charge. The instructor should review the essay directly, consider other evidence, and give the student an opportunity to explain. Automation bias is at work here. (c) Appropriate — the doctor does not blindly accept the AI's assessment but orders a definitive test. This is a good example of human-AI collaboration. (d) Problematic — using a rigid numerical cutoff eliminates candidates based on an opaque score, without human judgment about fit, potential, or circumstances.
Chapter 9: Bias and Fairness
9.1. Pipeline stage identification.
Model answers: (a) Data collection / representation bias — African and Indigenous languages are underrepresented in training corpora because there is less digital text available in those languages. (b) Historical bias — 30 years of lending data encodes decades of documented discrimination. (c) Measurement bias — "cultural fit" is a subjective, poorly defined label that may encode homogeneity preferences rather than genuine job-relevant qualities. (d) Representation bias — 94% white patients means the model has not learned diagnostic patterns for other groups. (e) Deployment / feedback loop — optimizing for clicks (the deployment metric) systematically promotes sensational content, which may disproportionately affect certain communities.
9.3. Fairness trade-off — surgical follow-up.
Model answers: (a) Demographic parity would flag the same proportion of older and younger patients, even though older patients actually need follow-up at four times the rate. This would result in many older patients being missed (false negatives) and many younger patients being unnecessarily flagged (false positives). Older patients who need care would be harmed. (b) Calibration preserves the meaning of risk scores across groups. A score of 70 means 70% probability regardless of age. But since the base rate is higher for older patients, more of them will receive high scores — so the system will flag a larger proportion of older patients. This is statistically appropriate but may look like age discrimination to someone unfamiliar with the math. (c) Equalized odds ensures equal error rates across groups. But when groups have different base rates, equalized odds is mathematically incompatible with calibration: forcing equal false positive and false negative rates generally requires applying different score thresholds to different groups, so the same score no longer carries the same meaning for every patient. (d) In this context, calibration is arguably most appropriate because the purpose is clinical — you want the score to accurately reflect the actual probability of complications for each individual patient. Misleading scores could lead to worse health outcomes.
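The tension between these metrics can be seen on a handful of invented patient records. In the sketch below, demographic parity is a comparison of flag rates and equalized odds is (in part) a comparison of false negative rates; the data is constructed so that flag rates match while error rates do not.

```python
# Checking two fairness notions on invented data: flag rates (demographic
# parity) vs. false negative rates (one half of equalized odds).

patients = [  # (group, needs_followup, flagged) -- values are illustrative
    ("older", True, True), ("older", True, True), ("older", True, False),
    ("older", False, False),
    ("younger", True, True),
    ("younger", False, False), ("younger", False, True), ("younger", False, False),
]

def flag_rate(group):
    rows = [flagged for g, _, flagged in patients if g == group]
    return sum(rows) / len(rows)

def false_negative_rate(group):
    needy = [flagged for g, needs, flagged in patients if g == group and needs]
    return 1 - sum(needy) / len(needy)

for g in ("older", "younger"):
    print(g, round(flag_rate(g), 2), round(false_negative_rate(g), 2))
# Both groups are flagged at the same rate, yet older patients who need
# follow-up are missed more often: parity holds while error rates diverge.
```

With different base rates, this trade-off is not a fixable bug but a mathematical property of the metrics, which is why the choice among them is a value judgment.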
Chapter 10: AI and Work
Selected rubric criteria for applied exercises:
A strong response to exercise questions in this chapter should: (1) distinguish between tasks and jobs; (2) acknowledge that historical automation has generally transformed rather than eliminated employment; (3) identify specific tasks within an occupation that are or are not susceptible to automation, rather than making blanket claims about entire professions; and (4) address distributional impacts — who within the workforce is most affected and why.
Chapters 11–16: Selected Rubric Criteria
For the open-ended exercises in Chapters 11 through 16, strong responses should:
- Ch. 11 (Creativity): Distinguish between what AI does (recombine patterns from training data) and what creativity is (a contested concept). Avoid both the extreme that AI is "truly creative" and the extreme that AI-generated content is "worthless."
- Ch. 12 (Privacy): Analyze privacy as a power relationship, not just a personal preference. Identify who collects data, who benefits from it, and who is harmed by its collection. Go beyond "I have nothing to hide."
- Ch. 13 (Governance): Compare regulatory approaches using specific criteria (enforcement mechanisms, scope, flexibility, democratic input) rather than vague claims. Acknowledge trade-offs between innovation and protection.
- Ch. 14 (Using AI): Demonstrate the ability to evaluate AI outputs critically, not just accept or reject them wholesale. Show awareness of when AI tools are appropriate and when they are not.
- Ch. 15 (Healthcare): Integrate evidence about AI diagnostic accuracy with concerns about equity, trust, and the doctor-patient relationship. Avoid both uncritical techno-optimism and blanket rejection of healthcare AI.
- Ch. 16 (Education): Consider perspectives of students, educators, and administrators. Address equity implications — who benefits from educational AI and who is left behind.
Chapter 17: AI and Justice
A1. Crime data vs. policing data.
Model answer: "Crime data" implies an objective record of criminal activity, but what we actually have is "policing data" — records of police activity, including arrests, reports, and patrols. These are not the same thing. Crime data reflects which crimes were detected and reported, which depends heavily on where police patrol, which communities call for service, and what offenses are prioritized. In neighborhoods with heavy police presence, more minor offenses are detected and recorded, inflating apparent crime rates. In neighborhoods with less policing, identical behavior goes unrecorded. An AI trained on policing data learns where police have been, not necessarily where crime actually occurs.
A6. The accountability gap.
Model answer: The accountability gap is the situation in which no single actor in the AI system's development, deployment, and use chain accepts responsibility for harmful outcomes. Four actors in the chain for a sentencing risk assessment tool: (1) The tool's developer may claim the tool only provides information — the judge makes the decision. (2) The judge may claim they were relying on a validated, scientifically developed tool. (3) The government agency that procured the tool may claim they followed best practices by adopting evidence-based tools. (4) The validation researchers may claim they validated accuracy on average but cannot guarantee individual outcomes. Each actor can plausibly deflect responsibility, leaving the person harmed by an incorrect assessment with no clear path to accountability.
D1. Which fairness metric for criminal justice?
Rubric criteria: There is no single "right" answer. A strong response: (1) names a specific metric, (2) explains why it is most appropriate in this context, (3) honestly acknowledges the trade-offs, and (4) addresses how affected communities should be involved. For example: an argument for equalized odds (equal false positive and false negative rates across groups) emphasizes that the justice system should not impose higher false accusation rates on any group. An argument for calibration emphasizes that a risk score should mean the same thing regardless of race. Both are defensible; the key is demonstrating awareness that the choice is a value judgment, not a technical solution.
Chapters 18–21: Selected Rubric Criteria
- Ch. 18 (Environment): Strong responses quantify claims when possible, distinguish between training and inference costs, and acknowledge the tension between AI as a contributor to and potential solution for environmental problems.
- Ch. 19 (Global Perspectives): Strong responses avoid Western-centric framing, consider the Global South as both a site of impact and a source of innovation, and analyze the power dynamics embedded in AI supply chains.
- Ch. 20 (Safety): Strong responses distinguish between near-term safety issues (which have documented evidence) and long-term existential risk (which is more speculative), and engage seriously with both rather than dismissing either.
- Ch. 21 (Road Ahead): Strong responses synthesize concepts from multiple chapters, apply durable frameworks rather than making technology-specific predictions, and articulate a personal commitment grounded in specific actions rather than vague aspirations.
A Note on Grading
These answers and rubrics are guidelines, not rigid scoring keys. Many questions in this textbook are deliberately open-ended because AI literacy requires the ability to form and defend positions on contested questions. An answer that disagrees with the model answer above but is well-reasoned, evidence-based, and intellectually honest can be excellent. The goal is not to produce a "correct" opinion about AI but to demonstrate the ability to think critically, evaluate evidence, consider multiple perspectives, and communicate clearly.