Answers to Selected Exercises

This appendix provides model answers for selected exercises from each chapter. Answers are chosen to illustrate reasoning processes, not merely final conclusions. For Python chapters (10, 14, 15, 22, 27, 29, 34), code solutions are included with explanations.

How to use this appendix: Attempt each exercise fully before consulting the answer. The value of these exercises lies in the reasoning process. Where an answer says "your response should include," this indicates that multiple valid approaches exist and the answer provides one model rather than the only correct response.


Chapter 1: The Data All Around Us

A.1. A dataset is a reduction because it captures only selected aspects of the reality it represents, much as a photograph captures only one angle, at one moment, under one set of lighting conditions. What is included reflects the priorities, assumptions, and limitations of whoever collected the data. An example: a university's student records dataset includes GPA, major, and enrollment status, but omits students' motivations for attending, their family obligations, their commute times, or their mental health. For students who withdraw due to caregiving responsibilities, what the dataset leaves out -- the reason for withdrawal -- may be more significant than what it captures, because the absence of that information leads administrators to attribute attrition to academic failure rather than structural barriers.

A.3. The claim that metadata "isn't a big deal" is incorrect because metadata, in aggregate, can reveal as much as or more than content. The MetaPhone study (Mayer and Mutchler) demonstrated that phone call metadata -- who called whom, when, for how long -- could accurately infer sensitive attributes including medical conditions, religious affiliations, and political activities. A single metadata record may seem innocuous; an entire year of metadata creates a detailed behavioral portrait. Furthermore, metadata is often collected and retained more extensively than content because organizations treat it as non-sensitive, which means it is typically subject to weaker access controls and longer retention periods.

A.5. The most relevant reason is that data exhaust is generated without deliberate intent and is therefore not covered by meaningful consent. Eli's experience with Smart City sensors illustrates this: the sensors collect data about his movements as an incidental byproduct of his daily life -- walking to work, visiting friends, spending time in public spaces. He did not choose to generate this data, was not asked for consent, and may not even be aware it is being collected. This exemplifies the chapter's point that data exhaust creates a surveillance infrastructure that operates beneath the threshold of awareness.


Chapter 2: A Brief History of Data and Society

2.1. The Latin word censere means "to assess" or "to tax." This etymology reveals that data collection has been linked to state power from its inception -- the census was not a neutral exercise in counting people but a mechanism for extracting resources (taxes) and asserting authority (military conscription). Understanding this origin matters because it challenges the view that data collection is inherently benign or merely administrative; it reminds us that the purpose of collecting data about people has always been, at least in part, to exercise power over them.

2.4. Digital enclosure refers to the process by which activities that previously occurred outside the reach of data collection systems have been brought within them through digitization. It differs from earlier data collection in three ways: (1) scale -- digital enclosure captures billions of interactions per day, not periodic snapshots; (2) awareness -- many people do not realize their activities are being recorded, unlike a census where the act of data collection is explicit; and (3) consent -- earlier forms of data collection typically involved some awareness and tacit or explicit participation, while digital enclosure captures data passively, often through terms of service that few people read.

2.8. The British colonial census in India illustrates how data collection can construct the social reality it claims to merely describe. Before the census, caste was a fluid, regionally varied system with significant local variation. The census required administrators to assign every individual to a fixed caste category, creating a pan-Indian taxonomy that did not previously exist. This process of classification had real consequences: it hardened previously flexible social boundaries, created new inter-caste hierarchies (because census categories were ranked), and gave communities incentives to campaign for higher classification. The mechanism is performativity: a data system that claims to describe social categories can, through its authority and institutional effects, create and solidify those categories. A contemporary parallel is how social media platforms' identity categories (gender options, relationship statuses, political affiliations) shape how people understand and present their own identities.


Chapter 3: Who Owns Your Data?

A.1. The Lockean labor theory holds that you own what you create through your labor, mixed with what nature provides. Applied to data, this generates a puzzle: when you use a fitness tracker, your body produces the physiological data (your heartbeat, your steps), but the company built the device and the algorithms that process and interpret the data. The question of who "labored" to produce the data has no clear answer because the data would not exist without either the user's body or the company's technology. This is why simple property theories struggle with data ownership -- data is co-produced, making single-party ownership claims inherently problematic.

A.5. Indigenous data sovereignty asserts that indigenous peoples have the right to govern data about their communities, lands, and cultural knowledge according to their own values and governance structures. This is not simply a privacy claim; it is a sovereignty claim rooted in the recognition that data has historically been used as an instrument of colonial power over indigenous peoples. The CARE Principles (Collective benefit, Authority to control, Responsibility, Ethics) operationalize this sovereignty. The chapter argues that indigenous data sovereignty challenges the Western assumption that data governance is primarily about individual rights, because indigenous frameworks center collective rights, relational accountability, and intergenerational obligations.


Chapter 4: The Attention Economy

A.3. A dark pattern is a user interface design that manipulates users into actions they did not intend or would not choose if fully informed. Dark patterns differ from persuasion in that persuasion presents information to influence a voluntary choice, while dark patterns exploit cognitive biases to circumvent informed choice. The chapter identifies five categories: (1) misdirection -- drawing attention away from important information, (2) hidden costs -- concealing charges or consequences, (3) forced continuity -- making cancellation difficult, (4) confirmshaming -- using guilt to discourage an opt-out, and (5) roach motel -- making it easy to subscribe but hard to leave. Each of these is problematic not merely because it produces a bad outcome for the user, but because it undermines the autonomy that consent is supposed to protect.

B.3. Variable-ratio reinforcement schedules -- the same mechanism that makes slot machines addictive -- are embedded in social media feeds through unpredictable reward timing. The user does not know when the next "like," comment, or interesting post will appear, which makes the act of scrolling itself compulsive. From a governance perspective, this is significant because it calls into question whether user engagement metrics (time on site, daily active users) actually measure the value a service provides. If engagement is driven by compulsive behavior rather than genuine preference, then the standard economic argument -- that engagement demonstrates user satisfaction -- collapses. The governance implication is that engagement metrics cannot serve as evidence that users are making voluntary choices to spend their time on a platform.


Chapter 5: Power, Knowledge, and Data

A.1. Foucault's power/knowledge concept holds that power and knowledge are not separate phenomena but are mutually constitutive: power shapes what counts as knowledge, and knowledge enables the exercise of power. In data governance, this manifests in three ways discussed in the chapter. First, the power to define categories (what data to collect, how to label it, what counts as "normal") shapes what can be known. Second, the knowledge produced by data systems (risk scores, behavioral predictions) enables institutions to exercise power over individuals. Third, the asymmetry of knowledge -- organizations know much about individuals while individuals know little about the algorithms applied to them -- is itself a form of power. This is why data governance cannot be reduced to technical questions about data quality or security; it is fundamentally about who has the power to define, collect, analyze, and act on knowledge about others.

A.5. Data asymmetry occurs when one party has significantly more data about the other party than vice versa. The chapter distinguishes between informational asymmetry (differential access to information) and analytical asymmetry (differential capacity to process and act on information). A platform like a ride-sharing service has both: it knows the rider's location, travel history, spending patterns, and price sensitivity (informational), and it has algorithms capable of using that information to optimize pricing in ways the rider cannot detect or counter (analytical). The governance concern is that these asymmetries compound: the more data an organization has, the better its analytical capabilities, and the better its analytics, the more effectively it can extract value from data subjects.


Chapter 6: Ethical Frameworks for the Data Age

A.1. Law and ethics do not perfectly overlap because each has a different basis, scope, and enforcement mechanism. Legal but unethical: a company collects extensive personal data through technically compliant consent forms designed to be unreadable -- it has satisfied the legal requirement but violated the ethical principle of informed consent. Ethical but illegal: a data protection officer in an authoritarian country leaks evidence that the government is using census data to target ethnic minorities -- the leak violates data protection law but serves an ethical obligation to prevent harm. Legal vacuum: a company uses AI to generate synthetic faces for deepfake advertising without the consent of the people whose faces were used in training data -- in many jurisdictions, no law specifically prohibits this, but it raises serious ethical questions about consent, dignity, and deception.

A.4. Kant's categorical imperative in the universalizability formulation: "Act only according to that maxim by which you can at the same time will that it should become a universal law." Applied to the A/B testing scenario: if we universalize the maxim "experiment on users without their knowledge to maximize engagement," we get a world where every service secretly experiments on every user, and no one can trust that any interface they use reflects the service's honest design rather than a manipulative experiment. This undermines the conditions of trust on which voluntary use depends -- the maxim is self-defeating. In the humanity formulation: "Act so that you treat humanity, whether in your own person or in that of another, always as an end and never merely as a means." The company treats users merely as means to the end of maximizing engagement metrics; the users are instruments of the company's optimization goals, and the experimental manipulation denies them the information they would need to exercise autonomous choice.


Chapter 7: What Is Privacy?

A.4. Nissenbaum's contextual integrity framework holds that privacy is violated when information flows in ways that breach the norms appropriate to a given social context. Every social context (healthcare, education, friendship, commerce) has its own informational norms, each defined by a set of parameters: (1) the context -- the social domain in which interaction occurs, (2) the actors -- the sender, receiver, and subject of information, (3) the attributes -- the type of information, and (4) the transmission principles -- the constraints on how information flows (e.g., with consent, as required by law, in confidence); an informational norm integrates these parameters into an expectation about how information should flow. A privacy violation occurs when an information flow departs from the norms governing that context. The strength of this framework is that it explains why the same information (e.g., a person's location) can be appropriate in one context (a friend asking "where are you?") and a violation in another (a data broker selling location history to advertisers).

B.1. Applying contextual integrity to the university health center app: (1) Context: healthcare, specifically student health services within a university. (2) Existing norms: health information shared with providers is governed by confidentiality norms; the transmission principle is "shared only for treatment purposes with authorized healthcare providers." (3) New practice: the app developer retains visit-reason data and sells aggregate statistics to pharmaceutical companies. (4) Comparison: the new practice violates contextual norms because it introduces a new actor (pharmaceutical companies) who is not part of the healthcare context, changes the transmission principle from "for treatment" to "for commercial marketing," and repurposes data collected under healthcare norms for a commercial context with different governing norms. (5) Evaluation: even though the data is aggregate, the flow violates contextual integrity because the pharmaceutical companies are not appropriate recipients of healthcare information, and the commercial marketing purpose is not an appropriate transmission principle for the healthcare context. The violation is not remedied by aggregation because the norms were breached at the point of transfer, not at the point of identification.


Chapter 8: Surveillance: From Panopticon to Platform

A.1. Bentham's panopticon was a circular prison in which a central watchtower could observe all cells, but prisoners could not see whether anyone was actually watching. The governance mechanism is uncertainty itself: because inmates do not know when they are being observed, they internalize the possibility of observation and regulate their own behavior. Foucault extended this insight beyond prisons to theorize disciplinary power -- power that operates not through physical coercion but through the production of self-regulating subjects. The relevance to data governance is that digital surveillance systems (workplace monitoring, social media tracking, smart city sensors) create a similar structure: users and citizens cannot see the surveillance apparatus, do not know when or how their data is being analyzed, and modify their behavior in response to the possibility of being watched. This chilling effect on behavior -- the suppression of dissent, experimentation, and nonconformity -- is a governance harm even when no individual act of surveillance causes direct injury.

A.5. Dataveillance, coined by Roger Clarke, refers to the systematic monitoring of people's activities through their data trails rather than through direct physical observation. Unlike traditional surveillance, dataveillance can be conducted at scale, retroactively, and without the subject's awareness. The chapter identifies three characteristics that make dataveillance qualitatively different from physical surveillance: (1) it is persistent -- data trails accumulate over time, creating a permanent record that can be analyzed retrospectively; (2) it is ambient -- it occurs as a byproduct of everyday activities like browsing, purchasing, and communicating, rather than requiring deliberate surveillance infrastructure; (3) it is combinable -- data from multiple sources can be merged to create profiles far more detailed than any single source could produce.


Chapter 9: Consent

A.1. The notice-and-consent model, as described in the chapter, consists of two elements: (1) the data collector provides notice (typically through a privacy policy) of what data will be collected, how it will be used, and with whom it will be shared; and (2) the data subject provides consent (typically by clicking "I agree" or continuing to use a service). The chapter identifies five structural failures of this model: consent fatigue (the volume of consent requests makes informed reading impossible), comprehension failure (privacy policies are written at a college reading level and run thousands of words), power asymmetry (the terms are non-negotiable), take-it-or-leave-it framing (users cannot modify terms, only accept or forgo the service), and the fiction of ongoing consent (a one-time click is treated as perpetual authorization for evolving uses). These failures mean that notice-and-consent produces the legal appearance of user autonomy without its substantive reality.

B.3. Dark patterns in consent interfaces violate all three conditions of meaningful consent: voluntariness, information, and competence. Voluntariness is undermined when opt-out mechanisms are hidden or when confirmshaming language ("No, I don't want to save money") pressures users into the option the company prefers. Information is undermined when privacy-relevant choices are buried deep in settings menus or when the consequences of each option are described in misleading language. Competence is undermined when interface design exploits cognitive biases -- for example, making the "accept all cookies" button large and colorful while making the "manage preferences" option small and gray. The GDPR's requirement for "freely given, specific, informed, and unambiguous" consent was designed to address these problems, but enforcement has been inconsistent, and many organizations have responded by creating more sophisticated consent theaters rather than genuinely empowering user choice.


Chapter 10: Privacy by Design and Data Minimization

A.1. Cavoukian's seven principles: (1) Proactive, not reactive -- anticipate and prevent privacy risks before they materialize, rather than responding after harm occurs. (2) Privacy as default -- if the user does nothing, privacy is protected; no action is required to secure privacy. (3) Privacy embedded into design -- privacy is built into the system architecture, not bolted on afterward. (4) Full functionality (positive-sum) -- privacy and other objectives (security, usability) can coexist without trade-offs. (5) End-to-end security -- data is protected throughout its entire lifecycle, from collection to deletion. (6) Visibility and transparency -- practices are open to scrutiny and verification. (7) Respect for user privacy -- the individual's interests are paramount. Each differs from a reactive approach because a reactive approach treats privacy as a constraint to be managed after the system is built, while PbD treats privacy as a design requirement that shapes the system from inception.

A.4. A dataset satisfies k-anonymity if every combination of quasi-identifiers (attributes that could be used for re-identification, such as age, zip code, gender) appears in at least k records. For example, 3-anonymity means every unique combination of quasi-identifiers matches at least three records, so an attacker who knows someone's quasi-identifiers cannot narrow the subject down to fewer than three people. 1-anonymity is effectively no anonymity because at least one combination of quasi-identifiers matches only a single record, making that individual uniquely identifiable -- this is the state of most raw datasets.

C.1. [PYTHON] k-Anonymity Checker:

import pandas as pd

def check_k_anonymity(df, quasi_identifiers):
    """
    Check the k-anonymity level of a DataFrame.

    Groups records by quasi-identifier columns and returns the
    size of the smallest group -- this is the k-value the dataset achieves.

    Args:
        df: pandas DataFrame containing the data
        quasi_identifiers: list of column names used as quasi-identifiers

    Returns:
        int: the minimum group size (k-value)
    """
    # Group by quasi-identifiers and count records in each group
    group_sizes = df.groupby(quasi_identifiers).size()
    # The k-value is the smallest group
    return group_sizes.min()

# Test dataset
data = {
    'age': [25, 25, 30, 30, 30, 35, 35, 35, 40, 40],
    'zipcode': ['10001', '10001', '10002', '10002', '10002',
                '10003', '10003', '10003', '10004', '10004'],
    'gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F'],
    'diagnosis': ['Flu', 'Cold', 'Flu', 'Diabetes', 'Cold',
                  'Flu', 'Flu', 'Cancer', 'Cold', 'Flu']
}
df = pd.DataFrame(data)

k = check_k_anonymity(df, ['age', 'zipcode', 'gender'])
print(f"k-anonymity level: {k}")
# Output: k-anonymity level: 1

The function returns 1: the dataset achieves only 1-anonymity for the quasi-identifiers ['age', 'zipcode', 'gender']. Most groups contain two or three records -- (25, 10001, M) and (40, 10004, F) each have 2, (30, 10002, F) has 3, and (35, 10003, M) has 2 -- but the group (35, 10003, F) contains a single record, making that individual uniquely identifiable. Because k-anonymity is determined by the smallest group, one singleton is enough to reduce the entire dataset to k = 1.
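
A short follow-up (a sketch, not part of the original exercise) lists the groups at the minimum size, which pinpoints the records responsible for the low k-value:

# Sketch: show which quasi-identifier combinations form the smallest groups
group_sizes = df.groupby(['age', 'zipcode', 'gender']).size()
print(group_sizes[group_sizes == group_sizes.min()])
# Shows the single-record group: age 35, zipcode 10003, gender F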

C.4. [PYTHON] Laplace Mechanism:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """
    Add Laplace noise to a true value for differential privacy.

    Args:
        true_value: The actual query result
        sensitivity: Maximum change from adding/removing one record
        epsilon: Privacy budget (smaller = more private)

    Returns:
        float: The noisy result
    """
    scale = sensitivity / epsilon
    noise = np.random.laplace(0, scale)
    return true_value + noise

# Simulation: hospital with 127 diabetic patients out of 1000
true_count = 127
sensitivity = 1  # counting query

for eps in [0.01, 0.1, 0.5, 1.0, 5.0]:
    results = [laplace_mechanism(true_count, sensitivity, eps)
               for _ in range(1000)]
    mean = np.mean(results)
    std = np.std(results)
    print(f"epsilon={eps:5.2f}  mean={mean:7.2f}  std={std:7.2f}")

# Expected output (approximate; exact values vary from run to run):
# epsilon= 0.01  mean= 130.21  std= 141.38
# epsilon= 0.10  mean= 127.44  std=  14.17
# epsilon= 0.50  mean= 127.08  std=   2.85
# epsilon= 1.00  mean= 126.97  std=   1.42
# epsilon= 5.00  mean= 127.00  std=   0.28

As epsilon decreases (stronger privacy), the noise scale (sensitivity/epsilon) grows, and with it the standard deviation of the Laplace noise, which is scale × √2. Individual query results therefore become less accurate. At epsilon=0.01 the standard deviation is roughly 140, meaning a single noisy answer could plausibly land anywhere from well below zero to well above 250, rendering it nearly useless on its own. At epsilon=5.0 the standard deviation is only about 0.3, providing very accurate answers but minimal privacy protection. This demonstrates the fundamental privacy-accuracy trade-off: there is no free lunch.


Chapter 11: The Economics of Privacy

A.1. A privacy externality occurs when the privacy costs of a transaction are borne by parties not involved in the transaction. When User A shares their contact list with a social media app, the privacy costs fall on Users B, C, and D -- whose names, phone numbers, and social connections have been disclosed without their consent. Because User A does not bear these costs, the market provides no mechanism to account for them. This is why the chapter argues that privacy has the structure of a public good: each individual's privacy is partly determined by others' decisions, and the aggregate privacy level cannot be set by individual market transactions.

A.5. The privacy paradox describes the observed gap between people's stated privacy preferences (surveys consistently show high concern about data privacy) and their actual behavior (people routinely disclose personal information for small benefits and rarely read privacy policies). The chapter presents three explanations: (1) rational ignorance -- the costs of reading and understanding privacy policies exceed the expected benefits, so rational actors skip them even if they care about privacy; (2) temporal discounting -- privacy costs are uncertain and future, while the benefits of service use are immediate and concrete; (3) structural constraints -- in many cases, opting out of data collection means opting out of essential services, so "choosing" to disclose data is not a genuine preference revelation but a response to a lack of alternatives.


Chapter 12: Health Data, Genetic Data, and Biometric Privacy

A.1. HIPAA (Health Insurance Portability and Accountability Act, 1996) establishes national standards for the protection of protected health information (PHI). The Privacy Rule applies to "covered entities" -- health plans, healthcare clearinghouses, and healthcare providers who transmit information electronically -- and their "business associates." A key limitation is that HIPAA's scope is defined by the entity type, not the data type: a fitness app that collects heart rate data is not a covered entity and is therefore not bound by HIPAA, even though the data it collects may be medically significant. This entity-based scope means that increasingly consequential health data falls outside HIPAA's protections as health monitoring shifts from clinical settings to consumer devices.

A.5. Biometric data is considered uniquely sensitive for three reasons identified in the chapter: (1) immutability -- unlike a password or credit card number, you cannot change your fingerprints, iris patterns, or facial geometry if they are compromised; a biometric data breach is permanent. (2) universality -- biometric identifiers are inherently linked to the physical person, making anonymization nearly impossible; unlike a pseudonymous ID, a fingerprint can always be matched back to its source. (3) ambient collection -- certain biometrics, particularly facial geometry, can be captured at a distance without the subject's awareness or cooperation, enabling surveillance without consent. The governance implication is that biometric data requires protections more stringent than those applied to other personal data categories.


Chapter 13: How Algorithms Shape Society

A.3. Algorithmic gatekeeping is the process by which algorithms determine what information reaches which audiences. Unlike a newspaper editor, who makes conscious decisions about newsworthiness based on professional judgment and can be held personally accountable, algorithmic gatekeepers are more powerful because: (1) they operate at scale, making billions of content decisions per day across millions of users; (2) they are personalized, showing different information to different users based on inferred preferences, which means no two people see the same reality. They are potentially less accountable because: (1) the decision-making criteria are proprietary and opaque; (2) no single person is responsible for any particular content decision; and (3) the platforms claim to be neutral intermediaries rather than editorial decision-makers, which has historically shielded them from the accountability applied to publishers.

A.7. "Our platform leverages advanced predictive analytics and personalization to deliver optimized, data-driven experiences." Decoded: "We use your behavioral data to predict what you will click on, show you content chosen to maximize the time you spend on our platform, and describe the result as a service to you rather than a mechanism for extracting your attention and selling it to advertisers."


Chapter 14: Bias in Data, Bias in Machines

A.2. (a) Representation bias -- African languages are underrepresented in training data. (b) Measurement bias -- the variable being measured (drug efficacy) was assessed on a non-representative sample. (c) Historical bias -- the training data encodes past discrimination in hiring decisions. (d) Measurement bias -- zip code is a proxy variable that measures socioeconomic status and race rather than the construct it purports to capture. (e) Evaluation bias -- the benchmark dataset's composition skews accuracy metrics to favor the majority group. (f) Deployment bias -- the model is used in a context different from the one in which it was developed and validated.

A.7. Selection rate for male applicants: 55%. Selection rate for female applicants: 38%. Disparate impact ratio = lower rate / higher rate = 38% / 55% = 0.691. The four-fifths threshold is 0.80. Because 0.691 < 0.80, the system triggers the four-fifths threshold, indicating potential disparate impact against female applicants.

C.1. [PYTHON] Selection Rate Calculator:

def calculate_selection_rates(predictions, groups):
    """
    Calculate selection rates per group.

    Args:
        predictions: list of 0s and 1s (selected or not)
        groups: list of group labels

    Returns:
        dict mapping group labels to selection rates
    """
    group_counts = {}
    group_selected = {}

    for pred, group in zip(predictions, groups):
        group_counts[group] = group_counts.get(group, 0) + 1
        group_selected[group] = group_selected.get(group, 0) + pred

    return {
        group: group_selected[group] / group_counts[group]
        for group in group_counts
    }

predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
groups = ["A"]*10 + ["B"]*10

rates = calculate_selection_rates(predictions, groups)
for group, rate in sorted(rates.items()):
    print(f"Group {group} selection rate: {rate:.2f}")
# Output:
# Group A selection rate: 0.60
# Group B selection rate: 0.40
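
A brief follow-up (a sketch, not part of the original exercise) applies the four-fifths rule from A.7 to the rates computed above:

# Four-fifths (80%) rule check, reusing the rates dict computed above
ratio = min(rates.values()) / max(rates.values())
print(f"Disparate impact ratio: {ratio:.2f}")
if ratio < 0.80:
    print("Potential disparate impact: ratio below the four-fifths threshold")
# Output:
# Disparate impact ratio: 0.67
# Potential disparate impact: ratio below the four-fifths threshold
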

Chapter 15: Fairness -- Definitions, Tensions, and Trade-offs

A.1. ProPublica argued COMPAS was unfair because Black defendants had a much higher false positive rate (being incorrectly predicted as high-risk) than white defendants (approximately 45% vs. 24%). They were using equalized odds -- the standard that error rates should be equal across groups. Northpointe argued COMPAS was fair because among defendants classified as high-risk, Black and white defendants reoffended at similar rates (approximately 63% for both). They were using calibration -- the standard that a given risk score should mean the same thing regardless of group. Both were mathematically correct: COMPAS was calibrated but violated equalized odds. The impossibility theorem explains why: when base rates differ (Black defendants had higher base rates due to systemic factors), calibration and equalized odds cannot be simultaneously satisfied.

A.5. The impossibility theorem, proved by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), demonstrates that when two groups have different base rates (different proportions of positive cases), no classifier can simultaneously satisfy calibration (equal PPV across groups) and equalized odds (equal FPR and FNR across groups), except in the degenerate case of perfect prediction. When base rates are equal between groups, the criteria can all be satisfied simultaneously, but that condition is rarely met in practice because the social phenomena these systems predict (recidivism, loan default, disease risk) have different base rates across demographic groups, precisely because of the structural inequalities that motivate fairness concerns in the first place.

C.1. [PYTHON] Demonstrating the Impossibility Theorem:

from dataclasses import dataclass

@dataclass
class FairnessCalculator:
    """Calculate and compare fairness metrics across groups."""
    predictions: list
    actuals: list
    groups: list

    def metrics_by_group(self) -> dict:
        results = {}
        for group in set(self.groups):
            idx = [i for i, g in enumerate(self.groups) if g == group]
            preds = [self.predictions[i] for i in idx]
            acts = [self.actuals[i] for i in idx]
            tp = sum(1 for p, a in zip(preds, acts) if p == 1 and a == 1)
            fp = sum(1 for p, a in zip(preds, acts) if p == 1 and a == 0)
            tn = sum(1 for p, a in zip(preds, acts) if p == 0 and a == 0)
            fn = sum(1 for p, a in zip(preds, acts) if p == 0 and a == 1)
            n = len(idx)
            results[group] = {
                'n': n, 'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn,
                'base_rate': sum(acts) / n,
                'selection_rate': sum(preds) / n,
                'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
                'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
                'ppv': tp / (tp + fp) if (tp + fp) > 0 else 0,
            }
        return results

    def report(self) -> str:
        m = self.metrics_by_group()
        lines = ["Fairness Report", "=" * 40]
        for g, v in sorted(m.items()):
            lines.append(f"\nGroup {g} (n={v['n']}):")
            lines.append(f"  Base rate:      {v['base_rate']:.3f}")
            lines.append(f"  Selection rate: {v['selection_rate']:.3f}")
            lines.append(f"  TPR:            {v['tpr']:.3f}")
            lines.append(f"  FPR:            {v['fpr']:.3f}")
            lines.append(f"  PPV:            {v['ppv']:.3f}")
        return "\n".join(lines)

# Scenario 1: Calibrated but violates equalized odds
# Group A: 200 people, 40% base rate (80 positive)
# Group B: 200 people, 20% base rate (40 positive)
# Calibrated: PPV ~ 0.67 for both groups
preds_a = [1]*60 + [0]*140  # 60 predicted positive
acts_a = [1]*40 + [0]*20 + [1]*40 + [0]*100  # 40 TP, 20 FP, 40 FN
preds_b = [1]*30 + [0]*170
acts_b = [1]*20 + [0]*10 + [1]*20 + [0]*150  # 20 TP, 10 FP, 20 FN

preds = preds_a + preds_b
acts = acts_a + acts_b
groups = ['A']*200 + ['B']*200

fc = FairnessCalculator(preds, acts, groups)
print("SCENARIO 1: Calibrated predictions")
print(fc.report())
# PPV for both groups ~ 0.67 (calibrated)
# TPR is equal: A = 40/80 = 0.50, B = 20/40 = 0.50
# FPR differs: A = 20/120 = 0.167, B = 10/160 = 0.063
# Equalized odds violated (FPR differs)

This demonstration shows that when base rates differ (40% vs. 20%), achieving calibration (equal PPV) forces unequal error rates, violating equalized odds. Adjusting predictions to equalize FPR and TPR would necessarily produce different PPV values across groups. The impossibility is mathematical, not a failure of algorithm design.
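
A complementary scenario (constructed here for illustration, reusing the FairnessCalculator defined above) equalizes the error rates instead and shows that calibration then breaks:

# Scenario 2: Equalized odds (TPR = 0.50 and FPR = 0.15 for both groups),
# constructed so that calibration is now violated
preds_a2 = [1]*58 + [0]*142
acts_a2 = [1]*40 + [0]*18 + [1]*40 + [0]*102   # 40 TP, 18 FP, 40 FN, 102 TN
preds_b2 = [1]*44 + [0]*156
acts_b2 = [1]*20 + [0]*24 + [1]*20 + [0]*136   # 20 TP, 24 FP, 20 FN, 136 TN

fc2 = FairnessCalculator(preds_a2 + preds_b2, acts_a2 + acts_b2,
                         ['A']*200 + ['B']*200)
print("SCENARIO 2: Equalized error rates")
print(fc2.report())
# TPR = 0.50 and FPR = 0.15 for both groups (equalized odds satisfied)
# But PPV differs: A = 40/58 ~ 0.69, B = 20/44 ~ 0.45 (calibration violated)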


Chapter 16: Transparency, Explainability, and the Black Box Problem

A.1. Transparency and explainability, though related, are distinct concepts. Transparency refers to the availability of information about a system -- its source code, training data, decision criteria, and performance metrics are accessible for examination. Explainability refers to the ability to provide a human-understandable account of why a specific decision was made. A system can be transparent but not explainable (e.g., a neural network whose source code is public but whose individual decisions cannot be traced to specific features), and a system can be explainable but not transparent (e.g., a proprietary system that provides LIME-style local explanations without revealing its full architecture). The governance implication is that demanding "transparency" without specifying which type -- and for which audience -- can result in transparency theater that satisfies legal requirements without enabling genuine understanding or accountability.


Chapter 17: Accountability and Audit

A.3. An algorithmic impact assessment (AIA) is a structured evaluation conducted before deployment to identify, evaluate, and mitigate the potential harms of an algorithmic system. The chapter identifies six components: (1) system description (what the system does and how), (2) stakeholder identification (who is affected, including non-users), (3) harm assessment (potential negative impacts, including disparate impacts on specific groups), (4) mitigation plan (steps to reduce identified harms), (5) monitoring plan (how the system will be evaluated post-deployment), and (6) public accountability (how findings will be communicated to affected parties and the public). AIAs differ from technical audits in that they focus on social impacts, not just system performance, and they require engagement with affected communities, not just technical evaluation.


Chapter 18: Generative AI: Ethics of Creation and Deception

A.1. The distinction between generative and discriminative AI is foundational. Discriminative AI classifies or predicts based on existing data (e.g., spam detection, risk scoring). Generative AI creates new content -- text, images, audio, video -- that resembles training data but is not a direct copy. The governance implications are distinct because generative AI raises questions that discriminative AI does not: who owns the output? Can generated content be used for deception (deepfakes)? Does the training process infringe on the rights of creators whose work was used as training data? The chapter argues that generative AI represents a qualitative shift in data governance challenges because it introduces creation as a governance concern, not just collection, storage, and analysis.

A.5. AI hallucination occurs when a generative AI system produces output that is confident, fluent, and plausible but factually incorrect. The governance concern is that hallucinations are indistinguishable from accurate outputs to users who lack independent verification capacity. In contexts like healthcare (where VitraMed's systems might generate clinical summaries), legal advice, or education, hallucinated information can cause direct harm. The chapter identifies three governance responses: (1) disclosure requirements (systems must identify themselves as AI-generated), (2) verification infrastructure (outputs must be checkable against authoritative sources), and (3) liability frameworks (someone must be accountable when hallucinated information causes harm).


Chapter 19: Autonomous Systems and Moral Machines

A.3. The Moral Machine experiment (MIT, Awad et al., 2018) presented respondents in 233 countries and territories with trolley-problem-style dilemmas involving autonomous vehicles and recorded their moral preferences. Key findings: (1) there was strong cross-cultural agreement on saving more lives over fewer; (2) there was significant cross-cultural variation on other dimensions -- respondents in individualist cultures preferred saving younger people, while respondents in collectivist cultures showed less age bias; (3) cultural, economic, and institutional variables (not just individual preferences) predicted moral judgments. The governance implication is that autonomous systems' moral parameters cannot simply be set to match "universal human values" because no such universal consensus exists on many relevant dimensions. This makes the process by which moral parameters are set -- who decides, through what deliberative mechanism -- as important as the parameters themselves.


Chapter 20: The Regulatory Landscape

A.3. (a) Command-and-control: the regulator specifies detailed rules and enforces compliance. The GDPR, with its specific requirements for consent, data protection officers, and breach notification timelines, exemplifies this. (b) Principles-based: the regulator articulates broad principles and allows organizations to determine how to implement them. The UK's data protection approach under the ICO has principles-based elements. (c) Co-regulation: the regulator sets the framework and industry develops implementation standards within it. Australia's Privacy Act, where industry codes are developed by sectors and approved by the regulator, is an example. (d) Self-regulation: industry sets and enforces its own standards without direct government oversight. The US advertising industry's self-regulatory approach through the Digital Advertising Alliance exemplifies this model.

A.7. An omnibus regulatory model applies a single, comprehensive data protection law to all sectors. The EU's GDPR is the paradigmatic example: one regulation governs healthcare, finance, retail, government, and every other sector. A sectoral model addresses data protection through separate laws for different industries or data types. The US exemplifies this: HIPAA covers health data, FERPA covers educational records, COPPA covers children's data, GLBA covers financial data, and the FTC Act provides a residual consumer protection authority. The US model is called "sectoral" because there is no single federal law that covers all personal data -- each sector has its own rules (or, in some sectors, no specific rules at all), creating a patchwork with gaps between sectors.


Chapter 21: The EU AI Act and Risk-Based Regulation

A.1. The EU AI Act classifies AI systems into four risk tiers: (1) Unacceptable risk -- prohibited outright (e.g., social scoring by governments, real-time remote biometric identification in public spaces with narrow exceptions). (2) High risk -- permitted but subject to stringent requirements (e.g., AI in critical infrastructure, education, employment, law enforcement, credit scoring). Requirements include risk management systems, data governance, transparency, human oversight, accuracy and robustness standards, and conformity assessments. (3) Limited risk -- subject to transparency obligations (e.g., chatbots must disclose they are AI; deepfakes must be labeled). (4) Minimal risk -- no specific requirements beyond existing law (e.g., AI-enabled video games, spam filters). The risk-based approach represents a departure from treating all AI systems the same, instead calibrating regulatory burden to the potential for harm.


Chapter 22: Data Governance Frameworks and Institutions

A.1. Data governance is the exercise of authority and control over data assets -- it sets policies, standards, and decision rights. Data management is the implementation of those policies through technical and operational activities. Example governance decision: "Patient data shall be classified as restricted, accessible only to authorized clinical personnel, and retained for seven years after the last clinical visit." Example management activity: configuring database access controls, implementing backup procedures, and writing the scripts that automatically archive records at the seven-year mark. Governance decides what should happen; management executes how it happens.

A.5. The six data quality dimensions: (1) Accuracy -- data correctly represents the real-world entity it describes (problem: a patient's allergy list is missing a known allergy). (2) Completeness -- all required data is present (problem: 15% of customer records lack email addresses). (3) Consistency -- data does not contradict itself across systems (problem: a customer's address differs between the billing and shipping databases). (4) Timeliness -- data is current enough for its intended use (problem: a risk model uses employment data that is two years old). (5) Validity -- data conforms to defined formats and business rules (problem: a date-of-birth field contains the value "13/32/1990"). (6) Uniqueness -- each entity is represented once (problem: duplicate patient records cause a hospital to maintain two separate medication lists for the same person, risking drug interactions).
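
Although the chapter's coding exercises are not reproduced here, a minimal pandas sketch (the column names and values below are assumptions for this example) shows how some of these dimensions can be checked programmatically:

import pandas as pd

# Illustrative patient records; columns and values are assumed for this sketch
records = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P002', 'P003'],
    'email':      ['a@example.org', None, None, 'c@example.org'],
    'dob':        ['1990-04-12', '1985-11-30', '1985-11-30', '13/32/1990'],
})

# Completeness: share of non-missing values per column
print(records.notna().mean())

# Uniqueness: duplicated identifiers suggest the same entity stored twice
print("Duplicate patient IDs:", records['patient_id'].duplicated().sum())

# Validity: dates of birth must parse against the expected ISO format
valid_dob = pd.to_datetime(records['dob'], format='%Y-%m-%d', errors='coerce').notna()
print("Invalid dates of birth:", (~valid_dob).sum())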


Chapter 23: Cross-Border Data Flows and Digital Sovereignty

A.1. An adequacy decision is a determination by one jurisdiction (typically the EU) that another country's data protection laws provide an "adequate" level of protection comparable to its own. When a country receives an adequacy decision, personal data can flow freely to that country without additional safeguards. In the absence of adequacy, organizations must use alternative transfer mechanisms such as Standard Contractual Clauses (SCCs), Binding Corporate Rules (BCRs), or specific derogations. The significance is that adequacy decisions function as a form of regulatory extraterritoriality -- they give the EU leverage to influence data protection standards globally, because countries seeking trade access to the EU market have economic incentives to adopt compatible protections.


Chapter 24: Sector-Specific Governance

A.3. FERPA (Family Educational Rights and Privacy Act) governs student educational records in institutions that receive federal funding. Under FERPA, educational institutions must: (1) provide parents (or eligible students over 18) access to educational records, (2) provide the opportunity to request corrections, (3) obtain consent before disclosing personally identifiable information from education records, with specific exceptions (e.g., school officials with legitimate educational interest, transfer to other schools, health and safety emergencies). COPPA (Children's Online Privacy Protection Act) governs the collection of personal information from children under 13 by commercial websites and online services. COPPA requires verifiable parental consent before collection and gives parents the right to review and delete their children's data. The key distinction is that FERPA governs educational institutions, while COPPA governs commercial services -- and neither fully addresses the gap created by educational technology companies that operate in both spaces.


Chapter 25: Enforcement, Compliance, and the Limits of Law

A.1. Regulatory capture occurs when a regulatory agency, created to act in the public interest, becomes dominated by the interests of the industry it is supposed to regulate. In data governance, the chapter identifies three capture mechanisms: (1) revolving door -- regulators move to industry positions and vice versa, creating alignment of perspectives; (2) information asymmetry -- regulated companies have far more technical knowledge about their own systems than regulators, forcing regulators to rely on industry-provided information; (3) resource asymmetry -- technology companies' legal and lobbying budgets dwarf regulatory enforcement budgets. The governance implication is that legal frameworks, however well-designed, are only as effective as the institutions that enforce them.


Chapter 26: Building a Data Ethics Program

A.1. A data ethics program, as distinguished from a compliance program, goes beyond legal requirements to address the question: "Is this the right thing to do?" The chapter identifies four differences: (1) scope -- compliance asks "is this legal?" while ethics asks "is this right?" (2) orientation -- compliance is reactive (responding to existing rules) while ethics is proactive (anticipating emerging concerns); (3) culture -- compliance can be a checkbox exercise while ethics requires organizational culture change; (4) stakeholders -- compliance focuses on the organization's legal risk while ethics centers the interests of affected communities. The VitraMed example illustrates the distinction: VitraMed could be fully HIPAA-compliant while still deploying a patient risk model that systematically disadvantages minority patients -- legal compliance does not guarantee ethical conduct.


Chapter 27: Data Stewardship and the Chief Data Officer

A.3. Data owner: the individual or role accountable for a data asset's governance policies. In a hospital, the Chief Medical Officer might be the data owner for patient clinical records -- responsible for defining access policies, classification levels, and retention rules. Data steward: the individual responsible for day-to-day governance of a data domain, ensuring policies are followed and data quality is maintained. A clinical data steward might review data quality reports, resolve data disputes, and ensure that access requests follow established protocols. Data custodian: the technical role responsible for the infrastructure on which data is stored and processed. The database administrator who manages the servers, runs backups, and implements access controls is a data custodian. The key distinction: the owner sets policy, the steward ensures policy is followed, and the custodian provides the technical infrastructure.

C.1. [PYTHON] DataLineageTracker Usage:

from dataclasses import dataclass, field
from datetime import datetime, date
from typing import Optional

@dataclass
class TransformationRecord:
    operation: str
    performed_by: str
    timestamp: datetime
    description: str
    rows_before: int
    rows_after: int

@dataclass
class AccessRecord:
    user: str
    access_type: str  # read, write, export, delete
    timestamp: datetime
    approved: bool
    purpose: str

@dataclass
class DataLineageTracker:
    asset_name: str
    source: str
    classification: str
    storage_location: str
    retention_policy: str
    retention_expiry: Optional[date] = None
    transformations: list = field(default_factory=list)
    access_log: list = field(default_factory=list)

    def add_transformation(self, record: TransformationRecord):
        self.transformations.append(record)

    def log_access(self, record: AccessRecord):
        self.access_log.append(record)

    def generate_report(self) -> str:
        lines = [
            f"Data Lineage Report: {self.asset_name}",
            f"Source: {self.source}",
            f"Classification: {self.classification}",
            f"Storage: {self.storage_location}",
            f"Retention: {self.retention_policy}",
            f"Expiry: {self.retention_expiry}",
            f"\nTransformations ({len(self.transformations)}):"
        ]
        for t in self.transformations:
            lines.append(f"  - {t.operation} by {t.performed_by} "
                        f"({t.rows_before} -> {t.rows_after} rows)")
        lines.append(f"\nAccess Log ({len(self.access_log)} entries):")
        for a in self.access_log:
            status = "APPROVED" if a.approved else "UNAPPROVED"
            lines.append(f"  - {a.access_type} by {a.user} [{status}]")
        return "\n".join(lines)

# Create tracker for VitraMed patient demographics
tracker = DataLineageTracker(
    asset_name="Patient Demographics",
    source="Clinic intake forms",
    classification="Restricted",
    storage_location="vitramed-prod-db-east",
    retention_policy="7 years from last clinical visit",
    retention_expiry=date(2032, 12, 31)
)

tracker.add_transformation(TransformationRecord(
    operation="De-identification",
    performed_by="Dr. Khoury",
    timestamp=datetime(2025, 6, 15, 10, 30),
    description="Removed direct identifiers for research dataset",
    rows_before=12500, rows_after=12500
))

tracker.add_transformation(TransformationRecord(
    operation="Aggregation by clinic",
    performed_by="Mira Chakravarti",
    timestamp=datetime(2025, 7, 1, 14, 0),
    description="Aggregated to clinic-level summary statistics",
    rows_before=12500, rows_after=47
))

tracker.log_access(AccessRecord(
    user="Dr. Adeyemi", access_type="read",
    timestamp=datetime(2025, 7, 5, 9, 0),
    approved=True, purpose="Research review"
))

print(tracker.generate_report())
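
A small, hypothetical extension (not part of the exercise) shows how the retention metadata can drive a governance check, using the tracker created above:

def retention_lapsed(tracker: DataLineageTracker, as_of: date) -> bool:
    """Return True if the asset's retention expiry date has passed."""
    return tracker.retention_expiry is not None and as_of > tracker.retention_expiry

print(retention_lapsed(tracker, date(2033, 1, 1)))   # True: past the 2032-12-31 expiry
print(retention_lapsed(tracker, date(2030, 1, 1)))   # False: still within retention
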

Chapter 28: Privacy Impact Assessments and Ethical Reviews

A.1. A Privacy Impact Assessment (PIA) is a systematic process for evaluating how a proposed project, system, or initiative collects, uses, and manages personal information. The chapter identifies six stages: (1) project description -- what the system does and what data it processes; (2) data flow mapping -- where data comes from, where it goes, and who has access; (3) privacy risk identification -- what could go wrong, including unauthorized access, function creep, re-identification, and disproportionate impact; (4) risk mitigation -- controls and safeguards to reduce identified risks; (5) stakeholder consultation -- input from affected parties; (6) documentation and review -- a written record subject to periodic reassessment. A PIA differs from a security assessment in that security focuses on preventing unauthorized access, while a PIA asks whether the authorized uses of data are themselves appropriate.


Chapter 29: Responsible AI Development

A.2. A model card (Mitchell et al., 2019) is a standardized documentation artifact that accompanies a machine learning model. Its intended audiences include: (1) developers who need to understand the model's capabilities and limitations for integration; (2) policymakers and regulators who need to assess compliance and risk; (3) affected communities who need to understand how the model works and whether it is fair; and (4) procurement officers who evaluate models for organizational adoption. Each audience needs different information: developers need technical specifications, regulators need performance disaggregated by protected groups, affected communities need plain-language explanations of impact, and procurement officers need information about intended and out-of-scope uses.

C.1. [PYTHON] ModelCard for Student Dropout Prediction:

from dataclasses import dataclass

@dataclass
class ModelCard:
    model_name: str
    version: str
    description: str
    intended_use: str
    out_of_scope_uses: list
    training_data_summary: str
    evaluation_metrics: dict
    disaggregated_metrics: dict
    ethical_considerations: list
    limitations: list
    last_updated: str

    def generate_report(self) -> str:
        lines = [
            f"MODEL CARD: {self.model_name} (v{self.version})",
            f"Last updated: {self.last_updated}",
            f"\nDescription: {self.description}",
            f"\nIntended Use: {self.intended_use}",
            "\nOut-of-Scope Uses:"
        ]
        for u in self.out_of_scope_uses:
            lines.append(f"  - {u}")
        lines.append(f"\nTraining Data: {self.training_data_summary}")
        lines.append("\nOverall Metrics:")
        for k, v in self.evaluation_metrics.items():
            lines.append(f"  {k}: {v}")
        lines.append("\nDisaggregated Metrics:")
        for group, metrics in self.disaggregated_metrics.items():
            lines.append(f"  {group}:")
            for k, v in metrics.items():
                lines.append(f"    {k}: {v}")
        lines.append("\nEthical Considerations:")
        for c in self.ethical_considerations:
            lines.append(f"  - {c}")
        lines.append("\nLimitations:")
        for l in self.limitations:
            lines.append(f"  - {l}")
        return "\n".join(lines)

card = ModelCard(
    model_name="Student Retention Risk Predictor",
    version="1.2",
    description="Logistic regression model predicting first-year "
                "student non-return for second year.",
    intended_use="Flag at-risk students for proactive advising.",
    out_of_scope_uses=[
        "Admissions decisions",
        "Financial aid allocation",
        "Academic probation determinations"
    ],
    training_data_summary="5 cohorts (2019-2023), n=12,450, "
                          "from a large public university.",
    evaluation_metrics={"AUC": 0.78, "Accuracy": 0.73, "F1": 0.65},
    disaggregated_metrics={
        "White students": {"AUC": 0.80, "FPR": 0.18},
        "Black students": {"AUC": 0.72, "FPR": 0.28},
        "Hispanic students": {"AUC": 0.74, "FPR": 0.25},
    },
    ethical_considerations=[
        "Higher FPR for Black students may lead to over-referral.",
        "Financial aid status is a proxy for socioeconomic status.",
        "LMS engagement may reflect digital access, not motivation."
    ],
    limitations=[
        "Trained on one institution; may not generalize.",
        "Does not account for transfer students.",
        "Performance degrades for students over age 25."
    ],
    last_updated="2025-09-01"
)
print(card.generate_report())
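
As a usage note, a short sketch (using the card object above; the 0.05 threshold is an assumption chosen for illustration) scans the disaggregated metrics for error-rate gaps that warrant review before deployment:

# Sketch: flag groups whose FPR exceeds the lowest observed FPR by more than 0.05
fprs = {group: m["FPR"] for group, m in card.disaggregated_metrics.items()}
baseline = min(fprs.values())
for group, fpr in fprs.items():
    if fpr - baseline > 0.05:
        print(f"Review needed: {group} FPR {fpr:.2f} vs. baseline {baseline:.2f}")
# Flags Black students (0.28) and Hispanic students (0.25) against the 0.18 baseline
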

Chapter 30: When Things Go Wrong

A.1. The chapter distinguishes between a data breach (unauthorized access to data) and a data incident (a broader category including unauthorized access, accidental disclosure, data loss, and system failures that compromise data integrity). Not every incident is a breach, but every breach is an incident. The governance distinction matters because breach notification requirements (under GDPR, CCPA, and other laws) are triggered by specific conditions -- typically involving unauthorized access to personal data that poses a risk to individuals' rights and freedoms. An incident that does not involve personal data, or that is contained before any data is accessed, may not trigger notification requirements but may still require internal response and remediation.


Chapter 31: Misinformation, Disinformation, and Platform Governance

A.1. (a) Malinformation -- the email is authentic but is shared with the intent to harm and is decontextualized. (b) Misinformation -- the content is false but shared without intent to deceive. (c) Disinformation -- fabricated content created and disseminated with deliberate intent to deceive. (d) Misinformation -- the content was believed to be true when published; the journalist did not intend to deceive. (e) Malinformation -- the statistics are real but are selectively presented to create a misleading impression.


Chapter 32: Digital Divide, Data Justice, and Equity

A.1. The digital divide has three levels as described in the chapter: (1) the access divide -- differential access to digital infrastructure (internet connectivity, devices); (2) the skills divide -- differential ability to use digital tools effectively; (3) the outcomes divide -- differential ability to benefit from digital participation. The chapter argues that the third level is most significant for data governance because even when access and skills gaps are closed, structural inequalities in how data systems treat different populations can reproduce disadvantage. A community might have internet access and digital literacy but still be disadvantaged by algorithms that use their zip code as a proxy for risk.


Chapter 33: Labor, Automation, and the Gig Economy

A.1. Algorithmic management refers to the use of automated systems to direct, monitor, evaluate, and discipline workers. The chapter identifies five components: (1) task allocation -- algorithms assign work without human managers; (2) performance evaluation -- algorithms rate workers based on quantified metrics; (3) dynamic pricing -- algorithms set pay rates in real time; (4) automated discipline -- algorithms deactivate or penalize workers without human review; (5) information asymmetry -- the algorithm's criteria are opaque to workers. The governance concern is that algorithmic management combines the scale and speed of automation with the opacity of proprietary systems, creating a workplace where workers are managed by a system they cannot see, understand, or appeal.


Chapter 34: Environmental Data Ethics and Climate

A.2. PUE = Total facility energy / IT equipment energy. A PUE of 1.58 means the data center uses 58% more energy than its IT equipment alone consumes -- the excess goes to cooling, lighting, and other overhead. A PUE of 1.10 means only 10% overhead. For a workload requiring 100 kWh of compute energy: at PUE 1.58, total energy = 158 kWh; at PUE 1.10, total energy = 110 kWh. The PUE 1.58 facility uses 158/110 ≈ 1.44 times as much total energy, about 44% more, for the same computation. Across millions of GPU-hours, this difference translates to thousands of tonnes of additional carbon emissions and millions of liters of additional water consumption.
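
To make the arithmetic concrete, here is a minimal sketch; the 100 kWh workload and the two PUE values come from the answer above, and the function name is illustrative rather than part of any exercise code:

def total_facility_kwh(it_energy_kwh: float, pue: float) -> float:
    """Total facility energy is IT equipment energy multiplied by PUE."""
    return it_energy_kwh * pue

it_load = 100  # kWh consumed by the IT equipment for the workload
for pue in (1.58, 1.10):
    print(f"PUE {pue}: {total_facility_kwh(it_load, pue):.0f} kWh total")

ratio = total_facility_kwh(it_load, 1.58) / total_facility_kwh(it_load, 1.10)
print(f"PUE 1.58 uses {ratio:.2f}x the total energy of PUE 1.10")  # ~1.44x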

C.1. [PYTHON] CarbonEstimator Basic Usage:

from dataclasses import dataclass

@dataclass
class CarbonEstimator:
    """Estimate carbon emissions from AI model training."""
    gpu_type: str
    num_gpus: int
    training_hours: float
    cloud_region: str
    pue: float = 1.1

    GPU_POWER = {  # watts per GPU
        'A100': 250, 'H100': 350, 'V100': 300, 'T4': 70
    }
    CARBON_INTENSITY = {  # gCO2 per kWh
        'us-east': 380, 'us-west': 210, 'canada-central': 30,
        'eu-west': 270, 'eu-north': 50, 'asia-southeast': 490,
        'india-central': 700
    }
    FLIGHT_KG = 900  # kg CO2 per transatlantic flight

    def total_energy_kwh(self) -> float:
        power_w = self.GPU_POWER[self.gpu_type]
        return (power_w * self.num_gpus * self.training_hours
                * self.pue / 1000)

    def total_carbon_kg(self) -> float:
        intensity = self.CARBON_INTENSITY[self.cloud_region]
        return self.total_energy_kwh() * intensity / 1000

    def flights_equivalent(self) -> float:
        return self.total_carbon_kg() / self.FLIGHT_KG

# (a) 4 A100s, 48h, us-east
est_a = CarbonEstimator('A100', 4, 48, 'us-east')
print(f"(a) Energy: {est_a.total_energy_kwh():.1f} kWh, "
      f"Carbon: {est_a.total_carbon_kg():.1f} kg, "
      f"Flights: {est_a.flights_equivalent():.2f}")

# (b) 4 A100s, 48h, canada-central
est_b = CarbonEstimator('A100', 4, 48, 'canada-central')
print(f"(b) Energy: {est_b.total_energy_kwh():.1f} kWh, "
      f"Carbon: {est_b.total_carbon_kg():.1f} kg, "
      f"Flights: {est_b.flights_equivalent():.2f}")

# (c) 4 H100s, 24h, us-east
est_c = CarbonEstimator('H100', 4, 24, 'us-east')
print(f"(c) Energy: {est_c.total_energy_kwh():.1f} kWh, "
      f"Carbon: {est_c.total_carbon_kg():.1f} kg, "
      f"Flights: {est_c.flights_equivalent():.2f}")

Region choice dramatically affects carbon emissions: the same training run in Canada (hydroelectric grid, ~30 gCO2/kWh) produces roughly 1/13th the carbon of training in the US East (fossil-heavy grid, ~380 gCO2/kWh). This makes data center location a governance decision with direct environmental consequences.
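Using the estimator above, this ratio can be checked directly (a continuation of the example, not part of the original listing):

print(f"Region ratio: {est_a.total_carbon_kg() / est_b.total_carbon_kg():.1f}x")  # ~12.7x, i.e. roughly 1/13th in Canada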


Chapter 35: Children, Teens, and Digital Vulnerability

A.1. COPPA applies to commercial websites and online services directed at children under 13, or that knowingly collect personal information from children under 13. The Act requires operators to: post a clear privacy policy, provide direct notice to parents, obtain verifiable parental consent before collecting data, give parents the right to review and delete their child's data, and refrain from conditioning participation on unnecessary data collection. The chapter identifies two significant limitations: (1) COPPA's age threshold of 13 does not protect teenagers, who face many of the same vulnerabilities; (2) COPPA's "directed at children" standard allows platforms to disclaim applicability by technically targeting a general audience, even when they know children use the service extensively.


Chapter 36: National Security, Intelligence, and Democratic Oversight

A.3. The Five Eyes alliance (US, UK, Canada, Australia, New Zealand) is a multilateral intelligence-sharing arrangement. Its governance significance for data protection is that it enables member states to circumvent domestic surveillance restrictions by having partner states collect data on their citizens and share it. For example, if Country A's law prohibits it from surveilling its own citizens without a warrant, Country B can collect that data and share it under the alliance agreement, potentially bypassing Country A's domestic legal protections. This creates what the chapter calls a "sovereignty loophole" in data protection law.


Chapter 37: Global South Perspectives on Data Governance

A.1. Data colonialism, as described in the chapter, refers to the extraction of data from Global South populations by Global North corporations and institutions in ways that replicate colonial power dynamics. The chapter identifies three structural parallels: (1) resource extraction -- data is collected from communities in the Global South and processed in the Global North, with value captured by Northern companies; (2) infrastructure dependency -- Global South countries depend on Northern-owned digital infrastructure (cloud services, platforms, undersea cables); (3) epistemic dominance -- governance frameworks developed in the Global North are imposed on Global South contexts without adaptation to local values, needs, or institutional capacities.


Chapter 38: Emerging Technologies and Anticipatory Governance

A.1. Anticipatory governance is a governance approach that seeks to identify and address the social, ethical, and political implications of emerging technologies before they are widely deployed, rather than waiting for harms to materialize. The chapter distinguishes it from reactive governance (responding to demonstrated harms) and precautionary governance (restricting technologies until safety is proven). Anticipatory governance attempts a middle path: it does not ban technologies but invests in foresight, scenario planning, and adaptive regulation so that governance structures are prepared to respond quickly as technologies develop. The chapter identifies three key tools: horizon scanning, scenario planning, and regulatory sandboxes.


Chapter 39: Designing Data Futures

A.2. Data cooperatives: organizations owned and governed by their members, who collectively control how their data is collected, used, and shared. Governance authority rests with the membership through democratic processes (one member, one vote). Data trusts: legal structures where a trustee holds and manages data on behalf of beneficiaries, with a fiduciary obligation to act in beneficiaries' interests. Governance authority rests with the trustee, constrained by fiduciary duty. Data commons: shared data resources governed by community rules rather than individual ownership or market mechanisms. Governance authority is distributed among community members through shared norms and protocols. The key structural difference: cooperatives vest authority in democratic membership, trusts vest it in a fiduciary, and commons vest it in community norms. Cooperatives work best when members share clear interests; trusts work best when beneficiaries need expert management; commons work best for public goods where broad access serves collective benefit.

C.1. [PYTHON] GovernanceSimulator Analysis:

(a) Typically, the corporate centralized model produces the highest total benefit (because it has no coordination costs), while the cooperative democratic model produces the lowest Gini coefficient (most equal distribution). (b) Under corporate centralized, the stakeholder with the highest political influence benefits most, and the stakeholder with the lowest political influence benefits least, because the model distributes benefits in proportion to influence. (c) The cooperative model typically results in the fewest stakeholders below threshold, because its more equal distribution ensures that low-influence stakeholders receive a larger share. (d) The cooperative model provides the most equal privacy protection, while the regulatory model may provide the highest average privacy protection because it enforces minimum standards.
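
For reference, the Gini coefficient mentioned in (a) can be computed directly from a list of stakeholder benefits. The sketch below illustrates the metric itself; the function name and the sample benefit values are illustrative assumptions, not the book's GovernanceSimulator API:

def gini(benefits: list[float]) -> float:
    """Gini coefficient of a benefit distribution: 0 = perfectly equal; values near 1 = highly concentrated."""
    values = sorted(benefits)
    n = len(values)
    total = sum(values)
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical benefit shares for four stakeholders under two governance models
print(gini([25, 25, 25, 25]))  # 0.0 -- perfectly equal (cooperative-like)
print(gini([70, 15, 10, 5]))   # 0.5 -- concentrated (centralized-like)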


Chapter 40: Your Responsibility -- From Knowledge to Action

A.1. The five archetypes: (1) Naive Technicist (early Mira) -- believes technology is neutral and data-driven decisions are inherently objective. Strength: technical competence. Limitation: blind to power dynamics and structural bias. (2) Righteous Critic (early Eli) -- focuses on systemic injustice and corporate malfeasance. Strength: identifies harm. Limitation: can become paralyzed by anger or dismiss all engagement as complicity. (3) Pragmatic Insider (Ray Zhao) -- works within institutions to make incremental improvements. Strength: understands organizational constraints. Limitation: may normalize harmful practices by treating them as inevitable. (4) Strategic Advocate (Sofia Reyes, and Eli by Chapter 40) -- combines critical analysis with strategic engagement in policy and institutional reform. Strength: translates critique into action. Limitation: may overestimate the power of advocacy against entrenched interests. (5) Principled Practitioner (Mira by Chapter 40) -- integrates technical skill, ethical reasoning, and institutional awareness. Strength: can design systems that embody values. Limitation: individual principled practice cannot, alone, transform unjust structures.

A.4. The most important provision is arguably "I will consider the interests of those who cannot advocate for themselves" -- because the most serious data harms fall on populations with the least power to resist or seek remedy, and principled practice must center their interests. The most difficult provision to practice consistently is likely "I will speak up when I see data practices that cause harm, even when silence would be easier" -- because organizational pressures, career consequences, and the difficulty of whistleblowing create powerful incentives for silence, and the courage required to speak up is not a one-time act but a repeated, draining commitment.