Appendix F: Frequently Asked Questions

This appendix addresses 40 questions that students commonly ask during and after a data ethics course. Questions are organized by topic, and each answer includes a reference to the chapter(s) where the topic is discussed in depth.


Data Basics

Q1. What is the difference between data, information, and knowledge?

A: Data consists of raw, unprocessed facts -- numbers, text, images -- that have no meaning on their own. Information is data that has been organized, structured, or contextualized to convey meaning (e.g., "the average temperature was 72 degrees"). Knowledge is information that has been interpreted, analyzed, and integrated with experience to support understanding and decision-making (e.g., "because the average temperature is rising, crop yields will decline"). Data becomes information through processing; information becomes knowledge through interpretation. The governance implications differ at each level. (Chapter 1, Section 1.1.2)

Q2. Why does metadata matter? Is it really as sensitive as content?

A: Metadata -- data about data, such as who called whom, when, for how long, from where -- can be extraordinarily revealing. The MetaPhone study demonstrated that phone call metadata alone could accurately infer medical conditions, religious affiliations, and political activities. An entire year of email metadata (sender, recipient, time, subject line) can map your professional relationships, personal life, sleep schedule, and health concerns without anyone reading a single message. In many cases, metadata is more consequential than content because it is collected more extensively, retained longer, and regulated less stringently. (Chapter 1, Section 1.1.3)

Q3. What is the difference between anonymization and pseudonymization?

A: Anonymization is the irreversible removal of all information that could identify an individual, either directly or through combination with other data. Once truly anonymized, data is no longer personal data and falls outside data protection regulation. Pseudonymization replaces direct identifiers (like names) with artificial identifiers (pseudonyms) while retaining the ability to re-link the data to individuals using a separately stored key. Pseudonymized data remains personal data under the GDPR because re-identification is possible. The critical insight from the research literature (Sweeney, Narayanan, de Montjoye) is that true anonymization is far harder to achieve than most organizations realize -- quasi-identifiers and behavioral patterns often enable re-identification. (Chapter 10, Sections 10.3-10.4)

Q4. What is a quasi-identifier, and why does it matter?

A: A quasi-identifier is an attribute that is not directly identifying on its own but can be combined with other quasi-identifiers to uniquely identify an individual. Common quasi-identifiers include age, zip code, gender, occupation, and date of birth. Sweeney's research showed that 87% of the US population can be uniquely identified from just three quasi-identifiers: zip code, date of birth, and gender. Quasi-identifiers matter because organizations that remove names and Social Security numbers often leave quasi-identifiers intact, creating a false sense of anonymity. (Chapter 10, Section 10.4)
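The re-identification risk described above is easy to demonstrate in a few lines of Python. The records below are invented for illustration; even with names removed, counting how many records share each (zip, date of birth, sex) combination shows how many individuals are singled out:

```python
# Toy illustration of quasi-identifier uniqueness: even with names removed,
# the combination (zip code, birth date, sex) singles out most records.
# All records are invented for illustration.
from collections import Counter

records = [
    {"zip": "02138", "dob": "1961-07-28", "sex": "F"},
    {"zip": "02138", "dob": "1961-07-28", "sex": "M"},
    {"zip": "02139", "dob": "1984-03-02", "sex": "F"},
    {"zip": "02139", "dob": "1984-03-02", "sex": "F"},
    {"zip": "02141", "dob": "1990-11-15", "sex": "M"},
]

combos = Counter((r["zip"], r["dob"], r["sex"]) for r in records)
unique = [c for c, n in combos.items() if n == 1]
print(f"{len(unique)} of {len(records)} records are uniquely identified "
      "by quasi-identifiers alone")  # 3 of 5
```

Scaled to real populations, this is exactly the counting exercise behind Sweeney's 87% figure.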


Privacy

Q5. What is contextual integrity, and how does it differ from the consent model?

A: Contextual integrity, developed by Helen Nissenbaum, holds that privacy is violated when information flows in ways that breach the norms appropriate to a given social context. Each social context (healthcare, education, friendship, commerce) has its own norms governing what information is appropriate to share, with whom, and under what conditions. Unlike the consent model, which treats privacy as a binary (you consented or you did not), contextual integrity evaluates whether a specific data practice respects the norms governing the relevant context. A patient's medical information shared with their doctor respects healthcare norms; the same information shared with an advertiser violates them, regardless of whether the patient clicked "I agree" on a terms-of-service page. (Chapter 7, Section 7.3)

Q6. Is the "nothing to hide" argument valid?

A: The "nothing to hide" argument -- "I don't mind surveillance because I have nothing to hide" -- is addressed at length in Chapter 7 (Section 7.4). The argument fails for several reasons: (1) Everyone has something to hide -- not because they are doing something wrong, but because privacy is necessary for autonomy, dignity, and the freedom to experiment, fail, and change. (2) The argument conflates privacy with secrecy; privacy is about controlling the context of disclosure, not concealing wrongdoing. (3) It assumes current power holders will remain benign -- but information collected today can be used by future governments or institutions with different values. (4) It ignores collective harms: even if you personally are unharmed, mass surveillance chills dissent, protest, and democratic participation for everyone. (5) It places the burden on the wrong party: in a just society, those who seek to surveil must justify their intrusion, not those who seek to be left alone.

Q7. What is the privacy paradox?

A: The privacy paradox describes the gap between people's stated privacy preferences (surveys show high concern about data privacy) and their actual behavior (people routinely disclose personal information for small benefits and rarely exercise their privacy rights). Research suggests this gap is explained not by hypocrisy but by: bounded rationality (privacy trade-offs are too complex to evaluate), temporal discounting (privacy costs are future and uncertain; service benefits are immediate), structural constraints (opting out of data collection often means opting out of essential services), and design manipulation (dark patterns make privacy-protective choices harder). The paradox is less a puzzle about individual behavior and more an indictment of a governance model that relies on individuals to protect themselves against well-resourced organizations. (Chapter 11, Section 11.3)

Q8. Does a VPN make me anonymous?

A: No. A VPN encrypts your internet traffic between your device and the VPN server and hides your IP address from the websites you visit. This prevents your internet service provider from seeing your browsing activity and prevents websites from seeing your real IP address. However, a VPN does not protect against: browser fingerprinting, cookies and tracking scripts, account-based tracking (if you log into Google or Facebook, the VPN is irrelevant), DNS leaks, or surveillance at the VPN provider itself. A VPN shifts trust from your ISP to the VPN provider -- if the provider logs your activity, you have gained nothing. A VPN is one layer of protection, not a privacy solution. (Appendix E, Section E.1.4)


Algorithms and AI

Q9. What does it mean to say an algorithm is "biased"?

A: Algorithmic bias, as defined in Chapter 14, is a systematic and unjustified pattern in an algorithm's outputs that disadvantages certain groups. "Systematic" means the pattern is not random but consistent. "Unjustified" distinguishes bias from legitimate differences -- an insurance algorithm that charges higher premiums to riskier drivers is discriminating based on relevant risk factors, not exhibiting bias in the technical sense. The chapter identifies six sources of bias: historical (training data reflects past discrimination), representation (certain groups are underrepresented in training data), measurement (the variables used are poor proxies for what they claim to measure), aggregation (a single model applied to diverse populations), evaluation (benchmarks do not represent all affected groups), and deployment (a model is used in a context different from the one it was designed for). (Chapter 14, Section 14.2)

Q10. Can we make algorithms fair by removing race and gender from the data?

A: No. Removing protected attributes like race or gender from a model's features does not eliminate bias because of redundant encoding -- other features in the data carry the same information. Zip code correlates with race due to residential segregation. Name patterns, language use, educational institution, and many other features serve as proxies for protected characteristics. The Amazon hiring algorithm case illustrates this: gender was never an explicit input, yet the algorithm learned to penalize resumes that included the word "women's" (as in "women's chess club") and graduates of all-women's colleges, because those features encoded gender. Excluding the gender variable did not remove gender from the data. Effective bias mitigation requires structural interventions -- changing training data, modifying optimization targets, applying fairness constraints -- not just feature removal. (Chapter 14, Section 14.3)

Q11. What is the impossibility theorem in algorithmic fairness?

A: The impossibility theorem, proved independently by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), demonstrates that when two groups have different base rates (different proportions of positive outcomes), a prediction system cannot simultaneously satisfy calibration (the same score means the same probability of the outcome for both groups), equal false positive rates, and equal false negative rates -- except in the degenerate case of a perfect predictor. This is a mathematical fact, not a limitation of any specific algorithm. The practical consequence is that there is no single, value-neutral definition of "fairness" -- choosing which fairness criterion to prioritize is an ethical and political decision, not a technical one. (Chapter 15, Section 15.6)
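A deliberately extreme numeric sketch, with invented numbers, shows the tension. Give everyone the crudest calibrated score -- their group's base rate -- and apply a single decision threshold:

```python
# Extreme worked example of the impossibility result (invented numbers).
# Scoring everyone at their group's base rate is trivially calibrated:
# among people scored s, a fraction s have the outcome.
def group_error_rates(score, threshold=0.3):
    predicted_positive = score >= threshold   # one decision for the whole group
    fpr = 1.0 if predicted_positive else 0.0  # every negative flagged, or none
    fnr = 0.0 if predicted_positive else 1.0  # every positive caught, or none
    return fpr, fnr

# Group A has base rate 0.5, Group B has base rate 0.2. Both scores are
# calibrated, yet any threshold between 0.2 and 0.5 makes the error rates
# diverge completely.
fpr_a, fnr_a = group_error_rates(score=0.5)  # (1.0, 0.0)
fpr_b, fnr_b = group_error_rates(score=0.2)  # (0.0, 1.0)
print(f"Group A: FPR={fpr_a}, FNR={fnr_a}; Group B: FPR={fpr_b}, FNR={fnr_b}")
```

Real systems are less stark, but the theorem guarantees that some gap in calibration or error rates must remain whenever base rates differ.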

Q12. What are model cards and datasheets?

A: Model cards (Mitchell et al., 2019) are standardized documentation for machine learning models that describe the model's purpose, architecture, training data, performance metrics (including disaggregated performance across demographic groups), intended uses, out-of-scope uses, ethical considerations, and limitations. Datasheets (Gebru et al., 2021) are standardized documentation for datasets that describe the dataset's motivation, composition, collection process, preprocessing, recommended uses, and known limitations. Together, they form a documentation pipeline for responsible AI: datasheets document the inputs, and model cards document the system built from those inputs. (Chapter 29, Sections 29.2-29.3)
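The documentation fields above can be captured as structured data. The sketch below is one illustrative layout, not an official model-card schema, and every value (including the model name) is a placeholder:

```python
# Minimal sketch of a model card as structured data. Field names follow the
# categories listed above (Mitchell et al., 2019); all values are placeholders.
from dataclasses import dataclass, field, asdict

@dataclass
class ModelCard:
    model_name: str
    purpose: str
    training_data: str
    intended_uses: list = field(default_factory=list)
    out_of_scope_uses: list = field(default_factory=list)
    # Disaggregated metrics: performance reported per demographic group.
    metrics_by_group: dict = field(default_factory=dict)
    limitations: list = field(default_factory=list)

card = ModelCard(
    model_name="readmission-risk-v2",  # hypothetical model
    purpose="Flag patients at high risk of 30-day readmission",
    training_data="2018-2023 discharge records (see accompanying datasheet)",
    intended_uses=["clinical triage support with human review"],
    out_of_scope_uses=["insurance pricing", "fully automated denial of care"],
    metrics_by_group={"group_a": {"auc": 0.81}, "group_b": {"auc": 0.74}},
    limitations=["performance gap between groups under investigation"],
)
print(asdict(card)["metrics_by_group"])
```

Keeping documentation machine-readable like this lets an organization require, and audit, a completed card before any model ships.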

Q13. What is explainable AI (XAI), and why does it matter?

A: Explainable AI refers to methods for making algorithmic decisions understandable to humans. Common XAI techniques include LIME (Local Interpretable Model-Agnostic Explanations), which approximates a complex model's behavior locally with a simpler model, and SHAP (SHapley Additive exPlanations), which attributes a prediction to the contribution of individual features. XAI matters for governance because: (1) affected individuals have a right to understand decisions about them (GDPR Article 22 restricts decisions based solely on automated processing that have legal or similarly significant effects); (2) auditors need to understand how a system works to identify bias; (3) developers need to diagnose and fix errors. However, the chapter warns against "transparency theater" -- explanations that appear informative but do not actually enable understanding or accountability. (Chapter 16)
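LIME's core move -- fitting a simple local model around one prediction -- can be sketched in a few lines of NumPy. The "black box" below is an invented function standing in for a real model, not the LIME library itself:

```python
# Bare-bones sketch of LIME's core idea: approximate a complex model's
# behavior near one input with a linear model fit to perturbed samples.
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Invented stand-in for a trained model: nonlinear in feature 0,
    # linear in feature 1.
    return np.sin(3 * X[:, 0]) + 2 * X[:, 1]

x0 = np.array([0.1, 0.5])                       # the instance to explain
X = x0 + rng.normal(scale=0.05, size=(500, 2))  # perturb around x0
y = black_box(X)

# Fit a local linear surrogate via least squares; its coefficients are the
# "explanation" -- each feature's approximate local effect on the prediction.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("local feature weights:", coef[:2])  # roughly [3*cos(0.3), 2]
```

The surrogate is only valid near `x0` -- which is precisely why "local" is in LIME's name, and why a single explanation should never be read as a global account of the model.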

Q14. What is a deepfake, and what are the governance implications?

A: A deepfake is synthetic media -- typically video, audio, or images -- generated by AI to convincingly depict a person saying or doing something they never actually said or did. Governance implications include: non-consensual intimate imagery, political manipulation, fraud (voice cloning for scam calls), and erosion of trust in authentic media (the "liar's dividend" -- even real evidence can be dismissed as a deepfake). Governance responses include watermarking and provenance standards (C2PA), platform content policies, and legislation criminalizing specific uses. (Chapter 18, Section 18.4)


Regulation and Governance

Q15. What is the GDPR, and does it apply to companies outside Europe?

A: The General Data Protection Regulation is the European Union's comprehensive data protection law, in effect since May 25, 2018. It applies to any organization, anywhere in the world, that processes the personal data of individuals in the EU -- if the organization offers goods or services to EU residents or monitors their behavior. This extraterritorial scope means that a US-based company with EU customers must comply with the GDPR even if it has no offices in Europe. Key requirements include: lawful basis for processing, data minimization, purpose limitation, data subject rights (access, rectification, erasure, portability), data protection by design and default, mandatory breach notification within 72 hours, and accountability (demonstrating compliance). Maximum penalties are 4% of global annual turnover or 20 million euros, whichever is higher. (Chapter 20, Section 20.4)

Q16. Why doesn't the US have a comprehensive federal privacy law?

A: The US uses a sectoral approach to data protection: HIPAA for health data, FERPA for educational records, COPPA for children's data, GLBA for financial data, and the FTC Act as a residual consumer protection authority. There is no single federal law equivalent to the GDPR. This is partly historical (US privacy law developed in response to specific scandals rather than as a comprehensive framework), partly ideological (the US tradition emphasizes market self-regulation and is skeptical of comprehensive regulation), and partly political (industry lobbying and disagreements between state and federal authority have blocked comprehensive legislation). The result is a patchwork with significant gaps -- particularly for data collected by entities that do not fit neatly into any regulated sector. State laws like the CCPA/CPRA are partially filling these gaps. (Chapter 20, Section 20.3)

Q17. What is the EU AI Act?

A: The EU AI Act (2024) is the world's first comprehensive regulation of artificial intelligence. It classifies AI systems into four risk categories: prohibited (social scoring, subliminal manipulation, real-time remote biometric identification in publicly accessible spaces, with narrow exceptions); high-risk (AI in critical infrastructure, education, employment, law enforcement) subject to extensive requirements; limited-risk (chatbots, deepfakes) subject to transparency obligations; and minimal-risk with no specific requirements. It also regulates general-purpose AI models, including large language models. (Chapter 21)

Q18. What is data localization, and why is it controversial?

A: Data localization refers to laws or policies requiring that data about a country's residents be stored within that country's borders. Proponents argue it protects citizens' data from foreign government access, strengthens national sovereignty, and supports domestic industry. Critics argue it fragments the global internet, increases costs for businesses, and can be used by authoritarian governments to control information. Examples include Russia's data localization law, China's PIPL requirements for critical infrastructure, and India's conditional localization provisions. The tension between free data flows and data sovereignty is one of the defining governance challenges of the field. (Chapter 23, Section 23.3)

Q19. What is the difference between compliance and ethics?

A: Compliance is the practice of conforming to laws, regulations, and standards. Ethics is the practice of doing what is right, which may go beyond or even conflict with legal requirements. A system can be legally compliant and ethically problematic (e.g., a HIPAA-compliant health algorithm that systematically disadvantages Black patients). A practitioner can act ethically and violate regulations (e.g., a whistleblower who discloses corporate wrongdoing in violation of a non-disclosure agreement). Data ethics programs go beyond compliance by asking not just "Is this legal?" but "Is this right? Who could be harmed? What would we think of this practice if it were public?" (Chapter 26, Section 26.1)

Q20. What is a Privacy Impact Assessment (PIA)?

A: A PIA is a systematic process for evaluating how a proposed project or system collects, uses, and manages personal information. It involves: describing the data flows, identifying privacy risks, evaluating the severity and likelihood of each risk, proposing mitigation measures, consulting affected stakeholders, and documenting the assessment for review. The GDPR requires Data Protection Impact Assessments (DPIAs) for processing that is "likely to result in a high risk" to individuals' rights (Article 35). PIAs should be conducted before a system is built, not after deployment. (Chapter 28)


Specific Topics

Q21. How does algorithmic bias affect healthcare?

A: The Obermeyer et al. (2019) study demonstrated that a widely used healthcare algorithm systematically underestimated the health needs of Black patients because it used healthcare spending as a proxy for health need. Because Black patients have historically had less access to healthcare, they spent less, and the algorithm interpreted lower spending as lower need -- when in fact they were sicker at every spending level. This finding applies beyond one algorithm: any healthcare system that uses historical utilization data as a proxy for need risks embedding the consequences of past discrimination. VitraMed's patient risk model faces this exact challenge. (Chapter 14, Section 14.5)

Q22. What is the environmental cost of AI?

A: AI systems consume significant energy for training and inference. Strubell et al. (2019) estimated that training a large NLP model (with neural architecture search) can produce as much CO2 as five cars over their lifetimes. Data centers consumed an estimated 460 terawatt-hours of electricity globally in 2022, comparable to the total electricity consumption of a mid-size country. Environmental costs include: energy consumption (electricity for computing and cooling), water consumption (for data center cooling), e-waste (from rapid GPU obsolescence), and mineral extraction (for hardware manufacturing). These costs are unevenly distributed, with data center environmental impacts disproportionately affecting specific communities. (Chapter 34)

Q23. What are data cooperatives?

A: Data cooperatives are organizations owned and democratically governed by their members, who collectively control how their data is collected, used, and shared. Unlike the current model where platforms own user data, a data cooperative would allow members to pool their data, negotiate collectively with companies seeking access, and ensure that the value generated from their data benefits the membership. Examples include MIDATA (Switzerland), Driver's Seat Cooperative (US), and Salus Coop (Barcelona). Data cooperatives are one of the participatory governance models discussed in Chapter 39.

Q24. What is the "Brussels Effect"?

A: The Brussels Effect refers to the EU's ability to shape global regulatory standards through the sheer size of its market. When the EU adopts a regulation like the GDPR or the AI Act, companies that want access to the EU market must comply -- and often find it simpler to adopt the EU standard globally rather than maintaining different practices for different jurisdictions. This means EU regulations effectively set the floor for global standards, even in countries that have not adopted similar laws. The term was coined by Anu Bradford. (Chapter 20, Section 20.4)

Q25. What is differential privacy?

A: Differential privacy is a mathematical framework for releasing statistical information about a dataset without revealing information about any individual in the dataset. It works by adding carefully calibrated random noise to query results. The key parameter, epsilon, controls the privacy-accuracy trade-off: smaller epsilon means stronger privacy but noisier results; larger epsilon means more accurate results but weaker privacy. The mathematical guarantee is that the output of any query changes only slightly whether or not any single individual's data is included -- which sharply limits what an observer can infer about whether a specific person is in the dataset. The US Census Bureau used differential privacy for the 2020 Census. (Chapter 10, Section 10.5)
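The standard mechanism for a counting query can be sketched in a few lines. A count has sensitivity 1 (one person's presence changes it by at most 1), so Laplace noise with scale 1/epsilon gives epsilon-differential privacy; the counts below are invented:

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import numpy as np

rng = np.random.default_rng(7)

def private_count(true_count, epsilon, sensitivity=1.0):
    # Noise scale = sensitivity / epsilon: tighter budget, more noise.
    return true_count + rng.laplace(scale=sensitivity / epsilon)

true_count = 1000  # e.g., "how many patients have condition X?" (invented)
for eps in (0.1, 1.0, 10.0):
    answers = [round(private_count(true_count, eps)) for _ in range(5)]
    print(f"epsilon={eps:>4}: {answers}")
# Smaller epsilon -> noisier answers and stronger privacy: the analyst still
# sees a useful aggregate while any one individual's presence stays masked.
```

Note what the epsilon loop makes visible: the privacy-accuracy trade-off is a dial, not a switch, and choosing where to set it is a governance decision.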


Ethics and Values

Q26. What are the main ethical frameworks used in data ethics?

A: Chapter 6 introduces five frameworks: (1) Utilitarianism -- maximize overall well-being; evaluate data practices by their consequences for all affected parties. (2) Deontology (Kant) -- respect persons as ends in themselves, never merely as means; uphold duties and rights regardless of consequences. (3) Virtue ethics (Aristotle) -- cultivate character traits (honesty, justice, courage) that enable good practice. (4) Care ethics -- prioritize relationships, vulnerability, and the responsibilities that flow from them. (5) Justice theory (Rawls) -- evaluate data systems from behind a "veil of ignorance," prioritizing the interests of the least advantaged. The chapter argues for moral pluralism: no single framework captures all ethical considerations, and practitioners should be able to apply multiple frameworks and navigate their disagreements.

Q27. What is Rawls's "veil of ignorance," and how does it apply to data governance?

A: The veil of ignorance is a thought experiment: imagine you must design the rules governing a data system, but you do not know which position in the system you will occupy -- you might be the data collector, the data subject, a vulnerable community member, or a regulator. From behind this veil, rational people would design rules that protect the most vulnerable, because they might turn out to be that person. Applied to data governance, the veil suggests that we should design data systems as if we might be the person most harmed by them. This provides a powerful test for evaluating surveillance systems, algorithmic decision-making, and data collection practices. (Chapter 6, Section 6.6)

Q28. How should I respond when my manager asks me to do something I believe is unethical?

A: The chapter on personal responsibility (Chapter 40) provides a framework: (1) Verify your ethical concern -- are you confident the practice is harmful, or might you be missing context? (2) Raise the concern constructively -- frame it as a question rather than an accusation, and propose alternatives. (3) Document everything. (4) Know your organization's reporting channels -- does it have an ethics committee, ombudsperson, or whistleblower policy? (5) Assess the severity -- is the harm imminent and irreversible, or is there time for internal resolution? (6) Know your legal protections -- many jurisdictions have whistleblower protection laws. (7) Have a support network -- mentors, professional organizations, and personal contacts who can provide advice. The Practitioner's Oath (Section 40.8) includes the provision "I will speak up when I see data practices that cause harm, even when silence would be easier." Living up to this provision requires preparation, not just aspiration.


Career and Professional Development

Q29. What careers are available in data ethics and governance?

A: The field offers diverse career pathways including: Data Protection Officer, Privacy Engineer, AI Ethics Researcher, Chief Data Officer, Privacy Analyst/Consultant, Policy Analyst (data/AI), Algorithmic Auditor, Content Policy Specialist, and Digital Rights Advocate. The most in-demand roles combine technical skill with governance knowledge -- organizations increasingly need people who can both build systems and evaluate their social impacts. See Appendix E, Section E.3 for detailed career guidance.

Q30. Do I need a law degree to work in data governance?

A: No. Data governance is an interdisciplinary field that draws on law, technology, social science, philosophy, and management. While a law degree is valuable (and sometimes required) for Data Protection Officer roles and regulatory positions, many governance roles require technical skills, organizational design experience, or policy analysis capabilities that are better developed through other educational pathways. Certifications like CIPP, CIPM, and CIPT demonstrate competence without requiring a law degree.

Q31. What programming skills are useful for data ethics work?

A: Python is the most useful single language, providing access to data analysis (pandas), statistical modeling (scikit-learn), visualization (matplotlib), and the machine learning ecosystem. SQL is valuable for querying databases and understanding data infrastructure. R is useful for statistical analysis and is common in academic research. Beyond specific languages, the most important skills are understanding data structures, writing code to audit algorithmic systems, and translating between technical and non-technical audiences. The Python chapters in this textbook (10, 14, 15, 22, 27, 29, 34) and Appendix G provide a foundation.
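As a small taste of the audit work mentioned above, a few lines of pandas can disaggregate a model's accuracy by group. The labels and predictions below are invented for illustration:

```python
# Disaggregated accuracy audit: aggregate accuracy can hide a group-level gap.
# All labels and predictions are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "group":     ["A"] * 4 + ["B"] * 4,
    "label":     [1, 0, 1, 0, 1, 0, 1, 0],
    "predicted": [1, 0, 1, 1, 0, 0, 0, 0],
})

df["correct"] = df["label"] == df["predicted"]
overall = df["correct"].mean()            # 0.625 overall
by_group = df.groupby("group")["correct"].mean()
print(f"overall accuracy: {overall}")
print(by_group)                           # A: 0.75, B: 0.50
```

Reporting only the overall number would miss that Group B fares substantially worse -- which is why model cards call for disaggregated metrics.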


Contemporary Issues

Q32. How does generative AI change data governance?

A: Generative AI introduces at least five new governance challenges: (1) Training data rights -- models are trained on vast quantities of data, often without the explicit consent of creators, raising copyright and consent questions. (2) Output ownership -- who owns AI-generated content? (3) Deepfakes and synthetic media -- convincing fake content at scale undermines trust. (4) Hallucination -- models produce plausible but false outputs. (5) Data provenance -- it becomes difficult to distinguish AI-generated content from human-created content. Existing governance frameworks were designed for a world where data was collected from people and used to make decisions about them; generative AI adds a world where data is used to create new content that can affect people in entirely different ways. (Chapter 18)

Q33. What is the role of AI in misinformation?

A: AI contributes to misinformation through both production and distribution. On the production side, generative AI makes it easier and cheaper to produce convincing false content -- text, images, audio, and video. On the distribution side, recommendation algorithms amplify content that generates engagement, and research shows that false, novel, and emotionally charged content generates more engagement than accurate content (Vosoughi et al., 2018). The combination creates a structural advantage for misinformation. Governance responses include platform transparency requirements, content provenance standards, media literacy education, and regulatory frameworks like the EU's Digital Services Act. (Chapter 31)

Q34. What is "surveillance capitalism"?

A: A term coined by Shoshana Zuboff to describe an economic system in which human experience is claimed as free raw material for the production of prediction products that are sold to business customers. The key concept is "behavioral surplus" -- data collected beyond what is needed to improve a service, extracted and used to predict and influence behavior. Zuboff argues that surveillance capitalism represents a new form of power that operates by knowing and shaping human behavior without the knowledge or consent of those being observed. (Chapter 4, Section 4.2; Chapter 8)

Q35. How does data governance relate to climate change?

A: Data governance and climate change intersect in two directions. First, data systems contribute to climate change through energy consumption, water usage, and e-waste. Second, data systems are essential tools for understanding and responding to climate change -- climate modeling, environmental monitoring, and renewable energy optimization all depend on large-scale data processing. Responsible data governance must address both: minimizing the environmental costs of data systems while maximizing their environmental benefits. This requires carbon-aware scheduling, energy-efficient model architectures, transparency about environmental impacts, and governance frameworks that account for environmental justice. (Chapter 34)

Q36. What is indigenous data sovereignty?

A: Indigenous data sovereignty asserts that indigenous peoples have the right to control the collection, ownership, and application of data about their communities, lands, and cultural knowledge. It is grounded in the broader principle of indigenous self-determination and responds to the historical use of data as an instrument of colonial control. The CARE Principles (Collective benefit, Authority to control, Responsibility, Ethics) operationalize this sovereignty. Indigenous data sovereignty challenges Western data governance models that center individual rights by insisting on collective rights, relational accountability, and cultural protocols for data use. (Chapters 3, 32, 37)

Q37. What is "digital redlining"?

A: Digital redlining refers to practices in which digital technologies create, reinforce, or reproduce patterns of exclusion and discrimination that mirror historical redlining (the systematic denial of services to residents of specific neighborhoods, typically based on race). Examples include: internet service providers offering slower or more expensive service in low-income neighborhoods; algorithmic credit decisions that disadvantage applicants from historically redlined zip codes; targeted advertising that excludes certain demographic groups from seeing housing or employment ads. Digital redlining demonstrates that the digital divide is not merely about access but about the quality and terms of digital participation. (Chapter 32)

Q38. What is the "right to be forgotten"?

A: The "right to be forgotten" (more precisely, the right to erasure under GDPR Article 17) grants individuals the right to request that a data controller delete their personal data when: the data is no longer necessary for its original purpose, the individual withdraws consent, the individual objects to processing, the data was unlawfully processed, or deletion is required by law. The right is not absolute -- it must be balanced against freedom of expression, public interest, legal claims, and archiving purposes. The concept originated with the 2014 CJEU ruling in Google Spain v. AEPD, which required Google to de-list search results about an individual upon request. (Chapter 7, Section 7.6; Chapter 20)

Q39. What should I do if my data has been part of a breach?

A: If you are notified of a data breach affecting your information: (1) Change passwords for the affected service and any other service where you used the same password. (2) Enable two-factor authentication on all important accounts. (3) Monitor your financial accounts and credit reports for unauthorized activity. (4) Consider placing a credit freeze with major credit bureaus if financial data was exposed. (5) Be alert for phishing attempts -- breaches often lead to targeted phishing using the breached information. (6) Exercise your rights -- under the GDPR, CCPA, and other laws, you may be entitled to information about what was compromised and what remedial measures are being taken. (7) Consider using a service like Have I Been Pwned to monitor for future breaches. (Chapter 30; Appendix E)

Q40. What is the most important thing I should take away from this textbook?

A: The textbook's central argument is captured in Dr. Adeyemi's final question: "What is your responsibility?" The answer has three dimensions. First, epistemic responsibility -- now that you understand how data systems work, who they affect, and how they can cause harm, you cannot claim ignorance. Knowledge creates obligation. Second, professional responsibility -- if you work with data in any capacity, you have a duty to apply the frameworks, methods, and ethical reasoning developed in this textbook to your practice. Third, civic responsibility -- data governance is not just a professional concern but a democratic one. As a citizen, you have the right and the responsibility to participate in shaping the data systems that govern society. The Practitioner's Oath in Chapter 40 provides one framework for living these responsibilities; the textbook's ultimate challenge is for you to develop your own. (Chapter 40)