Quiz: The Data All Around Us
Test your understanding before moving to the next chapter. Target: 70% or higher to proceed.
Section 1: Multiple Choice (1 point each)
1. Which of the following best describes the relationship between data, information, and knowledge as presented in this chapter?
- A) Data, information, and knowledge are three words for the same concept, used interchangeably depending on the discipline.
- B) Data is raw facts without context; information is data organized and contextualized; knowledge is information interpreted through experience and judgment.
- C) Data is digital; information is analog; knowledge is the combination of both.
- D) Data is objective; information is subjective; knowledge requires formal education to produce.
Answer
**B)** Data is raw facts without context; information is data organized and contextualized; knowledge is information interpreted through experience and judgment. *Explanation:* Section 1.1.2 defines these three concepts as a hierarchy of increasing interpretation. Raw temperature readings (37.2, 37.8, 38.1) are data; recognizing a rising fever pattern is information; diagnosing a bacterial infection based on that pattern and clinical experience is knowledge. Option A ignores the meaningful distinctions. Option C introduces an analog/digital divide that the chapter does not support. Option D incorrectly equates knowledge with formal education rather than interpretive judgment.

2. A researcher collects anonymized survey responses about workplace satisfaction. Which of the following would be classified as metadata in this context?
- A) An employee's written response: "I feel undervalued by my manager."
- B) The timestamp of when each survey was submitted and the IP address of the device used.
- C) The aggregate finding that 62% of respondents reported dissatisfaction.
- D) The company's decision to restructure management based on survey results.
Answer
**B)** The timestamp of when each survey was submitted and the IP address of the device used. *Explanation:* Section 1.1.3 defines metadata as "data that describes other data." The survey response (A) is the primary data itself. The aggregate finding (C) is information derived from the data. The company decision (D) is action taken based on analysis. The timestamp and IP address (B) are data *about* the survey response — when it was created and from where — which is precisely what metadata is. Note that as the chapter warns, this metadata could potentially be used to re-identify "anonymous" respondents.

3. Datafication, as described in this chapter, is best defined as:
- A) The process of converting paper documents into digital files.
- B) The transformation of qualitative human experiences into quantitative data points that can be tracked, aggregated, and analyzed.
- C) The use of databases to store information more efficiently than physical filing systems.
- D) The practice of making all government data available to the public.
Answer
**B)** The transformation of qualitative human experiences into quantitative data points that can be tracked, aggregated, and analyzed. *Explanation:* Section 1.2 distinguishes datafication from mere digitization. Option A describes digitization — converting analog to digital. Datafication goes further: it transforms aspects of life that were previously unquantified (friendship, movement, attention, health) into data that can be tracked and monetized. Option C describes database storage, a technical matter. Option D describes open data initiatives, an unrelated concept.

4. Eli learns that Smart City lampposts in his Detroit neighborhood collect ambient audio, WiFi probe requests, and license plate numbers in addition to traffic data. The traffic data is the system's stated purpose. The ambient audio and WiFi tracking data would best be classified as:
- A) Metadata
- B) Structured data
- C) Data exhaust
- D) Sensitive data
Answer
**C)** Data exhaust. *Explanation:* Section 1.2.2 defines data exhaust as information generated as a side effect of digital activities — data that is not the stated purpose of the interaction but is collected as a byproduct. The lampposts were deployed for traffic optimization; the audio, WiFi probes, and license plates are byproduct data collected alongside the primary function. While this data may also be sensitive (D), the question asks for the best classification of its *relationship* to the stated purpose, which is data exhaust. Metadata (A) would describe data about other data, not independent data streams. The data may be structured (B), but that describes its format, not its origin.

5. According to the chapter, approximately what percentage of the world's data is unstructured?
- A) 20-30%
- B) 40-50%
- C) 60-70%
- D) 80-90%
Answer
**D)** 80-90%. *Explanation:* Section 1.3.2 states that "an estimated 80-90% of the world's data is unstructured." This matters for governance because most data protection regulations were designed with structured data in mind — clear fields like name, address, and date of birth. Governing unstructured data (smart speaker recordings, facial images in crowds, mouse movement patterns) presents fundamentally harder challenges.

6. Which of the following is the strongest reason why the distinction between personal and non-personal data is described as "far less clear than it appears"?
- A) Non-personal data is always less valuable than personal data.
- B) All data is technically personal because someone had to create it.
- C) Data that appears non-personal can often be re-identified and linked back to specific individuals.
- D) Legal definitions of personal data vary slightly between countries.
Answer
**C)** Data that appears non-personal can often be re-identified and linked back to specific individuals. *Explanation:* Section 1.3.1 makes this point through the AOL search log example: data that was "anonymized" by replacing names with numerical IDs was re-identified within days using search pattern analysis. The key insight is that seemingly non-personal data can become personal when combined with other information. Option A is factually incorrect (non-personal data can be immensely valuable). Option B is a philosophical overreach not supported by the chapter. Option D, while true, describes a surface-level complexity rather than the fundamental problem the chapter highlights.

7. In the VitraMed data lifecycle example, patient data is scheduled for deletion seven years after the last clinical visit. Eli objects that derivative models trained on the data persist indefinitely. This objection highlights a limitation at which stage of the data lifecycle?
- A) Collection — because the data should not have been collected in the first place.
- B) Processing — because the data was transformed in ways the patient did not anticipate.
- C) Deletion — because "deleting" source data does not eliminate patterns learned by models trained on it.
- D) Sharing — because VitraMed should not share data with third parties.
Answer
**C)** Deletion — because "deleting" source data does not eliminate patterns learned by models trained on it. *Explanation:* Section 1.4.2 presents Eli's challenge directly: if patterns from patient data live on indefinitely in predictive models, then deleting the original records does not achieve true deletion. This raises the question of what "deletion" actually means in an era of machine learning. The issue is not that data should never have been collected (A), nor specifically about processing transformations (B) or third-party sharing (D), though those raise separate concerns. Eli's objection targets the gap between formal deletion and functional persistence.

8. The chapter presents an "Applied Framework" for evaluating any data system. Which of the following questions does the framework explicitly include?
- A) "How much revenue does the data generate?"
- B) "What governance mechanisms exist — and are they adequate?"
- C) "Is the data stored in the cloud or on-premises?"
- D) "How many data points are in the dataset?"
Answer
**B)** "What governance mechanisms exist — and are they adequate?" *Explanation:* Section 1.7 presents the six-question Applied Framework. The six questions ask: What data is collected? By whom? For what stated purpose? What unstated purposes might it serve? Who benefits and who bears the risk? And: What governance mechanisms exist — and are they adequate? Revenue generation (A), storage architecture (C), and dataset size (D) are not part of this framework. The framework is designed to surface power dynamics and accountability, not technical specifications.

9. Which of the following scenarios best illustrates the concept of the "quantified self" as discussed in Section 1.2.3?
- A) A hospital using electronic health records to track patient outcomes across departments.
- B) A government census that counts the population every ten years.
- C) An individual using a smartwatch to track daily steps, heart rate, sleep quality, and calories consumed, then reviewing weekly trends.
- D) A company analyzing employee keystrokes to measure productivity.
Answer
**C)** An individual using a smartwatch to track daily steps, heart rate, sleep quality, and calories consumed, then reviewing weekly trends. *Explanation:* Section 1.2.3 defines the quantified self as "the voluntary use of technology to track personal metrics" at the individual level. Option C describes voluntary, individual-level self-tracking using consumer technology, which is the core of the movement. Option A is institutional health data management. Option B is government data collection. Option D is employer surveillance — notably involuntary — not self-tracking. The chapter is careful to distinguish voluntary self-quantification from institutional data collection, even while noting that the voluntariness of quantified self practices can be complicated by employer or insurer incentives.

10. Dr. Adeyemi responds to the "nothing to hide" argument by pointing out that privacy protects more than just concealment of wrongdoing. According to the chapter, privacy also protects:
- A) Corporate trade secrets and competitive advantages.
- B) Autonomy, dignity, intellectual freedom, and political dissent.
- C) The efficiency of government data collection programs.
- D) The right to monetize one's own personal data.
Answer
**B)** Autonomy, dignity, intellectual freedom, and political dissent. *Explanation:* Section 1.5.2 quotes Dr. Adeyemi directly: "The 'nothing to hide' argument assumes that the only purpose of privacy is to conceal wrongdoing. But privacy also protects autonomy, dignity, intellectual freedom, and political dissent." This reframing is central to the chapter's argument that privacy is a social and structural concern, not merely an individual one. Options A, C, and D describe other concepts (trade secrets, government efficiency, data monetization) that are not part of this specific argument.

Section 2: True/False with Justification (1 point each)
For each statement, determine whether it is true or false and provide a brief justification.
11. "Data exhaust is inherently less valuable than data that is intentionally collected, because it is a mere byproduct of other activities."
Answer
**False.** *Explanation:* Section 1.2.2 explicitly states that data exhaust "is often more revealing than primary data" and that "companies have built billion-dollar businesses on the collection and analysis of data exhaust." The entire digital advertising industry depends on it. Eli's example of Smart City sensors further illustrates that the byproduct data (audio, WiFi probes, license plates) may be more consequential than the primary traffic data. The "byproduct" label does not imply lesser value — in many cases, data exhaust is the most commercially and socially significant data a system produces.

12. "Under the GDPR, a cookie ID that can be linked back to a specific person qualifies as personal data."
Answer
**True.** *Explanation:* Section 1.3.1 states directly: "Under the EU's General Data Protection Regulation (GDPR), even a cookie ID that can be linked back to you counts as personal data." The GDPR defines personal data broadly as any information relating to an identified or identifiable natural person. This expansive definition means that many types of data often assumed to be non-personal — device identifiers, IP addresses, cookie IDs — fall within its scope when they can be connected to an individual.

13. "The data lifecycle model presented in the chapter is strictly linear — once data passes through one stage, it does not return to earlier stages."
Answer
**False.** *Explanation:* Section 1.4.1 explicitly includes a feedback loop in the lifecycle diagram, showing that data can cycle back from deletion to collection. The text labels this "(or back to collection: feedback loops)." In practice, insights from analysis may trigger new collection, shared data may be re-processed by recipients, and retention decisions may lead to re-analysis. The lifecycle is presented as generally sequential but with acknowledged circularity.

14. "The chapter argues that the classification of certain data as 'sensitive' is an objective, universal determination based on the inherent nature of the data itself."
Answer
**False.** *Explanation:* Section 1.3.3 presents data classification as a governance *decision* with significant consequences, not an objective fact. The Mira example illustrates this directly: her university classified GPA data as "sensitive" but mental health service usage as merely "confidential," even though a strong case can be made that mental health data warrants greater protection. The chapter notes that the classification "was written by IT, not by anyone who thought about it from the student's perspective." Sensitivity classifications reflect institutional priorities, power dynamics, and the perspective of whoever designs the classification system.

15. "According to the chapter, the primary reason data governance matters is to ensure that organizations comply with government regulations."
Answer
**False.** *Explanation:* While regulatory compliance is one aspect of data governance, the chapter frames the issue far more broadly. Section 1.5 argues that data governance exists because "the gap between what data *can* do and what it *should* do is not self-correcting." Section 1.5.3 identifies data governance as a social and ethical concern, stating that "data systems are social systems" that reflect and reinforce social structures. The chapter's argument is that governance is necessary to address power asymmetries, protect human rights, and ensure accountability — goals that go well beyond regulatory compliance.

Section 3: Short Answer (2 points each)
16. Explain why former NSA General Counsel Stewart Baker's statement — "If you have enough metadata, you don't really need content" — is significant for understanding data privacy. Use the MetaPhone study example to support your answer.
Sample Answer
Baker's statement is significant because it challenges the common assumption that metadata is trivial compared to content. Most people think of metadata as mundane technical details (timestamps, file sizes), but Baker — speaking from inside the intelligence community — acknowledges that metadata reveals the patterns, relationships, and behaviors of a person's life. The MetaPhone study at Stanford demonstrated this concretely: from phone metadata alone (no call content), researchers identified a marijuana grower (calls to dispensaries and hydroponics stores), a person with a heart condition who owned a firearm, and a person who was likely pregnant. These inferences were drawn entirely from patterns of who called whom, when, and for how long. This matters for data privacy because metadata is often subject to weaker legal protections than content, yet it can paint an equally intimate — or more intimate — portrait of a person's life.

*Key points for full credit:*
- Recognizes that metadata can be as revealing as content
- References the MetaPhone study with at least one specific example
- Notes the governance implication: metadata often receives weaker protection despite its revealing nature

17. The chapter uses the phrase "data is people" (from the opening Cathy O'Neil quote) and later argues that "data systems are social systems" (Section 1.5.3). Explain the connection between these two ideas. Why does the chapter emphasize them?
Sample Answer
The two ideas are connected because they both push against the misconception that data is neutral, abstract, or purely technical. "Data is people" reminds us that behind every data point is a human being whose life is being represented, reduced, and acted upon. "Data systems are social systems" extends this to the institutional level: the systems that collect, store, analyze, and act on data are designed by people, shaped by existing power structures, and experienced differently by different communities. A loan algorithm that denies credit to residents of historically redlined neighborhoods is not simply "following the data" — it is reproducing racial discrimination through a technical system. The chapter emphasizes these ideas because its central thesis is that data governance is a social and ethical challenge, not merely a technical one. Understanding this prevents us from treating data problems as engineering problems that can be solved without engaging questions of power, justice, and accountability.

*Key points for full credit:*
- Connects "data is people" (individual level) to "data systems are social systems" (institutional level)
- Explains why the neutrality of data is a misconception
- Provides at least one example of how data systems reproduce social structures

18. Describe two governance challenges that are specific to unstructured data (as opposed to structured data). Why are these challenges harder to address with existing regulatory frameworks?
Sample Answer
First, unstructured data is difficult to search, audit, and apply rules to. A structured database with fields like "name" and "date of birth" can be systematically scanned for personal data and subjected to access controls, retention schedules, and deletion procedures. Unstructured data — such as a conversation captured by a smart speaker, free-text notes in a medical record, or a photograph — may contain personal information embedded in ways that are not easily searchable or identifiable by automated systems.

Second, unstructured data is harder to classify consistently. A single photograph might simultaneously contain biometric data (a person's face), location data (a street sign in the background), and health data (a visible medical device). Applying appropriate protections requires understanding the content in context, which is far more complex than applying rules to labeled database fields.

Existing regulatory frameworks were largely designed with structured data in mind — databases with clear fields and categories. They struggle to address the ambiguity and richness of unstructured data, which constitutes 80-90% of all data generated.

*Key points for full credit:*
- Identifies at least two distinct governance challenges specific to unstructured data
- Explains why existing frameworks (designed for structured data) are insufficient
- References the 80-90% statistic or the examples from Section 1.3.2

19. Using the Applied Framework from Section 1.7, analyze the following situation in three to four sentences: A grocery store chain introduces a loyalty card program. Customers who use the card receive discounts, and the store tracks every item they purchase, when they shop, and which store location they visit.
Sample Answer
The data collected includes itemized purchase history, transaction timestamps, store locations, and any personal details tied to the loyalty card (name, email, phone number). The grocery chain controls this data, and the stated purpose is to provide personalized discounts and improve inventory management. However, unstated purposes might include selling purchase data to third-party advertisers, health insurers, or data brokers — a customer's purchase of alcohol, tobacco, or pregnancy tests could be shared without their knowledge. The customers benefit from discounts, but they bear the risk of profiling, price discrimination, and data breaches; the company benefits disproportionately from the data's long-term analytical value. Governance mechanisms are likely limited to a privacy policy that few customers read, with no independent oversight of how the data is shared or retained.

*Key points for full credit:*
- Applies at least four of the six framework questions
- Distinguishes between stated and unstated purposes
- Identifies an asymmetry in who benefits vs. who bears the risk

Section 4: Applied Scenario (5 points)
20. Read the following scenario and answer all parts.
Scenario: FitTrack University
Millbrook University partners with FitTrack, a wearable fitness technology company, to launch a campus health initiative. Every incoming first-year student receives a free FitTrack smartwatch during orientation. The watch tracks steps, heart rate, sleep duration, GPS location on campus, and calories burned. Students who average 7,500 steps per day for the semester receive a $200 bookstore credit.
The university's stated goals are to "promote student wellness and reduce first-year dropout rates." FitTrack's press release adds that the data will help them "develop next-generation health insights." Data is stored on FitTrack's cloud servers. The university's institutional research office receives aggregate reports (e.g., average sleep duration by residence hall). FitTrack retains individual-level data for "product improvement and research partnerships."
Midway through the semester, a student newspaper investigation reveals that FitTrack has shared de-identified data with a health insurance company interested in actuarial modeling for young adults. FitTrack's privacy policy — which students clicked "I agree" to during a hectic orientation day — permits this sharing.
(a) Identify at least four types of data being collected and classify each as personal data, non-personal data, or data whose classification is ambiguous. Explain your reasoning for at least one ambiguous classification. (1 point)
(b) Using the concept of data exhaust from Section 1.2.2, identify at least two examples of data exhaust in this scenario — data that is generated as a byproduct of the wellness program's stated purpose. (1 point)
(c) Trace the data through the seven-stage lifecycle. At which stage(s) do the most significant governance failures occur? Identify at least two. (1 point)
(d) Evaluate whether student participation in this program is meaningfully voluntary, drawing on the chapter's discussion of the quantified self (Section 1.2.3) and the "nothing to hide" argument (Section 1.5.2). Consider economic incentives, social pressure, and information asymmetry. (1 point)
(e) Propose three specific governance measures that Millbrook University should have implemented before launching this program. For each measure, explain which stage of the data lifecycle it addresses and what harm it would prevent. (1 point)
Sample Answer
**(a)** Data types collected include:

- **Steps, heart rate, calories burned** — personal health/biometric data (clearly personal, as it relates to an identifiable individual via their watch assignment)
- **Sleep duration** — personal health data (clearly personal and sensitive, as it reveals behavioral patterns)
- **GPS location on campus** — personal location data (clearly personal; reveals which buildings, classes, and social spaces a student frequents)
- **Aggregate sleep duration by residence hall** — this classification is ambiguous. In large residence halls, it may be genuinely non-personal. But in small residence halls or specific floors, aggregate data could potentially be linked back to individuals, especially when combined with other data. This mirrors the chapter's point about re-identification: data that appears non-personal can become personal in context.

**(b)** GPS location data is data exhaust relative to the wellness program's stated purpose: the program aims to incentivize physical activity (steps), but GPS tracking reveals *where* students go, not just whether they walk — information far beyond what is needed to count steps. Similarly, heart rate variability patterns, which can indicate stress, anxiety, or alcohol consumption, are generated as a byproduct of basic fitness tracking and are not necessary for the step-count incentive.

**(c)** The most significant governance failures occur at:

- **Collection:** The scope of data collected far exceeds what is necessary for the stated purpose (promoting wellness through step counts). GPS location and continuous heart rate monitoring are not required to count steps. This violates the principle of data minimization.
- **Sharing:** FitTrack shared de-identified data with a health insurance company without meaningful student awareness. While technically permitted by the privacy policy, students clicked "I agree" during a hectic orientation — a textbook example of the gap between formal consent and meaningful consent. The sharing stage lacked transparency and independent oversight.
- **Retention:** FitTrack retains individual-level data indefinitely for "product improvement and research partnerships" — an open-ended justification with no defined endpoint, which means student data persists long after the wellness program's purpose is served.

**(d)** Participation is not meaningfully voluntary despite being technically optional. Three pressures undermine voluntariness:

- **Economic pressure:** A $200 bookstore credit is significant for college students, particularly those from lower-income backgrounds. Opting out has a real financial cost.
- **Social pressure:** When all first-year students receive watches during orientation, those who decline or stop wearing theirs may face social pressure or feel excluded from a shared campus experience.
- **Information asymmetry:** Students clicked a privacy policy during a hectic orientation day. They almost certainly did not understand that their health data would be shared with an insurance company. As the chapter notes in Section 1.2.3, "choice" is constrained when one party has vastly more information about how data will be used than the other. The "nothing to hide" response fails here because students had no practical way to evaluate what they were agreeing to or who would ultimately use their data.

**(e)** Three governance measures:

1. **Data minimization policy (addresses Collection):** The university should have required that only the data strictly necessary for the step-count incentive be collected — step counts and student IDs. GPS location, continuous heart rate, and sleep data should not have been collected, as they are not needed for the stated purpose. This would prevent the accumulation of sensitive data that could be misused.
2. **Transparent data sharing agreement with opt-in consent (addresses Sharing):** Before any data could be shared with third parties — especially insurance companies — students should have been informed in plain language and given a separate, specific opt-in choice (not buried in a general privacy policy). This would prevent students' health data from being shared without genuine awareness and consent.
3. **Defined retention limits with automatic deletion (addresses Retention/Deletion):** The university should have required FitTrack to delete individual-level student data within 90 days of each semester's end, with no exceptions for vaguely defined "product improvement." This would prevent indefinite retention of sensitive health data and reduce the risk of future misuse or breach.

Scoring & Review Recommendations
| Score Range | Assessment | Next Steps |
|---|---|---|
| Below 50% (&lt; 14 pts) | Needs review | Re-read Sections 1.1-1.3 carefully, redo Part A exercises |
| 50-69% (14-19 pts) | Partial understanding | Review specific weak areas, focus on Part B exercises for applied practice |
| 70-85% (20-23 pts) | Solid understanding | Ready to proceed to Chapter 2; review any missed topics briefly |
| Above 85% (24+ pts) | Strong mastery | Proceed to Chapter 2: A Brief History of Data and Society |
Total possible points: 28 (10 multiple choice + 5 true/false + 8 short answer + 5 scenario = 28)
| Section | Points Available |
|---|---|
| Section 1: Multiple Choice | 10 points (10 questions x 1 pt) |
| Section 2: True/False with Justification | 5 points (5 questions x 1 pt) |
| Section 3: Short Answer | 8 points (4 questions x 2 pts) |
| Section 4: Applied Scenario | 5 points (5 parts x 1 pt) |
| Total | 28 points |