Exercises: The Data All Around Us

These exercises progress from concept checks to challenging applications. Estimated completion time: 3-4 hours.

Difficulty Guide:

- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)


Part A: Conceptual Understanding ⭐

Test your grasp of core concepts from Chapter 1.

A.1. Section 1.1.1 describes data as a "representation" of reality, comparing it to a photograph rather than a window. Using this metaphor, explain what it means to say that every dataset is a reduction. Provide an example of a dataset, not discussed in the chapter, where what is left out might be more significant than what is included.

A.2. Explain the distinction between data, information, and knowledge as presented in Section 1.1.2. Then apply this distinction to the following raw values: "4.2, 3.8, 4.5, 2.1, 4.9." Write one sentence that transforms this data into information, and another sentence that transforms the information into knowledge.

A.3. A colleague tells you: "Metadata isn't a big deal — it's just technical stuff like file sizes and timestamps. The real privacy concern is the content of communications, not the metadata." Drawing on Section 1.1.3 and the MetaPhone study, explain why this claim is incorrect or incomplete.

A.4. In your own words, explain the difference between digitization and datafication as described in Section 1.2. Why does this distinction matter for understanding data governance?

A.5. Section 1.2.2 defines data exhaust and lists four reasons it is consequential. Identify which of those four reasons is most relevant to Eli Okonkwo's experience with Smart City sensors in his Detroit neighborhood, and explain your reasoning.

A.6. Draw or describe a diagram illustrating the seven stages of the data lifecycle (Section 1.4.1). For each stage, write one question that a data governance professional should ask. Your questions should differ from those listed in the chapter.

A.7. Section 1.3.1 explains that the boundary between personal and non-personal data is "far less clear than it appears." In two to three sentences, explain why data that appears to be non-personal can sometimes be re-identified as personal data. What does this imply for organizations that claim to collect only "anonymized" data?
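The re-identification risk in A.7 can be demonstrated in a few lines. The sketch below is a toy example with entirely fabricated records: an "anonymized" release is linked to a public dataset by matching quasi-identifiers (ZIP code, birth year, sex), the same mechanism behind well-known re-identification studies.

```python
# Toy illustration of re-identification. All records are fabricated.
medical = [  # released without names, i.e. "anonymized"
    {"zip": "48201", "birth_year": 1987, "sex": "F", "diagnosis": "asthma"},
    {"zip": "48226", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]
voters = [  # a hypothetical public record containing names
    {"name": "J. Rivera", "zip": "48201", "birth_year": 1987, "sex": "F"},
    {"name": "T. Nguyen", "zip": "48226", "birth_year": 1990, "sex": "M"},
]

def reidentify(released, public):
    """Link released records to names via exact quasi-identifier matches."""
    matches = []
    for r in released:
        key = (r["zip"], r["birth_year"], r["sex"])
        candidates = [p for p in public
                      if (p["zip"], p["birth_year"], p["sex"]) == key]
        if len(candidates) == 1:  # unique match -> identity recovered
            matches.append((candidates[0]["name"], r["diagnosis"]))
    return matches

print(reidentify(medical, voters))
```

When a combination of quasi-identifiers is unique in both datasets, the "anonymous" record links straight back to a name, which is why removing direct identifiers alone is not anonymization.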


Part B: Applied Analysis ⭐⭐

Analyze scenarios, arguments, and real-world situations using concepts from Chapter 1.

B.1. Consider the following scenario:

A university introduces a new campus app. The app's stated purpose is to help students find open study spaces in the library and student union. To do this, the app uses Bluetooth beacons to detect how many phones are near each study area. The app also collects each student's device ID, the times they visit each building, how long they stay, and which floors they frequent. This data is stored on servers managed by the app developer, a private company based in another country.

Using the data lifecycle framework from Section 1.4, trace this data through all seven stages. At which stage(s) do you see the greatest governance risks? Identify at least three specific concerns.

B.2. Section 1.3.2 distinguishes between structured, unstructured, and semi-structured data. For each of the following, classify the data type and explain the governance challenge it presents:

  • (a) A spreadsheet of employee salaries, departments, and hire dates
  • (b) A collection of 10,000 customer service phone call recordings
  • (c) A set of JSON files containing user preferences exported from a social media platform
  • (d) Security camera footage from a retail store
  • (e) A relational database of patient prescription records
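
As a hint for B.2, the governance difference between structured and semi-structured data shows up even in a tiny parsing sketch. The data below is invented for illustration; the point is that a fixed schema lets a policy target a known column, while self-describing records must first be inspected to find out what they contain.

```python
import csv
import io
import json

# Structured: every row conforms to the same schema, so a governance
# control (access rule, retention period) can target a known column.
csv_text = "name,department,salary\nAda,Engineering,95000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON carries its own labels, but fields vary per
# record. A policy like "delete all location data" must first discover
# where location-like keys appear at all.
prefs = [
    json.loads('{"user": "u1", "theme": "dark"}'),
    json.loads('{"user": "u2", "theme": "light", "home_city": "Detroit"}'),
]
users_with_location = [p["user"] for p in prefs if "home_city" in p]

print(rows[0]["salary"], users_with_location)
```

Unstructured data (call recordings, camera footage) pushes this further: there are no labels to inspect, so governance requires analysis, such as transcription or image recognition, before classification can even begin.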

B.3. The "nothing to hide" argument is introduced in Section 1.5.2. Read the following statement carefully:

"I don't mind if companies collect my data. I'm not doing anything wrong, and besides, the personalized recommendations I get make my life easier. If giving up some privacy is the price for free services, that's a trade I'm willing to make."

Identify at least three assumptions embedded in this argument. For each assumption, explain why it may be flawed or incomplete, drawing on concepts from Sections 1.2, 1.3, and 1.5.

B.4. Mira's supervisor classified student GPA data as "sensitive" but campus mental health service usage data as merely "confidential" (Section 1.3.3). Construct an argument for why mental health service data should receive at least the same level of protection as GPA data. In your argument, consider: (a) the potential harm from unauthorized disclosure, (b) the power dynamics involved, and (c) the chilling effect that lower protections might have on students seeking help.

B.5. The VitraMed lifecycle example in Section 1.4.2 ends with Eli's observation that derivative models trained on patient data persist even after the original data is deleted. Analyze this situation from the perspective of three different stakeholders — the patient, the company (VitraMed), and a public health researcher. What does "deletion" mean to each of them? Where do their interests conflict?

B.6. Section 1.2.3 raises the question of whether quantified self tracking is truly voluntary when health insurers or employers offer incentives for participation. Analyze the following scenario:

A mid-size company announces a "Wellness Rewards" program. Employees who wear a company-provided fitness tracker and log at least 8,000 steps per day for 200 days of the year receive a $600 reduction in their annual health insurance premium. Participation is "optional."

Is participation meaningfully voluntary? Identify the pressures — economic, social, and informational — that might influence an employee's decision. What data governance concerns does this program raise?


Part C: Real-World Application Challenges ⭐⭐-⭐⭐⭐

These exercises ask you to investigate your own data environment. Complete them with your actual devices, services, and experiences.

C.1. ⭐⭐ Personal Data Audit. Using the "Day in Data" framework from Section 1.2.1 as a model, conduct your own data audit for a single day. List every device, app, service, and system you interact with from waking to sleeping. For each, identify: (a) what data is likely collected, (b) whether the collection is the primary purpose or data exhaust, (c) who controls that data (the company or organization), and (d) whether you actively consented to the collection. Present your findings in a table. Write a one-paragraph reflection on what surprised you.

C.2. ⭐⭐ Metadata Experiment. Select five emails you have sent in the past week. For each email, list all the metadata you can identify (sender, recipient, timestamp, subject line, IP address if accessible, device used, etc.). Now imagine that someone had access only to this metadata for an entire year of your email activity — no content, just metadata. Write a paragraph describing what patterns, relationships, and inferences they could plausibly draw about your life.
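For the metadata experiment, Python's standard `email` module can pull headers from a raw message without ever touching the body. The message below is fabricated; in practice you would paste in your own exported source (many mail clients offer a "show original" or similar option).

```python
from email import message_from_string

# A fabricated raw email used only to demonstrate header extraction.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Tue, 04 Mar 2025 09:15:00 -0500
Subject: Lunch?
Received: from [192.0.2.17] by mx.example.com

Want to grab lunch at noon?
"""

msg = message_from_string(raw)

# Metadata alone -- no body text needed to see who talks to whom,
# when, about what, and from which network address.
metadata = {
    "from": msg["From"],
    "to": msg["To"],
    "date": msg["Date"],
    "subject": msg["Subject"],
    "received": msg["Received"],
}
for field, value in metadata.items():
    print(f"{field:10}{value}")
```

Run over a year of messages, fields like these are enough to reconstruct social graphs, daily routines, and approximate locations, which is exactly the inference exercise C.2 asks you to reason through.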

C.3. ⭐⭐⭐ Privacy Policy Comparison. Select two apps or services you use regularly (e.g., a social media platform and a food delivery app). Find and read the relevant sections of each service's privacy policy. For each, answer: (a) What categories of data are collected? (b) How is that data shared with third parties? (c) How long is it retained? (d) What rights do you have to access, correct, or delete your data? Write a one-page comparison noting the differences. Which policy is more protective of the user, and why?

C.4. ⭐⭐⭐ Sensitive Data Classification Exercise. Review the sensitive data categories listed in Section 1.3.3. Now think about a specific organization you interact with — your university, employer, bank, or healthcare provider. Make a list of all the data that organization likely holds about you. Classify each item using the categories from Section 1.3.3 (health, financial, biometric, location, etc.) and indicate whether you consider each item's current protection level adequate. Identify at least one item where you believe the protection is insufficient and explain why.


Part D: Synthesis & Critical Thinking ⭐⭐⭐

These questions require you to integrate multiple concepts from Chapter 1 and think beyond the material presented.

D.1. Section 1.5.3 argues that "data systems are social systems" — that they reflect and reinforce existing social structures. Apply this argument to the concept of data exhaust (Section 1.2.2). Consider: Does data exhaust affect all people equally? Think about differences in the types and volumes of data exhaust generated by people of different socioeconomic backgrounds, ages, geographic locations, or immigration statuses. Write a two-paragraph analysis explaining how data exhaust generation can create or deepen social inequalities.

D.2. The chapter presents two contrasting perspectives through Mira and Eli. Mira initially focuses on the benefits of data systems (VitraMed saving lives through predictive analytics), while Eli focuses on the harms (surveillance in his neighborhood without community consent).

Write a synthesis that does not simply "split the difference" but instead proposes a framework for evaluating when data collection is justified and when it is not. Your framework should include at least three criteria. Test your framework against the VitraMed example and the Detroit Smart City sensor example — does it produce different results for the two cases? Why or why not?

D.3. The chapter introduces the data lifecycle as a seven-stage linear model with a feedback loop (Section 1.4.1). Critique this model. What does it capture well? What does it miss? Consider the following challenges: data that is collected but never analyzed; data that is shared before it is processed; data that cannot truly be deleted because it has been memorized by a machine learning model; and data that is created synthetically (generated by AI, not collected from the real world). Propose at least one modification or extension to the lifecycle model that addresses one of these gaps.

D.4. Dr. Adeyemi poses the question: "When someone says 'it's just data,' what are they not telling you?" (Section 1.2.2). Building on all six sections of Chapter 1, write a response to this question in the form of a short essay (300-500 words). Your essay should reference at least four specific concepts from the chapter and explain how the phrase "it's just data" functions as a rhetorical strategy that obscures important realities.


Part E: Research & Extension ⭐⭐⭐⭐

These are open-ended projects for students seeking deeper engagement. Each requires independent research beyond the textbook.

E.1. The AOL Case, Revisited. Section 1.3.1 briefly describes the 2006 AOL search log release, in which "anonymized" data was re-identified. Research this incident in depth. Write a 1,000-word report covering: (a) what AOL released and why, (b) how re-identification occurred, (c) the consequences for the individuals identified and for AOL as a company, (d) what data governance practices, if implemented, could have prevented the incident, and (e) whether modern anonymization techniques would have made a difference. Use at least three sources beyond this textbook.

E.2. Quantified Self in Practice. Section 1.2.3 introduces the quantified self movement and raises questions about agency, accuracy, and data destination. Conduct a small-scale investigation: install a fitness or health tracking app (or use one you already have). Use it for one week, then request a copy of your data from the provider (most services allow data export or subject access requests). Analyze what was collected. Write a report (800-1,200 words) addressing: (a) what data was collected that you expected, (b) what data was collected that surprised you, (c) where your data was sent (check the privacy policy and any data export), and (d) how the experience changed your understanding of the concepts in this chapter.

E.3. Data Governance Across Borders. Section 1.3.1 mentions the EU's GDPR definition of personal data. Research how at least two other jurisdictions (e.g., the United States, China, Brazil, India, or Japan) define "personal data" or its equivalent. In a comparative analysis (600-1,000 words), explain: (a) how the definitions differ, (b) what kinds of data fall into a gray area in one jurisdiction but are clearly protected in another, and (c) what practical challenges these differences create for a global company like VitraMed that operates across borders.


Solutions

Selected solutions are available in appendices/answers-to-selected.md.