Case Study 24-2: The UK Biobank — Genetic Surveillance, Equity, and the Data Commons Problem

DataField.Dev

Background

The UK Biobank is one of the most ambitious epidemiological surveillance enterprises ever undertaken. Between 2006 and 2010, 500,000 participants aged 40-69 were recruited across the United Kingdom. Each participant provided:

- A blood sample (for DNA extraction and biochemical analysis)
- A urine sample
- Detailed health and lifestyle questionnaire responses (covering diet, exercise, alcohol use, smoking, occupational history, medical history, and more)
- Permission to link their records to National Health Service (NHS) electronic health records, allowing passive tracking of health outcomes over decades

Since enrollment, participants' health records have been updated automatically through NHS data linkages. The UK Biobank now tracks hospitalizations, cancer diagnoses, deaths, and, since 2019, accelerometer data from activity monitors worn for one week by 100,000 participants. Genetic sequencing of all 500,000 participants was completed in 2023, making the UK Biobank the world's largest sequenced population cohort.

The resource is available to researchers worldwide through a managed access process: researchers apply for access, describe their research project, and pay a data access fee. More than 40,000 researchers from more than 100 countries have accessed UK Biobank data.

The Consent Architecture

UK Biobank participants consented under a "broad consent" model: they agreed that their samples and data could be used for "future health-related research" without specifying particular studies. Broad consent was justified as necessary to keep the resource flexible: if participants had to be recontacted and re-consented for every study, the practical burden would make the resource unworkable.

Broad consent raises specific concerns in the genetic context:

What participants consented to: "Future health-related research conducted by approved researchers under the governance of the UK Biobank access framework."

What this includes:

- Research into genetic associations with thousands of diseases (heart disease, cancer, Alzheimer's, depression, schizophrenia)
- Research linking genetics to behavioral traits (intelligence, educational attainment, personality)
- Research into complex traits including sexual orientation and political preferences (yes, these studies have been conducted using biobank genetic data)
- Research by pharmaceutical companies developing drugs for commercial sale
- Research by commercial genome sequencing companies
- Research that may yield no health benefit for participants

What participants may not have understood they consented to:

- The possibility that their genetic data could, in principle, be used to identify them even from aggregate published statistics
- That information about their relatives would be partially captured by their genetic contribution
- That pharmaceutical companies might use their data to develop drugs priced beyond their reach
- That their genetic data might be used to study traits they would consider private or stigmatized
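
To give a sense of how much of a relative is "partially captured," the sketch below uses the standard expectation that each additional degree of relatedness halves the average fraction of autosomal DNA shared. The function name and the relationship labels are illustrative choices, not part of any UK Biobank tooling, and the figures are population averages; actual sharing varies between individual pairs.

```python
# Illustrative only: expected average fraction of autosomal DNA shared
# with a relative, by degree of relatedness (0.5 per degree).

def expected_shared_fraction(degree: int) -> float:
    """Average shared fraction for a relative of the given degree."""
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** degree

# Hypothetical labels chosen for this example.
RELATIVES = {
    "parent, child, or full sibling": 1,
    "grandparent, aunt or uncle, half sibling": 2,
    "first cousin": 3,
}

for label, degree in RELATIVES.items():
    print(f"{label}: about {expected_shared_fraction(degree):.1%} shared")
```

On these averages, a single participant's sample carries roughly half of the genetic information of each first-degree relative, none of whom signed a consent form.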

The Re-Identification Risk in Practice

A 2008 paper in PLOS Genetics demonstrated that an attacker who already holds a person's genotype can determine whether that person took part in a genetic study, using only the aggregate statistics (allele frequencies) published in academic papers from that study. This "Homer" attack (named after the lead author, Nils Homer) showed that releasing only aggregate, de-identified genetic data does not provide the privacy protection it appears to provide.
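
The intuition behind this kind of membership inference can be sketched with a small simulation. This is a simplified illustration, not the published method: it sums a per-variant distance comparison (is the person closer to the study's published frequencies than to a reference population's?) and omits the formal hypothesis test. All names, cohort sizes, and frequency ranges are assumptions made for the example, and the data is synthetic.

```python
import random

def genotype(p, rng):
    """Draw two alleles at frequency p; return the person's allele
    frequency at this variant (0, 0.5, or 1)."""
    return ((rng.random() < p) + (rng.random() < p)) / 2

def sample_people(freqs, n, rng):
    """Simulate n people, each a list of per-variant allele frequencies."""
    return [[genotype(p, rng) for p in freqs] for _ in range(n)]

def pooled_frequencies(people):
    """Aggregate statistics: mean allele frequency per variant."""
    n = len(people)
    return [sum(column) / n for column in zip(*people)]

def membership_score(person, reference, study):
    """Sum over variants of |y - reference| - |y - study|.
    Clearly positive values suggest the person is in the study pool."""
    return sum(abs(y - r) - abs(y - s)
               for y, r, s in zip(person, reference, study))

rng = random.Random(0)                     # fixed seed for repeatability
true_freqs = [rng.uniform(0.1, 0.9) for _ in range(10_000)]

study_people = sample_people(true_freqs, 50, rng)      # hypothetical study
reference_people = sample_people(true_freqs, 50, rng)  # hypothetical reference panel

study_stats = pooled_frequencies(study_people)         # what a paper might publish
reference_stats = pooled_frequencies(reference_people)

member = study_people[0]
outsider = sample_people(true_freqs, 1, rng)[0]

member_score = membership_score(member, reference_stats, study_stats)
outsider_score = membership_score(outsider, reference_stats, study_stats)
print("member:", round(member_score, 1), "outsider:", round(outsider_score, 1))
```

Each variant contributes only a tiny signal, because a member shifts the study's pooled frequency slightly toward their own genotype. Summed over thousands of variants, that shift becomes detectable, which is why fine-grained summary statistics are risky to publish.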

The response was swift: major biobanks and academic journals stopped publishing fine-grained genetic summary statistics in open access. The National Institutes of Health moved genome-wide association study (GWAS) summary data from open access to controlled access databases.

This response was effective but limited. Researchers with access to controlled-access databases still have substantial data. As sequencing costs have fallen and datasets have proliferated, the re-identification problem has intensified: more genetic data in more hands means more opportunities for unauthorized linking of datasets.

The UK Biobank's governance includes restrictions on attempting to re-identify participants. But governance restrictions apply only to approved users who agreed to them — they do not prevent unauthorized actors from attempting to use publicly available information about UK Biobank participants in conjunction with other genetic databases to achieve re-identification.

The Diversity Problem

The UK Biobank's enrollment between 2006 and 2010 produced a dataset that is approximately 94% white British — reflecting the demographics of participants who volunteered and were accessible, but not the demographics of the UK population. This representational problem has profound implications.

The scientific problem: Most genetic research has been conducted predominantly in populations of European ancestry. Genetic variants associated with disease in European populations may not have the same associations in populations of African, South Asian, East Asian, or mixed ancestry. Drug targets identified through predominantly white datasets may not work equally well — or may have different safety profiles — in more diverse populations.
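
A back-of-the-envelope sketch, using only the figures stated above (500,000 participants, approximately 94% white British), shows how thin the remaining data is. The variable names are illustrative. The precision comparison assumes a simple estimate whose standard error scales with 1/sqrt(n) and has equal variance in both groups, which real genetic analyses only approximate.

```python
import math

cohort_size = 500_000
share_white_british = 0.94        # approximate figure from the text

n_majority = round(cohort_size * share_white_british)
n_other = cohort_size - n_majority  # every other group, pooled together

# Under the 1/sqrt(n) assumption, how much wider is the uncertainty
# for an estimate made in the pooled non-majority group?
se_ratio = math.sqrt(n_majority / n_other)

print(n_majority)            # 470000
print(n_other)               # 30000
print(round(se_ratio, 2))    # 3.96
```

Even with all other ancestries pooled into one group, estimates would be roughly four times less precise under these assumptions. Any single ancestry group within that 30,000 is smaller still.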

The equity problem: When genetic research is conducted predominantly on European-ancestry participants, the medical benefits (improved disease understanding, better-targeted drugs, more accurate diagnostic tools) flow disproportionately to European-ancestry populations. Communities that contributed less to the data receive fewer benefits.

The remediation challenge: UK Biobank has not been able to re-recruit to achieve demographic representativeness — the 500,000 participants were enrolled in a specific time window, and the cohort cannot simply be expanded with new participants in a way that changes its demographic composition. Instead, the field has developed separate initiatives targeting diversity: the All of Us Research Program in the United States explicitly targeted diverse participation; several African biobank initiatives have been established; the H3Africa (Human Heredity and Health in Africa) consortium has built African genomic research capacity.

But these separate initiatives create a segregated research infrastructure rather than an integrated one — different datasets for different populations, raising questions about whether the scientific products will be equally applicable across populations or will perpetuate the historical pattern of biomedical research centering European experiences.

The Commercial Appropriation Problem

UK Biobank charges access fees to commercial researchers — including pharmaceutical companies — on the theory that these fees support the resource's sustainability. Major pharmaceutical companies including GlaxoSmithKline, Regeneron, AstraZeneca, and Amgen have accessed UK Biobank data for drug discovery purposes.

Participants consented to their data being used for health-related research. They did not specifically consent to their data enriching pharmaceutical company shareholders. The development of drugs based on UK Biobank data may produce commercial products priced far beyond the reach of typical UK residents — including the participants whose data made the discoveries possible.

This is a form of data exploitation that broad consent permits and that participants may not have fully anticipated: contributing to a commons from which commercial interests extract private value. The UK Biobank's governance attempts to address this through benefit-sharing requirements (pharmaceutical companies are expected to publish their findings in ways that benefit the scientific community), but direct benefit-sharing with participants is not part of the framework.

Discussion Questions

1. UK Biobank participants consented to "future health-related research," which included research into behavioral and psychological traits, research by pharmaceutical companies, and research that may produce commercially valuable products. Was this consent adequately informed? What specifically would you require to be disclosed before participants could meaningfully consent to contributing to a biobank?
2. The chapter notes that genetic data creates third-party exposure: your relatives are partially characterized by your genome. UK Biobank participants could not consent on behalf of their relatives. How should biobank governance handle this unavoidable feature of genetic data?
3. UK Biobank's 94% white British enrollment was not a deliberate policy choice but a consequence of who volunteered and was accessible. Is the resulting representational inequity morally problematic, even though no one intended it? Who bears responsibility for remediation, and what should remediation look like?
4. Pharmaceutical companies use UK Biobank data to discover drug targets, then develop and patent drugs that may be priced beyond the reach of the participants whose data contributed to the discovery. Is this unjust? If so, what specific mechanisms would you propose to address it? Consider: a) participant benefit-sharing (cash payments or drug discounts); b) open licensing requirements for discoveries based on public data; c) caps on drug pricing for products developed using publicly subsidized data.
5. The chapter's discussion of re-identification concludes that governance restrictions (prohibiting re-identification attempts) cannot prevent unauthorized actors from attempting re-identification. Does this mean that genetic biobanking is inherently incompatible with adequate privacy protection? Or are there technical or governance measures that could reduce the re-identification risk to acceptable levels? What would "acceptable" mean in this context?

This case study connects to Chapter 7 (biometrics and bodily data), Chapter 24's main text on biobanks and genetic surveillance, and Chapter 31 (legal frameworks for health data including GDPR provisions on sensitive data). For additional context, see the UK Biobank's own ethics documentation at ukbiobank.ac.uk.
