Privacy Frontiers in Health Data: Genomics (Part 1)


The modern landscape of health and clinical information has experienced some tectonic shifts in the quarter century since its governing foundation stone, the Health Insurance Portability and Accountability Act of 1996 (HIPAA), was laid. Nowhere is this more apparent than in efforts to balance the see-saw of data utility and patient privacy against the tremors of emerging technologies and expanding data reservoirs. For experts in disclosure risk analysis, this challenge is at its most vigorous, yet also its subtlest, when faced with one seismic development in particular: the advent of genetic and genomic data in healthcare.

The genomic era 

The revolutionary potential of clinical genetic research was emphatically declared in 2003 with the completed sequencing of ~92% of the human genome by the National Institutes of Health’s Human Genome Project (Guttmacher & Collins (2003), Nurk et al. (2022)). The following two decades tell a story of grit and grain as clinical science worked to leverage the opportunities of the genomic era into tangible predictive, diagnostic and therapeutic techniques. We have only to glance at the progression of oncology research to be impressed: a world in which drug development can be directed by the association of critical genomic variations with specific tumors is a world in which many lives are profoundly improved.

Of course, progress is rarely a smooth, unobstructed path. If the past twenty years crystallized many of the promises of the genomic era, they also illuminated the scale of its challenges. To the extent that genetic sequencing is a set of keys, how do they fit into some extremely complicated locks? Where not one but multiple genetic alterations are at play behind many conditions? Where a multitude of lifestyle, age and demographic considerations beyond genetic drivers are relevant to disease? Such questions, which have long loomed large, are increasingly being tackled thanks to another revolutionary turn in the scientific world: the blossoming of data science, artificial intelligence and machine learning. The ability to use these tools to extract meaningful insights and patterns from vast amounts of clinical data, both genetic sequencing and other health data, is tantalizing and opens a new path of progress for the field in the coming years.

Questions for privacy experts

With both the past and future filled with promise for genetic data, the present time for commercial healthcare data constitutes a curious fulcrum for privacy experts. Over the last few years, we have seen the number of datasets with a significant component of genetic sequencing or test data start to tick up rapidly, with every sign that this trend will accelerate in tandem with further innovations in the field. This raises some important questions:

  • In principle, how might genetic data risk the re-identification of an individual in an ostensibly de-identified dataset?
  • Do we have precedent or demonstration of such risks as reference?
  • What does current legislation require?
  • Given the above, what should an underlying framework for assessing and mitigating disclosure risk from genetic data look like?
  • What research is still needed to more precisely target this risk, and so preserve greater privacy and utility?

At a high level, we will consider each of these questions in Parts 1 and 2 of this article.

Where is the risk?

Let us suppose that some de-identification steps under HIPAA have been applied to a health dataset, with the intention that personally identifying information (PII) has been appropriately removed or modified to prevent individuals within that dataset from being re-identified. Beyond removing or encrypting directly identifying information like names, this typically takes the form of redacting or aggregating demographic and geographic data so that the group of individuals sharing the same identifiers is sufficiently large that it effectively obscures their identities. 
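
To make the idea of 'sufficiently large' groups concrete, the following is a minimal sketch that counts how many records share each combination of indirect identifiers. The records, the field choices (age band, sex, three-digit ZIP prefix) and the threshold `k` are all hypothetical and chosen only for illustration; this is not a statement of what HIPAA requires.

```python
from collections import Counter

# Hypothetical toy records: (age band, sex, three-digit ZIP prefix).
# All values are invented for illustration.
records = [
    ("30-39", "F", "021"),
    ("30-39", "F", "021"),
    ("30-39", "F", "021"),
    ("40-49", "M", "021"),
    ("40-49", "M", "100"),
]


def group_sizes(rows):
    """Count how many records share each combination of identifiers."""
    return Counter(rows)


def smallest_group(rows):
    """Size of the smallest group; small values suggest higher risk."""
    return min(group_sizes(rows).values())


def records_below_threshold(rows, k):
    """Records whose group has fewer than k members (k is an assumed threshold)."""
    sizes = group_sizes(rows)
    return [row for row in rows if sizes[row] < k]


print(smallest_group(records))
print(records_below_threshold(records, k=2))
```

In this toy data the two records for men aged 40-49 each sit alone in their group, so further aggregation (for example, dropping the ZIP prefix) would be needed before they blend into a larger crowd.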

If genetic data is also present in this dataset, it is important to appreciate that the scope of such information encompasses both ‘direct genetic data’ emerging from sequencing, and ‘indirect genetic data’ which may include biochemical tests, structural variation analysis results and so on. For a privacy expert, the key qualification for genetic data, either direct or indirect, is whether there is the potential for revealing significant information about an individual’s genetic sequence.    

Two main classes of disclosure risk are apparent from genetic data:

  1. Risk from combining ostensibly de-identified datasets
    First, let us consider that two (or more) de-identified datasets may contain different sets of information about an individual which may each have a ‘very small’ risk of identifying that individual under HIPAA. However, in combination, information from these datasets may carry a high risk of re-identification. 

    Now suppose that a highly specific piece of genetic information (such as a long sequence) is mutually present in these datasets; it may be possible to ascertain that the records containing it belong to the same individual. The combined information on this individual then has the potential to be high-risk because the linkage has effectively created a new combined dataset.
  2. Risk from linkage to identified data
    This often manifests as the risk of genetic information itself being linked to publicly accessible (or at least reasonably available) data. This may be identified medical data, the risk of which is not dissimilar to that posed by the proliferation of medical record numbers and other familiar data. But it also covers a host of non-medical endeavors which collect copious amounts of genetic data. Some of these, like genealogy databases (e.g., Ancestry.com and FamilyTreeDNA.com), are both growing in popularity and contain highly identifying information like surnames.
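
The first class of risk can be sketched in a few lines. The example below is a toy illustration under stated assumptions: both datasets, their field names and the short `marker` strings standing in for a highly specific genetic sequence are invented, and exact string matching is a simplification of how real genetic records would be compared.

```python
# Hypothetical de-identified datasets; every field name and value is invented.
dataset_a = [
    {"record": "A1", "marker": "ACGTTGCAAGGCTTAC", "age_band": "40-49", "state": "MA"},
    {"record": "A2", "marker": "TTGACCGTAGGCATCA", "age_band": "30-39", "state": "NY"},
]
dataset_b = [
    {"record": "B7", "marker": "ACGTTGCAAGGCTTAC", "diagnosis": "rare condition X", "admit_year": 2019},
    {"record": "B9", "marker": "GGCATTACGATCCGTA", "diagnosis": "common condition Y", "admit_year": 2020},
]


def link_on_marker(left, right, key="marker"):
    """Join two datasets wherever the genetic marker matches exactly."""
    # Index the right-hand dataset by marker for quick lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    linked = []
    for row in left:
        for match in index.get(row[key], []):
            merged = {**row, **match}
            # Keep both source record labels so the linkage stays visible.
            merged["record"] = (row["record"], match["record"])
            linked.append(merged)
    return linked


linked = link_on_marker(dataset_a, dataset_b)
print(linked)
```

Here records A1 and B7 share a marker, so the linked record carries age band, state, diagnosis and admission year together: more detail about one person than either source held on its own.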

The second class of risk, in particular, reflects a reality unique to genetic data: it propagates through many spheres of application and use, and the scope of that propagation is both dynamic and hard to assess fully. Indeed, the historical perspective on genetic data disclosure risk has often been influenced most dramatically by research into the possibility of linkage to public data. However, both classes of risk are significant and require the careful assessment framework that Part 2 of this article discusses.

How significant has this risk been historically?

Early days

In the early and mid-2000s, genetic sequencing grew in sophistication (next-generation sequencing arriving in 2006) but it was not widely considered that the data significantly increased disclosure risk. This was not a universal view: works like Malin & Sweeney (2004) and Lin et al. (2004) explored the idea of elevated risk through linkage to publicly available records, but there was comparatively little focus or consensus on this topic.

It was not until the later 2000s and early 2010s that the scale of impact that genetic data could have on disclosure risk was understood and recognized. This period was marked by several prominent shifts in both the scientific and legislative privacy spheres, often progressing simultaneously. These landmark events reconfigured the landscape and set the stage for the current approaches to genetic data risk that we take a decade later.

One such shift was precipitated by the Genetic Information Nondiscrimination Act of 2008 (GINA), which codified genetic information as health information for the first time, albeit with respect to employment and insurance discrimination legislation. This had a significant impact and would help lead to the formal inclusion of genetic data as protected health information (PHI) in the most recent major contribution to HIPAA, the Omnibus Rule (2013). The full implications of this inclusion merit involved discussion, to which we will return in Part 2 of this article.

The Erlich group study

Meanwhile, in the scientific domain, the prevailing wisdom that genetic data did not greatly increase disclosure risk was being met with a new and robust challenge from Yaniv Erlich and his research group. The question they asked was whether it would be possible to profile short tandem repeats (STRs) on the Y chromosome and relate these via targeted queries to genetic data stored within publicly available genealogy databases. As popular ways to reconstruct ancestry trees, these databases naturally contained surnames. Such information can directly identify individuals and must be redacted from datasets under the HIPAA Privacy Rule’s Safe Harbor prescription for de-identification.
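
The general idea of relating a Y-chromosome STR profile to surname-bearing records can be illustrated with a toy sketch. This is not the Erlich group's methodology: the database entries, the placeholder surnames, the repeat counts and the `max_mismatches` tolerance are all invented, the marker names serve only as labels, and the sketch assumes every profile has been typed at the same markers.

```python
# Hypothetical genealogy entries: a surname plus repeat counts at Y-STR markers.
# Surnames and repeat counts are invented for illustration.
genealogy_db = [
    {"surname": "Surname_A", "strs": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}},
    {"surname": "Surname_B", "strs": {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}},
    {"surname": "Surname_A", "strs": {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}},
]


def mismatches(profile, entry):
    """Count shared markers whose repeat counts differ."""
    shared = profile.keys() & entry.keys()
    return sum(profile[marker] != entry[marker] for marker in shared)


def candidate_surnames(profile, db, max_mismatches=1):
    """Surnames of entries within the mismatch tolerance, closest first."""
    scored = sorted(
        (mismatches(profile, entry["strs"]), entry["surname"]) for entry in db
    )
    ordered = []
    for score, surname in scored:
        if score <= max_mismatches and surname not in ordered:
            ordered.append(surname)
    return ordered


# A profile taken from an ostensibly de-identified record (invented values).
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(candidate_surnames(query, genealogy_db))
```

Because Y-chromosome markers and surnames tend to be inherited together down the paternal line, near matches from relatives can point to the same surname, which is why a small mismatch tolerance is allowed in this sketch.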

The Erlich group focused their efforts on an ostensibly de-identified sequencing dataset containing ~135,000 individuals, and demonstrated that the answer to their question was a resounding ‘yes’. Re-identification through surname was shown to be eminently feasible – to the tune of 12% of members of the sequencing dataset (Gymrek et al (2013)). The scale of this conclusion is hard to overstate; beyond the individuals within their study, they estimated that several million males in the United States would be vulnerable to identification simply through their familial relationship to those revealed directly. 

Before alarm bells ring too loudly for readers, it should be clarified that neither the names of individuals nor precise details of the methodology that might have allowed it to be replicated directly were revealed in the Erlich study. However, the veracity of the results was in little doubt, and they were verified explicitly for a small sample of identified donors to the 1000 Genomes Project.

Increasing clinical and academic attention

Where alarm was rightly taken was within the genetic clinical data community. Further studies like Sweeney et al. (2013), Naveed et al. (2015) and Bonomi et al. (2020) delved deeper into conceptualizing genomic data privacy as a complex problem requiring rigorous solutions. Meanwhile, multiple initiatives involving genetic data transfers were swiftly subjected to tight access controls and security protocols by bodies like the National Institutes of Health. A wave of research was conducted, and continues, into sophisticated encryption methods like honey encryption (which produces a plausible-looking but meaningless result when decrypted with the wrong key) that could be applied to genetic sequences (Huang et al. (2015)). These are designed to maximize the utility of genetic data across the research community while still safeguarding genetic information from attack. However, against this backdrop of progress towards optimized security measures, privacy experts must still ask themselves two questions: what does it take to consider genetic data de-identified under HIPAA? And how should this work in practice? We will tackle both of these questions in Part 2 of this article.


Thank you to David Copeland, PhD, for significant contributions to this article, to Jonah Leshin, PhD, Rebecca Slisz, Kyle McLean, PhD, Ben Thackray, PhD, and Adja Touré, PhD, for their feedback, and to James Gow, David Copeland, PhD, Elaine Mitchell, PhD, and Adja Touré, PhD, for their foundational work upon which this article draws.


Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nat Genet. 2020 Jul;52(7):646-654. doi: 10.1038/s41588-020-0651-0. Epub 2020 Jun 29. PMID: 32601475; PMCID: PMC7761157.

Guttmacher AE, Collins FS. Welcome to the genomic era. N Engl J Med. 2003 Sep 4;349(10):996-8. doi: 10.1056/NEJMe038132. PMID: 12954750.

Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013 Jan 18;339(6117):321-4. doi: 10.1126/science.1229566. PMID: 23329047.

Huang Z, Ayday E, Fellay J, Hubaux JP, Juels A. GenoGuard: Protecting genomic data against brute-force attacks. 2015 IEEE Symposium on Security and Privacy. 2015:447-462. doi: 10.1109/SP.2015.34.

Lin Z, Owen AB, Altman RB. Genomic research and human subject privacy. Science. 2004 Jul 9;305(5681):183. doi: 10.1126/science.1095019.

Malin B, Sweeney L. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J Biomed Inform. 2004 Jun;37(3):179-92. doi: 10.1016/j.jbi.2004.04.005. PMID: 15196482.

Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux JP, Malin BA, Wang X. Privacy in the Genomic Era. ACM Comput Surv. 2015 Sep;48(1):6. doi: 10.1145/2767007. PMID: 26640318; PMCID: PMC4666540.

Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53.

Sweeney L, Abu A, Winn J. Identifying participants in the Personal Genome Project by name. Data Privacy Lab, IQSS, Harvard University. 2013. White paper.