19 APRIL 2022

What Does It Mean to “De-identify” Data?

First in a three-part series: “Patient privacy and healthcare data exchange: What privacy and compliance officers need to know to de-identify patient data and stay HIPAA compliant.”

Privacy and compliance officers across the healthcare ecosystem today are part of an exciting new landscape of connecting data to improve patient outcomes. Sharing patient health data can improve care continuity, accelerate clinical research and help estimate and manage patient health risk factors. 

One major challenge is ensuring that data sharing with third parties is done in a manner that protects patient privacy at every step. 

In this three-part series, we tackle several important questions facing privacy and compliance officers. We start with how and when an organization’s data is considered “de-identified” under the 1996 Health Insurance Portability and Accountability Act (HIPAA).

What does it mean to “de-identify” data?

One way to protect patient privacy is to remove identifying information from the health data in question. This can include removing the patient’s name, address, date of birth and any other information that could enable their identity to become known. 

A first step is to “tokenize” data – the process of removing or modifying personally identifying information (PII) and creating anonymized and encrypted records that can be aggregated for research and analytics purposes. A strong tokenization scheme must ensure that neither reverse engineering a token nor correlating it with other available identifying information will risk re-identifying any given patient.
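To make the idea of tokenization concrete, here is a minimal Python sketch. It assumes one common approach, a keyed hash (HMAC-SHA256) over normalized identifying fields; this is an illustration, not the scheme of any particular vendor or a method HIPAA prescribes. The field names, the `tokenize` helper and the example key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would be generated
# and stored in a key management system, never hard-coded.
SECRET_KEY = b"example-key-do-not-use"

def tokenize(record: dict, identifying_fields: list, key: bytes = SECRET_KEY) -> dict:
    """Replace the identifying fields of a record with one keyed-hash token."""
    # Normalize values so trivially different spellings map to the same token.
    normalized = "|".join(
        str(record.get(field, "")).strip().lower() for field in identifying_fields
    )
    token = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only the non-identifying fields, then attach the token.
    out = {k: v for k, v in record.items() if k not in identifying_fields}
    out["token"] = token
    return out

visit_1 = {"name": "Jane Doe", "date_of_birth": "1980-01-02", "diagnosis": "E11.9"}
visit_2 = {"name": " jane doe ", "date_of_birth": "1980-01-02", "diagnosis": "I10"}

fields = ["name", "date_of_birth"]
tokenized_1 = tokenize(visit_1, fields)
tokenized_2 = tokenize(visit_2, fields)
```

In this sketch, both visits receive the same token, so they can be linked for analysis without the name or date of birth appearing in the output. Because the hash is keyed, someone without the key cannot rebuild tokens from a list of known names. The remaining fields can still carry re-identification risk, which is why tokenization alone does not make data de-identified.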

A common misconception, even among those who are familiar with HIPAA, is that “tokenization” equals “de-identification.”  While tokenization is a critical step, it is not enough under HIPAA for the information to be considered de-identified. 

There are only two ways to render data “de-identified” under HIPAA: Safe Harbor and Expert Determination.

Safe Harbor 

Under the Safe Harbor method, a predetermined set of 18 types of identifiers must be removed from a dataset. Safe Harbor is a highly prescriptive method that protects patient privacy. However, it also greatly reduces the utility of the remaining dataset for research purposes. A straightforward example is that removing service dates, such as admission and discharge dates (Safe Harbor allows only the year to remain), means that information that could help understand disease progression is no longer available in a dataset.
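The loss of utility can be seen in a small Python sketch. It covers only an illustrative subset of the 18 identifier types, uses hypothetical field names, and conservatively drops the date of birth entirely, so it should be read as a simplified picture of the idea rather than a complete Safe Harbor implementation.

```python
from datetime import date

# Illustrative subset only; the Safe Harbor list covers 18 identifier types.
# date_of_birth is dropped outright here as a conservative simplification.
FIELDS_TO_REMOVE = {"name", "street_address", "phone", "email", "date_of_birth"}
SERVICE_DATE_FIELDS = {"admission_date", "discharge_date"}

def apply_safe_harbor_subset(record: dict) -> dict:
    """Drop direct identifiers and reduce service dates to the year."""
    out = {}
    for field, value in record.items():
        if field in FIELDS_TO_REMOVE:
            continue  # removed entirely
        if field in SERVICE_DATE_FIELDS and isinstance(value, date):
            # Keep only the year, e.g. admission_date -> admission_year.
            out[field.replace("_date", "_year")] = value.year
        else:
            out[field] = value
    return out

record = {
    "name": "Jane Doe",
    "date_of_birth": date(1980, 1, 2),
    "admission_date": date(2021, 3, 2),
    "discharge_date": date(2021, 3, 9),
    "diagnosis": "E11.9",
}
redacted = apply_safe_harbor_subset(record)
```

Here the original record shows a seven-day stay, while the redacted record shows only that admission and discharge both fell in 2021. The length of stay, and any finer-grained timeline, can no longer be derived.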

Expert Determination

The second option is conducting an Expert Determination, which mandates human expert review. In this scenario, a tokenized dataset derived from PII is analyzed by a human expert with deep domain expertise in statistics and data science. The expert performs a statistical analysis of the tokenized dataset to determine whether the risk of re-identification is very small or whether further “remediation” or data removal is needed.

How do I choose?

If both Safe Harbor and Expert Determination can provide assurance that a dataset is legally de-identified, how does one decide which method to use?

Expert determination is the method preferred by most organizations looking to make their data available for further research and analysis. The primary reason is that expert determination provides a great deal of flexibility in preserving the utility of a dataset without compromising patient privacy.

For example, for a rare disease with perhaps just a few thousand diagnosed patients, the combination of diagnosis, full zip code and a few other data elements may raise the risk that someone could figure out a patient’s identity or uniquely recognize them in a dataset.

In such cases, the expert may recommend redaction, modification or removal of additional data elements in the existing tokenized dataset to further reduce risk of re-identification. These recommendations are typically documented in a written expert determination report, otherwise known as a certification. Organizations must then implement these recommendations as remediations to their existing tokenized data. Once this step is complete, the remediated dataset must receive a final review and “certification” by the human expert.
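The rare-disease example can be illustrated with a simple group-size count in Python. This is a toy sketch and not how an expert determination is actually performed: it counts how many records share each combination of potentially identifying fields, flags the small groups, and then applies one possible remediation (truncating zip codes). The threshold `k`, the field names and both helper functions are hypothetical.

```python
from collections import Counter

def small_groups(records: list, quasi_identifiers: list, k: int) -> dict:
    """Return field-value combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

def generalize_zip(records: list, digits: int = 3) -> list:
    """Return copies of the records with zip codes cut to their leading digits."""
    return [{**record, "zip": record["zip"][:digits]} for record in records]

# Four patients with the same rare diagnosis, each in a different zip code.
records = [
    {"diagnosis": "rare-x", "zip": zip_code}
    for zip_code in ["02139", "02140", "02141", "02142"]
]

# k=3 is an arbitrary illustrative threshold, not a HIPAA requirement.
before = small_groups(records, ["diagnosis", "zip"], k=3)
after = small_groups(generalize_zip(records), ["diagnosis", "zip"], k=3)
```

Before remediation, every patient is alone in their diagnosis and zip code group, so all four combinations are flagged. After the zip codes are generalized, the four records fall into a single group and nothing is flagged, at the cost of less precise geography. An expert weighs this kind of trade-off using far more rigorous statistical methods.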


The journey to making data shareable starts with ensuring data is truly de-identified under HIPAA. Tokenization is the first step, followed by one of the two HIPAA-mandated methods: Safe Harbor or expert determination.

The choice depends on the balance between business needs, research objectives and, most importantly, patient privacy protection.

The process of de-identification can be challenging, laborious and time-intensive for organizations to navigate and often takes months to complete. But it doesn’t have to be – if you have the right expertise. In our next installment, we explore expert determination in greater detail and how technology can complement and accelerate the work of human privacy experts to bring transparency and speed to a difficult but critical part of healthcare data exchange.

This is the first in a three-part series exploring the three key things every healthcare privacy and compliance officer needs to know about de-identifying patient data.