By Meg Jacobs (Chief Privacy Officer and Compliance Officer, Komodo Health) and Jonah Leshin (Head of Research, Privacy Hub)
Connecting de-identified real-world data holds tremendous potential, but that potential can only be realized with thoughtful approaches to patient privacy. Without a careful plan for linking de-identified patient records, researchers may lose the information necessary for impactful analysis or, at the other end of the spectrum, expose patient records to unwarranted privacy risk.
Establish which data elements are essential for your use case
When using de-identified patient-level data for research use cases, it is critical to understand up front which data points are the most important to keep in order to achieve the research goals. The expert conducting the de-identification determination can work closely with the research team to ensure the outcome is aligned with the goals of the project. De-identification typically requires a balanced approach. For example, if a research use case requires more detailed mortality data, a solution could be designed to retain those fields by offsetting the risk in other fields, such as by merging small geographic areas into larger ones.
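As a rough illustration of the geographic trade-off described above, the following Python sketch merges sparsely populated areas into a larger parent region. The function name, the `area` field, the `parent_of` mapping, and the `min_count` threshold are all hypothetical choices made for this example; an actual de-identification determination would set its own rules and thresholds.

```python
from collections import Counter


def merge_small_areas(records, parent_of, min_count):
    """Coarsen geography wherever a small area would stand out.

    records   : list of dicts, each with an "area" key (hypothetical schema)
    parent_of : dict mapping each area to a larger parent region (assumed)
    min_count : illustrative threshold, not a regulatory value
    """
    counts = Counter(r["area"] for r in records)
    # Parent regions with at least one child area below the threshold.
    parents_to_merge = {parent_of[a] for a, n in counts.items() if n < min_count}
    merged = []
    for r in records:
        parent = parent_of[r["area"]]
        # Replace every area under a flagged parent so the small area
        # is absorbed into a larger group rather than relabeled alone.
        area = parent if parent in parents_to_merge else r["area"]
        merged.append({**r, "area": area})
    return merged


# Made-up example data: area "022" has a single record.
example_records = [
    {"id": 1, "area": "021"},
    {"id": 2, "area": "021"},
    {"id": 3, "area": "021"},
    {"id": 4, "area": "022"},
    {"id": 5, "area": "100"},
    {"id": 6, "area": "100"},
]
example_parents = {"021": "Northeast-A", "022": "Northeast-A", "100": "100-region"}

coarsened = merge_small_areas(example_records, example_parents, min_count=2)
# "021" and "022" are merged into "Northeast-A" (4 records); "100" keeps its 2.
print(Counter(r["area"] for r in coarsened))
```

Coarsening geography in this way lowers the risk contributed by the location field, which is what may leave room to retain more detail in another field that matters to the research question.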
Certain privacy considerations may also be heightened in the context of small patient populations, such as a group of patients enrolled in a rare disease clinical trial. Small patient populations are more sensitive to variance as more records are added to the dataset over time. These smaller populations should be monitored more closely to determine if the cohort characteristics change over time, and if so, the appropriateness of the data operations being applied should be re-evaluated to ensure the dataset remains de-identified.
Investing in privacy implementation pays dividends
Translating the aforementioned considerations into successful implementation requires an ongoing commitment to education and policy buy-in across an organization. From engineering to business development to executive leadership, all facets of an organization play a role in compliance and governance. They must each understand basic ground rules and know when to reach out to in-house or external privacy experts. Such an understanding not only protects against violations of privacy; it empowers individuals to be decision makers, thereby accelerating processes around research planning, data preparation, and data linkage.
Without proper education and policy, an organization (and, of course, a patient) may be vulnerable to a range of unintended consequences. One common misunderstanding is the assumption that joining two separate de-identified datasets means that the newly joined dataset is automatically de-identified, when in fact it requires its own privacy evaluation. This evaluation is necessary because, while two sources of patient information may individually pose very small re-identification risk, that risk may increase when they are used in combination.
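To make the point about combined risk concrete, here is a minimal Python sketch using the size of the smallest group of records that share the same attribute values, a simple k-anonymity-style proxy. It is not a substitute for a formal privacy evaluation. The field names, the `token` linking key, and the data are invented for illustration.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest set of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def join_on_token(left, right):
    """Inner join of two record lists on a shared, hypothetical "token" key."""
    right_by_token = {row["token"]: row for row in right}
    return [
        {**row, **right_by_token[row["token"]]}
        for row in left
        if row["token"] in right_by_token
    ]


# Two made-up de-identified sources covering the same four patients.
source_a = [
    {"token": "t1", "age_band": "40-49", "region": "North"},
    {"token": "t2", "age_band": "40-49", "region": "North"},
    {"token": "t3", "age_band": "50-59", "region": "South"},
    {"token": "t4", "age_band": "50-59", "region": "South"},
]
source_b = [
    {"token": "t1", "sex": "F"},
    {"token": "t2", "sex": "M"},
    {"token": "t3", "sex": "F"},
    {"token": "t4", "sex": "M"},
]

joined = join_on_token(source_a, source_b)

print(smallest_group(source_a, ["age_band", "region"]))       # 2
print(smallest_group(source_b, ["sex"]))                       # 2
print(smallest_group(joined, ["age_band", "region", "sex"]))  # 1
```

In this toy example every record in each source shares its attribute values with at least one other record, yet after the join every record is unique on the combined attributes, which is why the linked dataset calls for its own evaluation.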
Another critical aspect of privacy preservation that is often overlooked is change management. The dynamic nature of the health data ecosystem requires continued organizational diligence. Over time, datasets are updated and new data enters the public domain. It is imperative to ensure that the operations and controls applied to the first version of a dataset are continuously maintained and also applied to subsequent additions to it. With respect to publicly available data, keeping abreast of major changes is critical because newly public data may be linkable with a de-identified dataset, which may increase re-identification risk.
Evolving regulation and technology warrant close monitoring, and hold opportunity
Regulation is also evolving over time, with the volume of new legislation that impacts patient privacy accelerating at an unprecedented rate. At the time of writing, more than 10 states have enacted their own comprehensive consumer privacy legislation. While these laws all contain carve-outs for data that falls under the domain of HIPAA, they have implications for certain research initiatives. For example, social determinants of health and consumer data are critical for studies that seek to understand health data inequities across different socioeconomic and ethnic groups. Researchers who work with this data must ensure that its use is compliant with newly applicable laws. Furthermore, to the extent that recent FTC actions in the context of health data breaches serve as precedent, enforcement of such legislation may not be far behind.
Beyond regulation and the question of what one must do, ethics and the question of what one should do ought to play a central role in decision making, given the potential consequences for patient privacy as well as organizational reputation.
In the digital era, we have seen tremendous research gains from structured de-identified data, and accompanying privacy best practices have been established. At the same time, more complex data types and technologies offer new opportunities for insights, and also require us as an industry to think through the appropriate privacy considerations. In particular, there are valuable insights to be gained from unstructured text, imaging, and genomics data. Each of these data types presents its own challenges and opportunities, ranging from individually identifying images (such as tattoos) to privacy implications of different genetic mutation types.
With regard to technology, privacy-preserving tools like synthetic data and homomorphic encryption push us towards the efficient frontier of the “privacy versus utility” tradeoff through the privacy protection they offer. Meanwhile, large language models push us towards this frontier from the utility side. Given the potential of these tools, we need to find a way to establish appropriate privacy paradigms for them.
The power of de-identified real-world data in healthcare research requires a thoughtful, comprehensive, and adaptable approach to patient privacy
Achieving meaningful insights while preserving privacy requires a clear understanding of the connection between broad research goals and concrete data considerations, along with a willingness to adapt to and take advantage of an ever-evolving health data landscape.