18 November 2020

Why Privacy-Preserving Record Linkage is Having a Moment: Interview with Abel Kho

This past weekend, the Donald A.B. Lindberg Award for Innovation in Informatics presented by the American Medical Informatics Association (AMIA) was awarded to Abel Kho for his pioneering work and research applying privacy-preserving record linkage (PPRL) methods to healthcare.

Abel Kho, MD, MS, FACMI

Abel is a practicing physician and faculty member at Northwestern University’s Feinberg School of Medicine, as well as the Director for the university’s Center for Health Information Partnerships and Director for the Institute for Augmented Intelligence in Medicine.

Abel, first of all — congratulations on receiving this award, what an honor. We’re thrilled for you.

Thank you, Joyce. It was a bit of a surprise, but a very happy one, and a welcome change in a year full of surprises.

Could you start by defining privacy-preserving record linkage (PPRL) for us? What is the innovation that it represents, and how did you get started on this journey?

PPRL is a way of preserving the uniqueness of a person, while at the same time allowing that person’s information to be linked across institutions. It’s the best of both worlds — it gives you the opportunity for connecting data, while still making sure you preserve the private information of the individual.

We started trying to solve this fundamental problem 10 years ago. In many places in the U.S., there isn’t a consistent way of exchanging information across institutions for patient care or research. We explored some of the tools out there, and we realized there were methods in the computer science community based on cryptographic hashing. Hashes are widely used in internet transactions, and can be used to create an irreversible but unique code from a set of inputs (e.g. patient features like last name, first name). We wanted to experiment with applying that to healthcare.

We started with a specific clinical research use case: what was the burden of chronic diseases like diabetes in Chicago? The only way to accurately do that was to bring together data across sites, so that we wouldn’t over-count or double count if the same person with diabetes was seen at different institutions. So we experimented with hashes of the same person across different sites, which allowed us to know whether two records belonged to the same person, but not reveal who that person was (the irreversible part of hashes).

And this allowed us to get accurate counts of things like — how many people are there with diabetes across Chicago? How many people who have had heart attacks, or asthma, etc. And we were able to solve this basic problem of over-counting people across sites in Chicago through PPRL. Which is really a problem especially in dense population centers like cities, where there’s a lot of choice that people have in where they can receive healthcare.
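As a rough illustration of the idea described above, here is a minimal Python sketch of hash-based linkage. It is not the method used in the Chicago work; the keyed hash (HMAC-SHA256), the shared key, the use of date of birth as a third input, and all record fields and names are assumptions made for this example. Each site turns normalized patient features into an irreversible token, and the tokens, not the names, are combined to count people without double counting.

```python
import hashlib
import hmac

# Hypothetical secret shared by the participating sites (assumption for this
# sketch); in practice it would be managed securely, never hard-coded.
SHARED_KEY = b"example-shared-secret"


def normalize(value: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial differences still match."""
    return " ".join(value.strip().lower().split())


def make_token(last_name: str, first_name: str, birth_date: str,
               key: bytes = SHARED_KEY) -> str:
    """Return an irreversible hex token derived from the patient features."""
    message = "|".join(normalize(p) for p in (last_name, first_name, birth_date))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def diabetes_tokens(records: list[dict]) -> set[str]:
    """Tokens for the patients at one site who have a diabetes diagnosis."""
    return {
        make_token(r["last"], r["first"], r["dob"])
        for r in records
        if r["diabetes"]
    }


# Made-up records: the same person appears at both sites with different
# capitalization and stray spaces.
site_a = [
    {"last": "Rivera", "first": "Ana", "dob": "1970-01-02", "diabetes": True},
    {"last": "Chen", "first": "Li", "dob": "1985-05-06", "diabetes": True},
]
site_b = [
    {"last": " rivera", "first": "ANA ", "dob": "1970-01-02", "diabetes": True},
    {"last": "Okafor", "first": "Ngozi", "dob": "1990-09-10", "diabetes": True},
]

tokens_a = diabetes_tokens(site_a)
tokens_b = diabetes_tokens(site_b)

naive_count = len(tokens_a) + len(tokens_b)   # counts the shared patient twice
linked_count = len(tokens_a | tokens_b)       # set union removes the duplicate

print(f"Naive count: {naive_count}, linked count: {linked_count}")
```

Only the tokens leave each site, so the combined count can be computed without exchanging names. Exact-match tokens like these miss typos and name changes, which is one reason real deployments tend to use more elaborate schemes.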

It’s fascinating how a basic problem in healthcare can be solved by applying a method or technology from a different industry. Why do you think PPRL has potential in healthcare, especially?

Especially during a pandemic, there’s a growing awareness that there are so many other things that we do that affect our health — not just where we receive care and what care we get — but where we live, the people we interact with, where we go to school. As we realize there are all these externalities that affect health, it’s increasingly important to be able to link data across different sectors to understand what’s driving good or bad health outcomes.

As we create more and more data in our daily lives, that opportunity grows. We generate all sorts of data, and the ability to bring that data together to get the full picture of our health has never been more important or more possible.

Can you tell us about some of the research and evidence milestones that you and/or colleagues have achieved along the way in proving the value of PPRL? What studies come to mind?

The first one that comes to mind is a project in which we linked data for the Veterans Affairs (VA) system. The real challenge there was navigating one of the more secure health data systems out there. It was a real proof point that this method could be acceptable, even in a really restrictive environment. The VA researchers actually approached us and asked if we could help solve this problem, to answer this question of how many veterans receive care outside the VA. And they couldn’t answer it without linking data across sites, across institutions. So we were able to work with them to apply PPRL there, and it turns out that over 20% of veterans receive care outside the VA over the course of a year.

We’ve done linkage studies on cohorts from the All of Us research program, which also required linkage across sites. That one really underscored the fact that people may receive care from different institutions, and if you really want to get the full picture, you need to span the different sectors.

We did a study on linking data in order to recruit patients for a clinical trial, the ADAPTABLE study run by PCORnet. And we showed that you could check whether someone is eligible for a clinical trial by working with community organizations like churches.

Another one — this was a pilot study — showed that we could link registries across states. We worked with four state cancer registries to demonstrate that it’s possible, and you could identify patients who received care in multiple institutions because they had multiple different kinds of cancer.

Why do you think PPRL is important now, in particular?

An aphorism that’s used often in data exchange is that “data follows trust.” Using methods that preserve privacy is a technical means to engender or encourage trust. At a time when no one wants to work with anybody and we’re in a fragmented society, it’s important to do everything we can to engender trust across communities and bring people together. We’re having a moment right now: there have never been worse levels of distrust and societal fragmentation. PPRL can help a little bit to span those gaps, between people and between ideas. It’s the right time for it.

We’ve seen a tremendous amount of interest in using these methods for public health. That’s been demonstrated by the adoption of PPRL by the National Patient-Centered Clinical Research Network (PCORnet) and the National COVID Cohort Collaborative (N3C) to link data across hundreds of institutions.

And especially in a pandemic, there’s another level of urgency for that now, right?

Definitely. I teach a class on Wednesdays, the class is mostly physicians — and I asked them, is it okay to give up a certain amount of privacy in a pandemic? And most said yes, but of course it’d be better not to have to give that up.

In a pandemic you need to track who’s infected, who was immunized, and so on. With PPRL, again, this is a situation where you can get the best of both worlds. You can still calculate the counts and know who is affected, but you don’t have to generate a privacy risk.

There’s a recent article in Nature that talks about the importance of accurate record linkage for fighting the pandemic, and PPRL methods could be part of that solution.

What do you think is needed for PPRL to scale and become more useful to society?

There’s definitely a critical mass element to it. The more data that’s available for linking, the more sites and domains and people you can touch, the more valuable that amalgamation of data becomes. It’s like gravity: with a large enough mass it becomes super powerful. With a smaller mass, the pull isn’t as great. The greater the number of linkages, the greater the number of insights you can generate.

To make that possible — we need to make it super easy to use. We can still do a better job of helping people to understand what it is. And we need more proof points, domain-specific and friendly examples that resonate with people. Let people use the tool to be as creative as they can be — if you put these tools in people’s hands, they’ll find creative ways to use it. We need more people playing with the tools and in different domains. I definitely think it’s under-explored, the cross-sector linkages.

In the pandemic, for example: could you link the household (perhaps through emergency contact info in school records) to known cases of COVID, to identify the actual burden of COVID at a given school? That’s the kind of detailed information that’s super useful. As a parent, I’d love to know whether there are 1 or 100 households with positive cases in my kid’s school. That’s a cross-sector linkage, and you could do that in a way that doesn’t violate people’s privacy: just get a count, and it gives you information that can help you make a decision. But it requires sparking the imagination of someone in a school, and someone in public health.

When I talk about examples like this, most people say — that’s not possible, that would never happen. And then you explain it [PPRL] to them, then they realize it’s possible, that it’s a real thing… and then they start imagining and realizing all the possibilities.

Why Privacy-Preserving Record Linkage is Having a Moment: Interview with Abel Kho was originally published in Datavant on Medium, where people are continuing the conversation by highlighting and responding to this story.