In our Ecosystem Explorer Series, we interview leaders from partner organizations who are improving access to real-world data. Today’s interview is with Jason LaBonte, CEO at Veritas Data Research.
Jason LaBonte is the chief executive officer of Veritas Data Research, where he is responsible for the overall management of the company and its operations. He is an executive with over 15 years of experience in leading healthcare information and technology companies, most recently as chief strategy officer at Datavant. Jason received his Ph.D. in virology from Harvard University, and his A.B. in molecular biology from Princeton University.
Founded by experts in the data analytics industry, Veritas Data Research uses cutting-edge technology and efficient workflow design to collect, curate, and distribute foundational reference datasets. Veritas makes critical information accessible to data and analytics teams across the healthcare vertical, as well as customers in the financial and insurance sectors.
Introduction to mortality data and why it’s important
Jason, welcome to the Ecosystem Explorer interview series! To start off, can you give us a quick overview of what mortality data is and why it’s important to researchers?
Mortality is a critical endpoint in health analytics — whether a patient survives their disease (or procedure), or has succumbed to it, should be one of the basic measures of treatment efficacy, public health policy, and protocol design. Unfortunately, unless a patient dies in a healthcare facility, this event is not well captured in the clinical datasets normally used in real-world data analytics, such as insurance claims or electronic health records.
Therefore, to determine the vital status of the patients in a study cohort, it is necessary to augment clinical real-world data (RWD) with a mortality dataset like Veritas’s Fact of Death offering. Very simply, this dataset has a record for every deceased individual in the United States that we can find, going back to 1935. For each record, we report who died (where each person is represented as a Datavant token set when deidentified), when they died, and where they died (at the zip code level).
By linking mortality data to clinical RWD, researchers can then determine which patients are alive or deceased, allowing them to build more accurate survival curves, measure mortality as an endpoint in health economics and outcomes research (HEOR) studies and in pragmatic trials, and better build synthetic control arms for use in interventional clinical trials.
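To make the linkage step concrete, here is a minimal sketch of how a cohort might be joined to death records and turned into a survival curve. It is an illustration only, not Veritas's or Datavant's implementation: the dictionary keys stand in for deidentified tokens, the function names `link_vital_status` and `kaplan_meier` are hypothetical, and a plain dictionary lookup replaces real token-based linkage. Patients with no linked death inside the observation window are treated as censored at the study end date.

```python
from datetime import date

def link_vital_status(cohort, deaths, study_end):
    """Return one (days_observed, died) pair per cohort patient.

    cohort: dict mapping a token to the patient's index date.
    deaths: dict mapping a token to a date of death.
    Both structures are hypothetical stand-ins for tokenized datasets.
    """
    observations = []
    for token, index_date in cohort.items():
        death_date = deaths.get(token)
        if death_date is not None and index_date <= death_date <= study_end:
            observations.append(((death_date - index_date).days, True))
        else:
            # No linked death inside the window: censor at study end.
            observations.append(((study_end - index_date).days, False))
    return observations

def kaplan_meier(observations):
    """Return [(time, survival_probability)] at each time with a death."""
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted({days for days, _ in observations}):
        deaths_at_t = sum(1 for days, died in observations if days == t and died)
        leaving = sum(1 for days, _ in observations if days == t)
        if deaths_at_t:
            survival *= 1 - deaths_at_t / at_risk
            curve.append((t, survival))
        # Everyone observed exactly until t (dead or censored) leaves the risk set.
        at_risk -= leaving
    return curve

if __name__ == "__main__":
    start = date(2020, 1, 1)
    cohort = {"tokA": start, "tokB": start, "tokC": start, "tokD": start}
    deaths = {"tokA": date(2020, 1, 11), "tokB": date(2020, 1, 31)}
    obs = link_vital_status(cohort, deaths, study_end=date(2020, 4, 10))
    for t, s in kaplan_meier(obs):
        print(f"day {t}: estimated survival {s:.2f}")
```

Without the mortality link, the two deceased patients in this toy cohort would look identical to patients who simply stopped appearing in the clinical data, which is the gap described above.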
And how is Veritas involved with mortality data? Tell us a little bit about your company and its mission.
Veritas was founded to make critical reference datasets much more accessible and, in so doing, to increase the utility of all clinical data. We believe that a dedicated focus on creating these datasets will result in higher quality, higher coverage data than what is often available as the “exhaust” from systems designed for other purposes. We also believe that vulnerable populations are often under-represented in the datasets used today, and part of our mission is to fill that data gap as well.
We started with mortality data because it is a vital endpoint that we felt was just too difficult to access. The data that was available to analysts had low coverage, a number of restrictions, and a lack of timeliness. Through Datavant, analysts could at least aggregate multiple mortality datasets to create something with coverage that was good enough to use, but they still had to do a lot of de-duplication and data cleaning.
At Veritas, we thought we could do a lot better with a focused effort. By sourcing, collating, and indexing mortality data from over 40,000 public, private, and government sources, Veritas has now built the most complete and timely mortality dataset on the market. And all of those records are delivered in a single dataset, so the user doesn’t need to do any aggregation or de-duplication work.
The challenges of mortality data
Intuitively, it feels like it should be very easy in this day and age to find out whether someone has died. Why is this so hard?
Every death is recorded by states in a death certificate, and those deaths are all aggregated at the CDC, so the data is out there. However, these government sources don’t allow access to individual-level records for commercial use cases. Even for research applications, these data sources are very hard to access — sometimes taking years to obtain. Unfortunately, even governmental agencies struggle to access this data for their work.
You mentioned the CDC aggregates mortality data — that’s a reference to the National Death Index (NDI), a centralized database of death record information compiled from state vital statistics offices. Could you talk more on the NDI’s constraints that would drive organizations to acquire other sources of mortality data?
Most use cases that are of interest to pharmaceutical companies, payers, and even providers are not allowed under the NDI’s charter. Of those that are allowed, we’ve been told the CDC prefers that the NDI data does not leave their systems, often requiring that a researcher’s data is sent to them for linkage to the NDI and analysis. With these constraints, most folks need to acquire mortality data outside of the NDI.
Let’s talk about the other sources of mortality data beyond the CDC. What are those, and are there any challenges associated with collecting and managing large volumes of mortality data?
Mortality data is available in a number of public places, from obituaries to cemetery listings. However, these sources are numerous and fragmented, so it takes a large effort to scour them all. Veritas, for example, examines ~40,000 sources across the United States to find mortality events.
Timely data is critical for many of the use cases we serve, so we need to find mortality events as fast as we can. Our collection processes therefore refresh the dataset every week, which required us to build a lot of automation into our data processing workflow.
And the data our system collects is raw and unstructured, so we have built an entire data extraction, cleaning, and standardizing workflow to take the mortality information we find and turn it into an analytics-ready dataset.
The curation process — turning raw data into structured, usable data — must be quite challenging, especially if you’re pulling from tens of thousands of sources. How do you approach curation to make mortality data useful for health organizations?
During our data curation process, we try to remove a lot of the work that researchers and our other customers would typically need to do. For instance, we work to standardize the data as much as we can using reference datasets. We remove special characters from first and last names, and then validate names against a names database to make sure only records with a real name are included in our deliverable. We validate locations against the USPS reference database and report out the standardized USPS value for city, state, and zip code.
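As a rough illustration of the name-cleaning step described above, the sketch below strips special characters and checks the result against a reference list. Everything here is an assumption made for the example: `KNOWN_NAMES` is a tiny hypothetical stand-in for a real names database, and folding accented letters to ASCII and dropping hyphens and spaces are choices made for the sketch, not rules stated in the interview.

```python
import re
import unicodedata

# Hypothetical stand-in for a names reference database.
KNOWN_NAMES = {"MARY", "JOHN", "JOSE", "SMITH", "OBRIEN"}

def clean_name(raw):
    """Uppercase a name, fold accents to ASCII, and keep letters only."""
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z]", "", ascii_only).upper()

def is_valid_record(first, last, known_names=KNOWN_NAMES):
    """Keep a record only if both cleaned names appear in the reference set."""
    return clean_name(first) in known_names and clean_name(last) in known_names

if __name__ == "__main__":
    print(clean_name("O'Brien"))
    print(is_valid_record("José", "O'Brien"))
    print(is_valid_record("J0hn", "Smith"))
```

A production pipeline would likely need gentler rules for compound and hyphenated names; the point is only that cleaning happens before validation, so that formatting noise does not cause a real name to be rejected.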
And because we source data from so many different places, we will generally find a mortality record for the same person in multiple places. We have algorithms in place to de-duplicate those records, consolidating them into a single mortality record. Where we can, we use the multiple sources to fill gaps to create the most complete mortality record possible, and we can create a confidence score for each record in the process.
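The consolidation idea can be sketched in a few lines. This is not Veritas's algorithm, which the interview does not describe: the sketch assumes records have already been standardized, matches them on an exact (first name, last name, date of death) key, fills a missing zip code from any source that has one, and uses a made-up confidence formula based on the number of distinct sources. The field names and the `consolidate` function are hypothetical.

```python
from collections import defaultdict

def consolidate(records):
    """Merge records that share the same name and date of death."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["first"], rec["last"], rec["date_of_death"])
        groups[key].append(rec)

    merged = []
    for (first, last, dod), group in groups.items():
        # Fill gaps: take the zip code from the first source that reports one.
        zip_code = next((r["zip"] for r in group if r.get("zip")), None)
        sources = sorted({r["source"] for r in group})
        merged.append({
            "first": first,
            "last": last,
            "date_of_death": dod,
            "zip": zip_code,
            "sources": sources,
            # Toy score (an assumption): more distinct sources, higher confidence.
            "confidence": min(1.0, len(sources) / 3),
        })
    return merged

if __name__ == "__main__":
    raw = [
        {"first": "JOHN", "last": "SMITH", "date_of_death": "2021-03-04",
         "zip": None, "source": "obituary"},
        {"first": "JOHN", "last": "SMITH", "date_of_death": "2021-03-04",
         "zip": "02139", "source": "funeral_home"},
        {"first": "MARY", "last": "OBRIEN", "date_of_death": "2020-11-20",
         "zip": "10001", "source": "obituary"},
    ]
    for row in consolidate(raw):
        print(row)
```

Real de-duplication would presumably need fuzzy matching to handle name variants and conflicting dates; exact-key grouping is used here only to keep the example short.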
Can you share some of Veritas’s data sources?
Some of Veritas’s sources are online obituary announcements, funeral home notices, military & veterans cemetery listings, and the Social Security Administration’s Limited Access Death Master File (LADMF). We are continuously sourcing and adding incremental mortality data, and have been increasing our coverage rates every month.
It sounds like most of your data sources are publicly available. Does mortality data pose any unique challenges for data privacy and security?
Mortality data gathered from public sources is not considered protected health information (PHI), nor is it subject to consumer data regulations like GDPR or the California Consumer Privacy Act (CCPA). Instead, this form of mortality data would be categorized as personally identifiable information (PII). That said, our health customers in particular often need to link our mortality data with PHI, so we are well-versed in the process of deidentification and token-based linkage. In partnership with Datavant, our mortality data can be joined with any customer’s health data in a privacy-preserving manner.
We take data security very seriously. Our predominant workflow is that our data is delivered to the customer, who uses it within their environment. We support whatever method of file transfer they prefer, whether that is Secure File Transfer Protocol (SFTP) or data sharing within cloud providers like Snowflake or Databricks.
Looking to the future
We’ve talked about the opportunities and challenges of mortality data. Now let’s look to the future. How do you believe greater access to mortality data will improve healthcare?
Having access to mortality data will allow researchers to better document long-term survival statistics for clinical and longitudinal research. They will be able to more accurately measure the efficacy of new drugs or treatment protocols in real-world settings. They will be able to better model and identify high-risk patient populations to be able to intervene earlier with preventative care. And because our mortality dataset has better representation of vulnerable populations, these analyses will be more accurate for traditionally underrepresented groups.
What do you see as the most exciting opportunities for researchers and organizations working with mortality data in the coming years?
We are excited to extend our mortality data to cover the cause of death, and potentially the social factors associated with death. With the addition of cause of death, researchers will be able to tease apart death events that are related to the condition they are studying, and those that should not be included (e.g. removing patients who die in a car accident from a cancer survival curve). With social data, researchers can augment their analyses of mortality outcomes with the non-clinical factors that should be part of a risk or outcomes assessment — what could be considered the “social determinants of death”.
Are there any other innovations in this space that you are particularly excited about?
We are excited by the innovations surrounding the use of RWD in clinical trial settings, including pragmatic (RWD-only) studies, building synthetic control arms for interventional trials, and long-term monitoring of trial patients. We think mortality data should be a key component of each of these efforts, and we’ve worked hard to build our data with maximum transparency and traceability to comply with the FDA’s emerging real-world evidence (RWE) guidance around data provenance.
Jason, thanks very much for the interview! Final question: If our readers want to learn more about mortality data, do you have any recommended resources or links?
Absolutely! For research studies using mortality data, check out the COVID-19 Research Database. Additionally, here is a comprehensive overview of the Veritas Fact of Death Index.
For Datavant customers who want to learn more about Veritas’s mortality data, our data is tokenized and available for exploration on the Datavant Portal. Interested organizations can conduct an overlap with our data whenever they would like.
For more detailed questions, you can reach out to us directly at Sales@veritasdataresearch.com.
This interview is part of our Ecosystem Explorer Series, in which we interview leaders from partner organizations who are improving access to health data. Contact us if you’re interested in participating in this series.