Special thanks to the following individuals for their help writing this article: Surafeal Asgedom, Michael Stebbins, David Shulkin, Niall Brennan, Tom Carton, Andrew von Eschenbach, and Charlie Rothwell.
***
In the US, there are over 2,000 government health data sets at the federal, state, and local levels, including several of the largest surveillance databases (such as FDA’s Sentinel and the CDC’s surveillance infrastructure), one of the world’s largest claims databases (CMS data), and one of the largest electronic medical records datasets (VA data). Historically, government data has been created, stored, and analyzed in data silos. This fragmentation limits the utility of government data in answering critical public health questions. This article provides an overview of the landscape of government health data sources as well as the opportunity for a more effective public health infrastructure.
***
One of the challenges the COVID-19 pandemic has highlighted is the level of health data fragmentation in the United States. Millions of patients have had the disease, and vaccination efforts are underway, yet as a society we still struggle to understand fairly basic questions about the epidemiology of the disease, the long-term impact of the disease, the efficacy of various public health interventions, and the impact of the vaccine. For example, given the amount of data that has been collected at this stage, as a society we should be able to answer:
Going forward, as the vaccine is more widely distributed, we should also be able to answer questions about the efficacy of the vaccine and the long-term impacts of the pandemic on both individuals and public health:
The challenge of data fragmentation doesn’t just apply to COVID-19, but to all diseases. There is a tremendous opportunity to use data to improve healthcare across diseases. Every time patients see a doctor, visit a pharmacy, check into a hospital, take a lab test, or pass away, there is information collected about the safety and efficacy of drugs, the epidemiology of diseases, and the health of populations that should not be lost. However, all of these disparate data points have limited utility when analyzed individually — it is when they are brought together that these data points form a full picture of the patient’s health. Each additional piece of data that can be linked together has the potential to exponentially increase the value of the data set for understanding key public health questions.
Thus, linking data across these silos is critical to solving the larger health problems that still plague us: expensive and redundant care, poor health outcomes (especially for disadvantaged populations), poor care coordination across agencies (especially when crossing state and Federal boundaries of responsibility), poor public health visibility, poor coordination of social benefits to those who need them to prevent health problems, and many more.
We’ve written extensively about how data is fragmented in the commercial data ecosystem, and created a map of that ecosystem. In this post, we focus on the challenge of data fragmentation at the government level. On top of the complexity of the commercial data ecosystem, there are more than 2,000 data sets across federal, state, and local governments that incorporate different types of health information, ranging from lab test results to drug safety surveillance to data on socioeconomic determinants of health, plus thousands more from NGOs, universities, and institutions that work closely with the public sector. Yet for all the data that is collected, there is very little connectivity across parts of the government, whether at the federal, state, local, or tribal level.
Below is a sampling of government agencies and subagencies with health relevant data. This graphic is not comprehensive, and includes sample state and local agencies along with federal agencies, initiatives, and purpose-built datasets.
Each of the government agencies, subagencies, and initiatives listed above has its own goals and funding, and has developed its own operational processes to fulfill its individual mission. This specialization creates data silos, as well as duplicative data collection when different agencies have similar questions to answer.
Take for example Jane, a 70-year-old living in a skilled nursing facility. When Jane receives a COVID-19 vaccine, that piece of information is captured by numerous federal, state, and local agencies, each for its own purpose:
In this example, Jane’s vaccination status is relevant to at least five government agencies, but each collects the data separately as part of its operational processes or to support its own analytics. Each agency has invested in data collection, but still has an incomplete picture of Jane’s health. For instance, the CDC can track vaccination rates, but does not have information on whether Jane later receives treatment for the virus, which is a data point held by CMS.
Historically, government agencies have created specialized initiatives focused on a single disease area or public health issue to address this issue of redundancy and fragmentation. For example, to better understand the disparate impacts of COVID-19 on minority populations, the CDC began collecting data on patient ethnicity in August 2020, five months after the beginning of the pandemic. However, to preserve patient privacy, these initiatives collect only the minimum necessary information to fulfill their specific purpose. That mindset results in creating yet another data silo, custom-built to answer another limited set of questions.
Individual agencies or initiatives can help answer specific questions and solve problems in the immediate term, but the challenge is to respond to questions that span the whole of health care, for which no single data set is sufficient to support decision-making. For these “big questions”, data sets must be linked together to see the entire patient journey, from the environmental, social, and genetic factors that led to disease onset, through the entire care path, to the long-term outcomes for that patient.
To have a holistic view of a patient and their experiences, researchers need to be able to link the disparate data silos at a patient level, without compromising patient privacy. Patient privacy is a key challenge to data linkage because organizations are reluctant to share identified information (Protected Health Information) with other entities, even when they are other government agencies. To make these data exchanges more acceptable, institutions should consider whether to de-identify data before sending it; emerging cryptographic technologies in the domain of “privacy-preserving record linkage” can allow data to be linked while privacy is protected.
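As a rough illustration of the idea behind privacy-preserving record linkage, the sketch below derives a de-identified token from a patient's identifying fields using a keyed hash (HMAC-SHA256). This is a simplified, hypothetical example and not a description of any agency's or vendor's actual method: the function names, the choice of fields, and the normalization rules are all assumptions. Production systems also have to handle typos, name changes, key management, and re-identification risk, none of which is addressed here.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters.

    Hypothetical normalization rule: it makes trivial formatting
    differences ("Doe" vs " DOE ") produce the same token.
    """
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(secret_key: bytes, first: str, last: str, dob: str, sex: str) -> str:
    """Return a hex token derived from identifying fields.

    Two organizations holding the same secret key produce the same token
    for the same person, so records can be matched on the token alone.
    Without the key, the token cannot be recomputed from guessed inputs.
    """
    message = "|".join(normalize(part) for part in (first, last, dob, sex))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()
```

Under these assumptions, each agency would compute tokens locally and share only the tokens plus the non-identifying fields needed for analysis, so names and birth dates would never leave the source system.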
Expanding the use of data linking would enable the government to better understand our healthcare system and its delivery patterns, even beyond COVID-19. For example, duplicative provision of care may cost the U.S. healthcare system up to $78 billion a year. Linked data would enable the government to identify the types of services, both medical and social, that are most likely to be unnecessarily duplicated as patients move between disparate agencies.
Below, we’ve highlighted three sample areas where linked data could enable researchers to answer questions related to COVID-19’s impacts, oncology, and the opioid epidemic, and could enable the government to more effectively deliver interventions:
Linked data can be used to understand the long-term impacts of COVID-19 on patients. For example, one impact of COVID-19 in children is multisystem inflammatory syndrome (MIS-C). The long-term impacts of this syndrome are still unclear, but numerous data sources will capture relevant information:
Similar data sources can also be used to understand the impacts of long COVID in adults. Additional relevant data sets for long COVID could include:
Similarly, key questions about the opioid epidemic can be answered by linking together data sets across multiple agencies:
The Cancer Moonshot was designed to accelerate research into new therapies for cancer, as well as to improve early detection and prevention of cancer. Data linking would enable future initiatives like the Cancer Moonshot to answer key questions about therapy effectiveness and cancer prevention:
The gaps in today’s system have been made clear by the COVID-19 pandemic. The CDC’s inability to link data across state public health agencies impeded its ability to create dashboards to understand case loads across various geographies. Instead, the Johns Hopkins COVID-19 dashboard became the authoritative source for COVID-19 case numbers, as it efficiently aggregated the disparate state-level data silos.
To be clear, these gaps are driven by the inability to link data rather than gaps in data collection efforts. Each government agency and initiative has made foundational investments in collecting and understanding data in order to provide the most effective and efficient services to its constituents.
Today, this data collection at the federal level is governed by the Paperwork Reduction Act, which requires federal agencies to develop information collection requests for any data gathering process. As a result, the federal government has an inventory of data sets that have been collected over time and can easily analyze which data sets should be connected to one another. Connecting these data sets would also reduce the burden of data gathering activities on the public, one of the main goals of the Paperwork Reduction Act.
The next step to unlocking the power of this data is to link it across data sets in a privacy-protecting manner, which will enable researchers and policymakers to answer basic, foundational questions about how to best provide healthcare services today. Privacy-protecting data linkage across data sets will enable researchers to connect existing data sets to answer pressing questions with minimal need for additional data collection, and ensure that patients receive the benefits of linked data without compromising their privacy.
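To make the linkage step concrete, here is a minimal sketch of how two de-identified data sets might be joined to answer the question raised in the earlier vaccination example: which vaccinated individuals later received treatment? Everything here is hypothetical. The record types, field names, and the assumption that both data sets already carry a matching de-identified `token` per person are illustrative, not a description of how CDC or CMS data is actually structured.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class VaccinationRecord:
    token: str          # de-identified person token (assumed to exist already)
    vaccinated_on: date


@dataclass(frozen=True)
class TreatmentClaim:
    token: str          # same token scheme as the vaccination data set
    treated_on: date


def treated_after_vaccination(
    vaccinations: Iterable[VaccinationRecord],
    claims: Iterable[TreatmentClaim],
) -> Set[str]:
    """Return tokens of people with a treatment claim dated after their first dose."""
    first_dose: Dict[str, date] = {}
    for record in vaccinations:
        earliest = first_dose.get(record.token)
        if earliest is None or record.vaccinated_on < earliest:
            first_dose[record.token] = record.vaccinated_on

    return {
        claim.token
        for claim in claims
        if claim.token in first_dose and claim.treated_on > first_dose[claim.token]
    }
```

Neither data set needs to contain names or birth dates for this join to work; the token is the only shared key, and no new data collection is required.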
Public institutions have taken the first steps towards unlocking the power of linked data. For example, N3C, the National COVID-19 Cohort Collaborative, is linking data from disparate clinical sites to speed research into the COVID-19 pandemic. The All of Us program has linked data across fragmented EHR records to understand which types of patients are more likely to receive fragmented or incomplete care.
If more government institutions make their data linkable with other government institutions, we can dramatically increase the speed at which researchers can find answers to questions about the COVID-19 pandemic. But the power of linked data is not confined to this pandemic; instead, linked data can be used to ensure that patient cohorts are well understood in their complexity, so that targeted and meaningful interventions can be made to improve public health for the many chronic conditions prevalent in the United States. Solving the government health data fragmentation challenge, while still protecting patient privacy, will dramatically improve patient outcomes across the United States and the world.
¹ https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)32220-0/fulltext
² Pew Charitable Trusts: https://www.pewtrusts.org/en/research-and-analysis/articles/2019/12/09/improved-provider-coordination-can-reduce-health-care-cost
³ https://pra.digital.gov/burden/
AnalyticsIQ, a marketing data and analytics company, recently adopted Datavant’s state de-identification process to enhance the privacy of its SDOH datasets. By undergoing this privacy analysis prior to linking its data with other datasets, AnalyticsIQ has taken an extra step that could contribute to a more efficient Expert Determination (which is required when its data is linked with others in Datavant’s ecosystem).
AnalyticsIQ’s decision to adopt state de-identification standards underscores the importance of privacy in the data ecosystem. By addressing privacy challenges head-on, AnalyticsIQ and similar partners are poised to lead clinical research forward, providing datasets that are not only compliant with privacy requirements, but also ready for seamless integration into larger datasets.
“Stakeholders across the industry are seeking swift, secure access to high-quality, privacy-compliant SDOH data to drive efficiencies and improve patient outcomes,” says Christine Lee, head of health strategy and partnerships at AnalyticsIQ.
“By collaborating with Datavant to proactively perform state de-identification and Expert Determination on our consumer dataset, we help minimize potentially time-consuming steps upfront and enable partners to leverage actionable insights when they need them most. This approach underscores our commitment to supporting healthcare innovation while upholding the highest standards of privacy and compliance.”
As the regulatory landscape continues to evolve, Datavant’s state de-identification product offers an innovative tool for privacy officers and data custodians alike. By addressing both state-specific and HIPAA requirements, companies can stay ahead of regulatory demands and build trust across data partners and end-users. For life sciences organizations, this can lead to faster, more reliable access to the datasets they need to drive research and innovation while supporting high privacy standards.
As life sciences companies increasingly rely on SDOH data to drive insights, the need for privacy-preserving solutions grows. Data ecosystems like Datavant’s, which link real-world datasets while safeguarding privacy, are critical to driving innovation in healthcare. By integrating state de-identified SDOH data, life sciences organizations can gain a more comprehensive view of patient populations, uncover social factors that impact health outcomes, and ultimately guide clinical research that improves health.
Both payers and providers are increasingly utilizing SDOH data to enhance care delivery and improve health equity. By incorporating SDOH data into their strategies, both groups aim to deliver more personalized care, address disparities, and better understand the social factors affecting patient outcomes.
Payers increasingly leverage SDOH data to meet health equity requirements and enhance care delivery:
Payers’ consideration of SDOH underscores their commitment to improving health equity, delivering targeted care, and addressing disparities for vulnerable populations.
Capital District Physicians’ Health Plan (CDPHP) incorporated SDOH data, partnering with Papa to combat loneliness and isolation in older adults, families, and other vulnerable populations. CDPHP aimed to address:
By integrating SDOH data, CDPHP enhanced its services to deliver comprehensive care for its Medicare Advantage members.
Value-based care organizations face challenges in fully understanding their patient panels. SDOH data helps providers address these challenges and improve patient care. Here are some examples of how:
By leveraging SDOH data, providers gain a more comprehensive understanding of their patient population, leading to more targeted and personalized care interventions.
While accessing SDOH data offers significant advantages, challenges can arise from:
To overcome these challenges, providers must have robust data integration strategies, standardization efforts, and access to health data ecosystems to ensure comprehensive and timely access to SDOH data.
With Datavant, healthcare organizations are securely accessing SDOH data and further enhancing the efficiency of their datasets through state de-identification capabilities, empowering stakeholders across the industry to make data-driven decisions that drive care forward.