How does a researcher connect a patient’s hemoglobin test results with their insulin prescription history and with their vital signs measured by wearable technology? Advances in healthcare technology and the accompanying abundance of digital healthcare data can give researchers rich views into the holistic patient journey, in this case that of a patient who may be at risk for diabetes. However, reaching this holistic view is very difficult in today’s real-world data landscape unless there is confidence that the patients in these datasets can be successfully matched.
Real-world datasets are imperfect. They often contain invalid or missing data, and patient demographic fields can vary across datasets. Intelligently and accurately stitching together patient records across data sources is therefore paramount for analyses that involve multiple sources of real-world data, meet a research-grade standard, and require regulatory approval.
The starting point of patient matching is always some form of pairwise matching logic — that is, a criterion that determines whether a pair of records correspond to the same patient. There are a variety of criteria one can use, ranging from deterministic logic such as a rule that links two records if they have the same name and date of birth, to more complex algorithms with probabilistic or machine learning components.
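As a rough illustration of the deterministic end of that range, the rule "same name and date of birth" could be sketched as follows. The `PatientRecord` type, its field names, and the normalization step are assumptions made for this example, not a description of any particular matching product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PatientRecord:
    # Hypothetical record layout, used only for this sketch.
    record_id: int
    first_name: Optional[str]
    last_name: Optional[str]
    date_of_birth: Optional[str]  # ISO date string, e.g. "1980-04-12"


def _normalize(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a field; treat blank strings as missing."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def is_deterministic_match(a: PatientRecord, b: PatientRecord) -> bool:
    """Link two records only if name and date of birth are all present and equal."""
    fields_a = tuple(_normalize(v) for v in (a.first_name, a.last_name, a.date_of_birth))
    fields_b = tuple(_normalize(v) for v in (b.first_name, b.last_name, b.date_of_birth))
    if None in fields_a or None in fields_b:
        return False  # a missing value is never evidence of a match
    return fields_a == fields_b
```

Treating a missing field as a non-match is one possible design choice. It keeps the rule strict, at the cost of leaving incomplete records unlinked.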
Limitations of pairwise comparisons
Consider the following table of patient data:
Suppose we use the pairwise comparison rule that two patient records correspond to the same patient if they share either the same last name or at least two demographic fields.
An evaluation of pairs of records in isolation can be represented with pairs of nodes joined by edges, as shown below. The numbers in the nodes correspond to Record IDs in the table, and two nodes are connected if they satisfy the pairwise matching criteria.
This representation suggests that records 1 and 2 should have the same patient ID, as should records 2 and 3. Continuing with this logic suggests that all 6 records should have the same patient ID. The alternative is to make an arbitrary determination as to how to assign patient IDs.
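The "continue with this logic" step amounts to taking the transitive closure of the pairwise matches. A minimal sketch using a union-find structure is shown here. The list of matched pairs is an assumed edge set chosen to resemble the six-record example, not data taken from the original table.

```python
def assign_ids_by_transitive_closure(record_ids, matched_pairs):
    """Give every record the ID of its connected group of pairwise matches."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    return {r: find(r) for r in record_ids}


# Assumed pairwise matches for six records (illustrative only).
matched_pairs = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
ids = assign_ids_by_transitive_closure(range(1, 7), matched_pairs)
print(len(set(ids.values())))  # prints 1: every record ends up with the same ID
```

With these assumed pairs, a single link between records 3 and 4 is enough to merge all six records into one patient, which is the failure mode described above.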
This problem becomes increasingly intractable in real-world situations, in which we may have hundreds of patient records that are all sequentially connected. Assigning the same patient ID to all such records can create numerous incorrect patient associations, causing patient records to be merged incorrectly and potentially putting patient care at risk. The value of the patient match graph is the added precision achieved by layering it on top of pairwise comparisons.
Advantages of using match graphs
A patient match graph is a data structure that accounts for multi-record interrelationships by building on pairwise comparisons. Matching algorithms that make use of match graphs are more accurate than approaches that consider each pair of records independently. In particular, the increase in precision is apparent when working with datasets that have missing values.
A patient match graph is a representation of patient data that showcases the relationships among patient records. It consists of nodes, with each node representing a single patient record. Two nodes, or patient records, are connected by an edge if the records share enough data to possibly be considered a match; in this way, each edge of the graph represents a pairwise comparison.
Continuing with our example of the patient records in Table 1, we can use the same criteria for connecting a pair of records that we used when considering pairs in isolation: requiring that a pair of records share either the same last name or at least two demographic fields. This leads to the representation below.
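One way to build such a graph in code is sketched here. The choice of demographic fields (`date_of_birth`, `zip_code`, `gender`) and the four sample records are invented for illustration and are not the records from Table 1.

```python
from itertools import combinations

# Assumed demographic field names; the article does not list them.
DEMOGRAPHIC_FIELDS = ("date_of_birth", "zip_code", "gender")


def shares(a, b, field):
    """True when both records have the field populated with the same value."""
    return a.get(field) is not None and a.get(field) == b.get(field)


def is_candidate_pair(a, b):
    """Same last name, or at least two matching demographic fields."""
    if shares(a, b, "last_name"):
        return True
    return sum(shares(a, b, f) for f in DEMOGRAPHIC_FIELDS) >= 2


def build_match_graph(records):
    """Return an adjacency map {record_id: set of neighbouring record_ids}."""
    graph = {r["record_id"]: set() for r in records}
    for a, b in combinations(records, 2):
        if is_candidate_pair(a, b):
            graph[a["record_id"]].add(b["record_id"])
            graph[b["record_id"]].add(a["record_id"])
    return graph


# Made-up records for illustration only.
records = [
    {"record_id": 1, "last_name": "Smith", "date_of_birth": "1975-02-01", "zip_code": "12345", "gender": "F"},
    {"record_id": 2, "last_name": "Smith", "date_of_birth": None, "zip_code": None, "gender": "F"},
    {"record_id": 3, "last_name": "Jones", "date_of_birth": "1975-02-01", "zip_code": "12345", "gender": None},
    {"record_id": 4, "last_name": None, "date_of_birth": "1990-07-15", "zip_code": None, "gender": None},
]
graph = build_match_graph(records)
```

In this sketch, records 1 and 2 are linked by last name, records 1 and 3 by date of birth plus zip code, and record 4 stays unconnected because missing values never count as agreement.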
Since we have represented the data as a graph, we can use graph theory to inform our assignment of patient IDs. In the match graph, we can see that there appear to be two clusters of records — one consisting of records 1, 2, and 3, and a second consisting of records 4, 5, and 6. These two clusters are connected by a single “bridge” edge between records 3 and 4. Therefore, we make the decision to cut this bridge edge, and assign patient ID “A” to records 1, 2, and 3, and patient ID “B” to records 4, 5, and 6.
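The bridge-cutting decision can be sketched in code under some assumptions. The edge list below is a guess at a graph with the shape described (two tightly connected clusters joined by one edge), and cutting every bridge is a simplification chosen for the example rather than a statement of how any production algorithm works.

```python
def find_bridges(graph):
    """Return edges whose removal disconnects the graph (DFS low-link method)."""
    discovery, low, bridges = {}, {}, []
    counter = 0

    def visit(node, parent):
        nonlocal counter
        discovery[node] = low[node] = counter
        counter += 1
        for neighbour in graph[node]:
            if neighbour == parent:
                continue
            if neighbour in discovery:
                low[node] = min(low[node], discovery[neighbour])
            else:
                visit(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                if low[neighbour] > discovery[node]:
                    bridges.append(tuple(sorted((node, neighbour))))

    for node in graph:
        if node not in discovery:
            visit(node, None)
    return bridges


def cut_bridges(graph):
    """Return a copy of the graph with every bridge edge removed."""
    bridges = set(find_bridges(graph))
    return {
        node: {n for n in neighbours if tuple(sorted((node, n))) not in bridges}
        for node, neighbours in graph.items()
    }


def connected_components(graph):
    """Return the list of connected components, each as a set of nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Assumed edges: two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
graph = {r: set() for r in range(1, 7)}
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

clusters = connected_components(cut_bridges(graph))
# Two labels are enough for this example, which yields exactly two clusters.
patient_ids = {r: label for label, cluster in zip("AB", clusters) for r in cluster}
print(patient_ids)  # {1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B'}
```

Note that this only works as intended when each cluster is internally well connected. If records within a cluster were linked only in a chain, every edge would be a bridge, so a practical algorithm would need a more careful criterion for which edges to cut.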
Understanding the complete set of patient interrelationships is a vital piece of assigning accurate patient-level IDs. In this example, although 12345 is a real zip code, it could also have been a filler value in this dataset.
Representing the patient records in the form of a match graph enabled a view of the patient population conducive to an algorithm that made the sensible decision of splitting records 1, 2, and 3 from records 4, 5, and 6. Real-world data is messier and more complex than our illustrative example, and population-level considerations become even more critical for patient matching.
In practice, the manual inspection of the match graph in our example can be implemented as a programmatic algorithm. The result is a means of assigning consistent patient IDs across datasets that is robust in the face of real-world data.
Editor’s note: This post was updated in December 2022 for accuracy and comprehensiveness.
AnalyticsIQ, a marketing data and analytics company, recently adopted Datavant’s state de-identification process to enhance the privacy of its SDOH datasets. By undergoing this privacy analysis prior to linking its data with other datasets, AnalyticsIQ has taken an extra step that could contribute to a more efficient Expert Determination (which is required when its data is linked with others in Datavant’s ecosystem).
AnalyticsIQ’s decision to adopt state de-identification standards underscores the importance of privacy in the data ecosystem. By addressing privacy challenges head-on, AnalyticsIQ and similar partners are poised to lead clinical research forward, providing datasets that are not only compliant with privacy requirements, but also ready for seamless integration into larger datasets.
“Stakeholders across the industry are seeking swift, secure access to high-quality, privacy-compliant SDOH data to drive efficiencies and improve patient outcomes,” says Christine Lee, head of health strategy and partnerships at AnalyticsIQ.
“By collaborating with Datavant to proactively perform state de-identification and Expert Determination on our consumer dataset, we help minimize potentially time-consuming steps upfront and enable partners to leverage actionable insights when they need them most. This approach underscores our commitment to supporting healthcare innovation while upholding the highest standards of privacy and compliance.”
As the regulatory landscape continues to evolve, Datavant’s state de-identification product offers an innovative tool for privacy officers and data custodians alike. By addressing both state-specific and HIPAA requirements, companies can stay ahead of regulatory demands and build trust across data partners and end-users. For life sciences organizations, this can lead to faster, more reliable access to the datasets they need to drive research and innovation while supporting high privacy standards.
As life sciences companies increasingly rely on SDOH data to drive insights, the need for privacy-preserving solutions grows. Data ecosystems like Datavant’s, which link real-world datasets while safeguarding privacy, are critical to driving innovation in healthcare. By integrating state de-identified SDOH data, life sciences organizations can gain a more comprehensive view of patient populations, uncover social factors that impact health outcomes, and ultimately guide clinical research that improves health.
Both payers and providers are increasingly utilizing SDOH data to enhance care delivery and improve health equity. By incorporating SDOH data into their strategies, both groups aim to deliver more personalized care, address disparities, and better understand the social factors affecting patient outcomes.
Payers increasingly leverage SDOH data to meet health equity requirements and enhance care delivery:
Payers’ consideration of SDOH underscores their commitment to improving health equity, delivering targeted care, and addressing disparities for vulnerable populations.
Capital District Physicians’ Health Plan (CDPHP) incorporated SDOH data, partnering with Papa to combat loneliness and isolation in older adults, families, and other vulnerable populations. CDPHP aimed to address:
By integrating SDOH data, CDPHP enhanced its services to deliver comprehensive care for its Medicare Advantage members.
Value-based care organizations face challenges in fully understanding their patient panels. SDOH data helps providers address these challenges and improve patient care. Here are some examples of how:
By leveraging SDOH data, providers gain a more comprehensive understanding of their patient population, leading to more targeted and personalized care interventions.
While accessing SDOH data offers significant advantages, challenges can arise from:
To overcome these challenges, providers must have robust data integration strategies, standardization efforts, and access to health data ecosystems to ensure comprehensive and timely access to SDOH data.
With Datavant, healthcare organizations can securely access SDOH data and further enhance the efficiency of their datasets through state de-identification capabilities, empowering stakeholders across the industry to make data-driven decisions that drive care forward.