Hackathon Preview: Synthetic Data Within the Ecosystem of Healthcare Innovation
More innovation, fewer guardrails
From September 8–11, 2022, Datavant will host our first annual Future of Healthcare Hackathon. To get participants more deeply engaged with some of the cutting-edge technologies currently making waves in healthcare, we are collaborating with several industry partners to provide datasets for use in projects. Alongside price transparency data from Turquoise and a variety of datasets related to provider/payer information sharing from Datavant, participants will have access to synthetic datasets provided by Syntegra.
We had a conversation with Syntegra’s Head of Growth, Carter Prince, about synthetic data’s role within the healthcare ecosystem and how it reduces privacy risk while increasing data utility to improve research and drive innovation.
We’re very excited to have this opportunity to collaborate with Syntegra, and to share synthetic data sets as part of the Future of Healthcare Hackathon. Can you summarize what synthetic data is within the world of healthcare and how it can be useful?
Synthetic data looks and acts just like real data, maintaining all of the statistical accuracy of the original data but containing fake (synthetic) patients. This protects patient privacy even beyond the regulatory guardrails of HIPAA or GDPR because the patients being statistically represented don’t actually exist. As a result, healthcare data can be used and shared much more easily and quickly without facing the typical privacy and administrative barriers.
Beyond privacy, synthetic data greatly increases the utility of healthcare data. It allows for the use of more granular, patient-level data, which can also be augmented and customized, such as by increasing population size or addressing areas of bias. Synthetic data supports a variety of use cases: improving the accuracy of algorithms by providing the volume and type of data needed for model development and testing, expanding rare or small cohorts to improve precision medicine research, enabling access to hard-to-obtain EU or rare disease data, and much more.
Synthetic data is driving innovation, especially in the digital health space. Synthetic data has the unique capability of providing access to large amounts of diverse, representative, patient-level data, addressing a great need in the development and testing of AI/ML models to improve their accuracy in real-world settings. Digital health and health tech companies, especially those in their earlier stages, often struggle to find the right data, a process that can take months or well over a year, if they can access it at all. And when real data is accessible, it is often stripped of important fields in order to maintain patient privacy, which greatly reduces its utility. Syntegra’s synthetic datasets give immediate access to the data these companies need to build and test new products, significantly accelerating deployment of these tools. Through our partnership with Tuva Health, we also provide an analytics-ready format of both our EHR and claims datasets, removing a lot of the guesswork and time it takes to process healthcare data from its raw form, making it more usable for analytics and AI/ML.
Outside of digital health, we are also working with pharmaceutical organizations to leverage the rapid access and flexibility of synthetic data to help internal teams, such as real-world evidence teams, better explore datasets for feasibility and study design before conducting a final analysis on actual patient data.
How has the use of synthetic data evolved within the healthcare industry?
The use of synthetic data in the healthcare industry is still relatively new, as previous approaches to synthetic data generation have been largely unsuccessful, limiting its adoption. Early methods used rules-based approaches and suffered from low accuracy. More recently, generative adversarial networks have been used with simple, tabular data, but they fail to capture the full complexity of healthcare data.
Syntegra uses a groundbreaking machine learning approach, based on transformer language models, to generate synthetic data. This allows us to create complex, longitudinal healthcare data and work with all types of structured data in any data format. Our model, the Syntegra Medical Mind, learns the underlying distribution of real healthcare data (such as EHR, claims, genomics, and more) represented as a temporal sequence of medical events, then uses the learned distribution to generate completely new (synthetic) patient records. Learn more about our approach and challenges with early methods on the Syntegra blog.
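For Hackathon participants who want an intuition for "learn the distribution of event sequences, then sample new ones," here is a minimal Python sketch. It is not Syntegra's method: it swaps the transformer for a much simpler first-order Markov chain, and the event names, `fit_transitions`, and `sample_record` are hypothetical names invented for this illustration.

```python
import random
from collections import Counter, defaultdict

# Sentinel tokens marking the beginning and end of a patient record.
START, END = "<start>", "<end>"

def fit_transitions(records):
    """Count how often each event follows another across all records."""
    counts = defaultdict(Counter)
    for events in records:
        path = [START, *events, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_record(counts, rng, max_events=20):
    """Generate one synthetic event sequence by walking the learned transitions."""
    events, current = [], START
    while len(events) < max_events:
        choices = counts[current]
        # Pick the next event in proportion to how often it was observed.
        current = rng.choices(list(choices), weights=list(choices.values()))[0]
        if current == END:
            break
        events.append(current)
    return events

# Made-up toy records for illustration only, not real patient data.
toy_records = [
    ["office_visit", "lab_hba1c", "dx_type2_diabetes", "rx_metformin"],
    ["office_visit", "lab_hba1c", "office_visit"],
    ["er_visit", "dx_chest_pain", "lab_troponin", "discharge"],
]

model = fit_transitions(toy_records)
rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic = [sample_record(model, rng) for _ in range(5)]
for record in synthetic:
    print(record)
```

A model this simple only remembers the previous event, so it can recombine fragments of the toy records but cannot capture long-range medical history. That limitation is one reason richer sequence models are attractive for longitudinal data.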
Trust in and use of synthetic data are growing as its fidelity and utility continue to improve and its capabilities and potential become better known. Syntegra’s language model approach allows us to work with longitudinal healthcare data, capture a full-scale, dense medical history, and maintain multivariate accuracy. We’ve also developed a set of metrics for validating both the fidelity and privacy of synthetic data to ensure a high level of accuracy and privacy preservation.
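The interview does not specify Syntegra's validation metrics, so the following Python sketch uses two generic stand-ins that participants could try on their own projects: total variation distance between event frequencies as a rough fidelity check, and the share of synthetic records that exactly copy a real record as a rough privacy check. The function names and toy records are assumptions made for illustration.

```python
from collections import Counter

def event_frequencies(records):
    """Relative frequency of each event code across all records."""
    counts = Counter(e for r in records for e in r)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def total_variation_distance(real, synthetic):
    """Fidelity proxy: 0.0 means identical event frequencies, 1.0 means disjoint."""
    p, q = event_frequencies(real), event_frequencies(synthetic)
    return 0.5 * sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in set(p) | set(q))

def exact_copy_rate(real, synthetic):
    """Privacy proxy: fraction of synthetic records identical to some real record."""
    real_set = {tuple(r) for r in real}
    return sum(tuple(r) in real_set for r in synthetic) / len(synthetic)

# Made-up toy records for illustration only.
real = [["visit", "lab", "rx"], ["visit", "dx"], ["er", "dx", "rx"]]
synth = [["visit", "lab", "rx"], ["visit", "dx", "rx"], ["er", "lab"]]

print("fidelity (TVD):", total_variation_distance(real, synth))
print("exact copies:", exact_copy_rate(real, synth))
```

Both checks are deliberately crude. Comparing single-event frequencies ignores the multivariate and temporal structure mentioned above, and an exact-match test misses near-copies, so a serious evaluation would need stronger measures on both axes.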
Syntegra and Datavant are working in complementary roles with regard to making healthcare data more widely accessible. Given such partnerships, what do you imagine the healthcare data landscape will look like in the future?
We believe high-fidelity synthetic data will become part of the “healthcare data stack” for all healthcare organizations. There will always be a need to work with real patient data with a traceable provenance, an area in which we see Datavant as a current and growing leader, but the use of this data can be complemented and informed by the use of synthetic data. Open-ended exploration, for example, is often impossible with de-identified real data due to well-deserved patient privacy restrictions. Synthetic data, however, can be used in this way, presenting an opportunity for teams to be more data-driven in the early stages of hypothesis and study design, or product testing and development. Insights at this early stage can then be taken further in real datasets, and synthetic data can then serve as a complement to the real data by filling any gaps where necessary. We recently worked with a global pharma company that used synthetic data to directly access an EU dataset they wouldn’t otherwise have been able to access. This gave them a deep understanding of the underlying data structure and statistics, which will improve and accelerate future real-world evidence studies with this data.
Thanks, Carter! We’re looking forward to seeing some amazing Hackathon entries.
About the Future of Healthcare Hackathon:
Datavant has hosted several hackathons over the past few years. One major highlight was the 2020 Pandemic Response Hackathon, which drew over 1,600 participants and 230 submissions and involved 30+ co-partners. Have a look at the 2020 project showcase to see some especially impressive submissions.
The Future of Healthcare Hackathon is a virtual event taking place from Sept. 8 to Sept. 11. Submissions will be reviewed by our judging panel, which includes David Shulkin, former U.S. Secretary of Veterans Affairs; Niall Brennan, Chief Analytics and Privacy Officer at Clarify (formerly at the Healthcare Cost Institute); Clare Bernard, Ph.D., Senior Director, Data Sciences Platform at Broad Institute; and more.
Winners can bring their projects to life by leveraging our prize pool, which includes cash prizes and the opportunity to travel to Washington, D.C. to present at the annual Future of Health Data Summit (on September 15). Presenters at this conference will include former and current heads of the FDA, the former U.S. Secretary of Veterans Affairs, the Chief Data Officer of the Broad Institute, and the Federal CIO. About 250 high-profile leaders in healthcare, tech, and policy will be in attendance, along with members of the press.