Clinical trials are the backbone of medical innovation, and their number is growing rapidly: more than 477,000 studies were registered as of 2024, a 16 percent increase over the number registered just two years prior.
As life sciences organizations strive to generate high-quality evidence more efficiently, trial tokenization has emerged as a key strategy to enhance insights from clinical trials with real-world data (RWD) while maintaining patient privacy.
Tokenization enables the de-identification and privacy-preserving linkage of disparate sources of patient data. Sponsors are empowered to connect trial participants to external healthcare data sources—such as electronic health records (EHRs), claims data, pharmacy records, and lab results—without exposing personally identifiable information. In clinical research, this approach is unlocking more efficient ways to enhance long-term patient follow-up, confirm medical history, and drive novel research insights.
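To make the idea concrete, here is a minimal sketch of how a token might be derived from identifying fields. It assumes a keyed hash (HMAC-SHA256) over normalized name and date-of-birth fields. The key, the choice of fields, and the function names are illustrative assumptions, not a description of Datavant's actual tokenization process.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret for illustration only; real deployments manage keys securely.
SECRET_KEY = b"example-key-not-for-production"


def normalize(value: str) -> str:
    """Strip accents, case, and punctuation so formatting differences do not change the token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(first_name: str, last_name: str, dob: str, key: bytes = SECRET_KEY) -> str:
    """Derive a token from normalized identifiers using a keyed hash (HMAC-SHA256)."""
    message = "|".join(normalize(part) for part in (first_name, last_name, dob))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# The same person recorded with different formatting in two systems.
trial_token = make_token("José", "O'Brien", "1980-02-29")
ehr_token = make_token("jose", "OBRIEN", "1980/02/29")
print(trial_token == ehr_token)  # True
```

Because only the token leaves each system, two data holders can recognize the same individual without exchanging names or birth dates. Production systems typically add further safeguards that this sketch leaves out.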
The volume of trials that Datavant is tokenizing has grown nearly 300% since 2022, reaching approximately 270 trials by the end of 2024.
Based on our work with top life sciences organizations, it’s clear that tokenization is shifting from an emerging concept to a foundational practice in clinical development — with some companies now making it a default for all new trials.
Figure 1: Total volume of trials Datavant has been contracted to tokenize, by year.
As interest in trial tokenization and RWD linkage continues to accelerate, Datavant analyzed trends across segments of the trials we’re tokenizing to uncover how life sciences organizations are adopting these practices today—and where the industry is headed next.
Drawing on our proprietary data and real-world sponsor use cases, this blog post delivers a forward-looking view into the evolving role of tokenization in clinical development, including:
The key trends driving the adoption of trial tokenization,
The therapeutic areas at the forefront of tokenization,
How sponsors are maximizing the value of their tokenized data with linkage,
Best practices in trial tokenization, and
What’s next in tokenization.
The Rise of Trial Tokenization: Benefits & Industry Adoption
Tokenizing clinical trial data can transform it into a privacy-preserving, linkable dataset, providing the opportunity to develop richer datasets, perhaps even surpassing the original trial data in value.
Once generated, tokens from clinical trial cohorts can be used to link clinical trial data to additional patient records and third-party datasets like medical claims, genomics, and retail pharmacy data, providing sponsors with valuable insights into patient journeys, treatment outcomes, long-term health impacts, and future research applications.
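As a rough illustration of token-based linkage, the sketch below joins a trial cohort to an external claims dataset on a shared token and reports the share of participants with at least one match. The records, field names, and token values are invented for the example, and the helper name link_by_token is hypothetical.

```python
from collections import defaultdict

# Made-up, already de-identified records keyed by token.
trial_records = [
    {"token": "a1f3", "arm": "treatment"},
    {"token": "b7c9", "arm": "placebo"},
    {"token": "d2e8", "arm": "treatment"},
]
claims_records = [
    {"token": "a1f3", "claim_date": "2024-06-12", "code": "E11.9"},
    {"token": "a1f3", "claim_date": "2024-09-30", "code": "I10"},
    {"token": "d2e8", "claim_date": "2024-07-04", "code": "E66.9"},
    {"token": "zz00", "claim_date": "2024-01-20", "code": "J45.909"},  # not a trial participant
]


def link_by_token(trial, external):
    """Return {token: [external rows]} for every trial participant, matched on token only."""
    rows_by_token = defaultdict(list)
    for row in external:
        rows_by_token[row["token"]].append(row)
    return {p["token"]: rows_by_token.get(p["token"], []) for p in trial}


linked = link_by_token(trial_records, claims_records)
match_rate = sum(1 for rows in linked.values() if rows) / len(linked)
print(f"{match_rate:.0%} of participants matched to claims")  # 67% of participants matched to claims
```

Participants without a match stay in the result with an empty list, which keeps the denominator visible when assessing how much of a cohort a given data source covers.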
What’s spurring the acceleration of trial tokenization and RWD linkage?
Tokenization and RWD linkage unlock critical insights for sponsors and partners throughout the clinical research lifecycle.
From molecule to medicine, life sciences organizations are accelerating adoption of tokenization to:
Enhance evidence generation – By linking trial data with RWD from the outset, sponsors can validate medical history, assess long-term treatment effectiveness, and better understand disease progression and treatment journeys. Sponsors that implement tokenization at trial kickoff are empowered to use their trial tokens immediately, initiating early data partner exploration for accelerated downstream linkage and confirming medical history information such as vaccination events and medications.
Enable long-term patient follow-up – Tokenization and RWD linkage reduce long-term site and patient burden by allowing sponsors to study efficacy and safety outcomes via passive data collection beyond trial timelines. Tokenization also provides researchers the opportunity to understand why patients are lost to follow-up through mortality and social determinants of health (SDOH) linkage.
Support regulatory and payer requirements – With increasing demand for post-marketing commitments (PMCs), post-marketing requirements (PMRs), and long-term safety data, tokenization supports comprehensive evidence collection across trial and non-trial populations.
Bridge follow-up into commercialization – Maintaining visibility into post-trial therapy switching, prescribing behavior, adherence, and disease progression supports label expansion and comprehensive patient journey mapping. This can result in accelerated speed to insight and earlier revenue generation.
Optimize research efficiency – We’re seeing sponsors leverage tokenization to generate and de-duplicate control arms, detect professional patient enrollment, and integrate existing datasets, significantly reducing study costs and duplicative data spend.
Over time, as organizations increase their number of tokenized trials, they can centralize and leverage the data to optimize future trial design, improve estimations of event rates, and accelerate natural history studies.
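One of the efficiency use cases above, detecting participants who appear in more than one study, can be sketched as a simple grouping on tokens. This assumes tokens are comparable across the trials being checked. The trial IDs, token values, and function name are hypothetical.

```python
from collections import defaultdict


def find_repeat_enrollments(enrollments):
    """Given (trial_id, token) pairs, return {token: sorted trial IDs} for tokens seen in more than one trial."""
    trials_by_token = defaultdict(set)
    for trial_id, token in enrollments:
        trials_by_token[token].add(trial_id)
    return {
        token: sorted(trial_ids)
        for token, trial_ids in trials_by_token.items()
        if len(trial_ids) > 1
    }


# Made-up enrollment log across three studies.
enrollments = [
    ("TRIAL-A", "a1f3"),
    ("TRIAL-A", "b7c9"),
    ("TRIAL-B", "a1f3"),
    ("TRIAL-B", "d2e8"),
    ("TRIAL-C", "a1f3"),
    ("TRIAL-C", "b7c9"),
]

repeats = find_repeat_enrollments(enrollments)
print(repeats)  # {'a1f3': ['TRIAL-A', 'TRIAL-B', 'TRIAL-C'], 'b7c9': ['TRIAL-A', 'TRIAL-C']}
```

The same grouping can flag duplicate individuals when assembling a control arm from several sources, since a token repeated across sources points to one underlying person.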
Industry Adoption Trends
From our analysis of the more than 200 trials Datavant is tokenizing, several key observations emerge:
Tokenization is expanding across all trial phases, including early-phase studies (Phase I and II) in which patient populations are smaller and each data point is highly valuable.
Large pharma, biotech, and diagnostic companies are adopting tokenization and data linkage for different strategic goals, from new molecular entities with high potential for label expansion to regulatory compliance, payer requirements, and commercial differentiation.
Among life sciences companies, tokenization continues to be prioritized for large studies, studies requiring long-term follow-up, and new therapies for which early and extensive healthcare resource utilization insights are critical.
Rare diseases and metabolic disorders are emerging areas of interest for trial tokenization.
Figure 2: Common use cases for trial tokenization Datavant has observed.
Key takeaway: Tokenization and RWD linkage are no longer a niche practice – they are becoming essential tools for modern clinical development.
Key Trends in Trial Tokenization
Trend 1: Therapeutic Areas Leading Tokenization Adoption
Figure 3. Therapeutic areas represented across 115 of the more than 250 clinical trials contracted for tokenization by Datavant. The ‘Other’ category represents a summation of TAs that independently represented <5% of trials, including asthma/allergy, cardiovascular, hepatobiliary, dermatology, genitourinary, sleep disorders, and women’s health.
According to our data, the top three therapeutic areas where tokenization is most prevalent are:
Psychiatric Disorders – Psychiatric trials may seem a surprising top therapeutic area for tokenization activity, as they traditionally rely on patient- and clinician-reported outcomes, which are not always captured or readily available in structured RWD. That said, gaining a comprehensive understanding of these patient populations requires complex patient journey mapping across multiple treatment modalities and healthcare settings.
The growing availability of real-world behavioral health data is a key driver of tokenization growth, and Datavant’s extensive health data ecosystem—spanning 300+ RWD partners—uniquely enables targeted psychiatric research. Our partners include key providers of specialized behavioral health data to answer critical research questions in this space.
We’re now seeing sponsors recognize tokenization as a modality to document historical treatment pathways and therapy-switching patterns for conditions such as schizophrenia, depression, and bipolar disorder, creating a more complete picture of patient care.
Obtaining informed consent and having IRB oversight is critical when using trial tokenization and RWD linkage in any therapeutic area. This is especially important in studies of psychiatric disorders due to the heightened sensitivity of mental health information and the potential for stigma. Ensuring participants fully understand how their data will be de-identified, linked, and used safeguards their autonomy and privacy, builds trust, and upholds ethical standards essential to research involving vulnerable populations.
Screening & Diagnostics – Companies developing diagnostic tests and screening tools are increasingly using tokenization to validate test performance in real-world settings. By linking early diagnostic data to longitudinal patient health records, sponsors assess the impact of early detection on long-term health outcomes, treatment decisions, and cost-effectiveness.
Oncology – Long-term follow-up, which stretches 10-15 years for many new cell and gene therapies, is essential in oncology research, alongside treatment pattern analysis and overall survival assessment. Tokenization plays a key role in enabling sponsors to link clinical trial data with mortality records, EHRs, and imaging data, enhancing regulatory submissions and post-market monitoring. This methodology can significantly reduce overall long-term follow-up costs.
Emerging Areas of Interest:
Rare diseases – Rare disease trials face unique challenges due to their small patient populations. With these trials, every enrolled participant offers a critical source of data. Tokenization allows sponsors to use RWD to understand disease progression, treatment durability, and long-term health outcomes while reducing the burden of excessive site visits, ensuring studies collect meaningful real-world evidence even with limited patient numbers.
Tokenization further supports the development of external control arms, providing a scientifically rigorous comparator while ensuring that all patients in the trial have access to the investigational therapy.
Metabolic disorders (diabetes, obesity) – Metabolic disorders require long-term treatment monitoring to assess effectiveness and safety beyond the clinical trial setting, where tokenization and data linkage can play a key role. Given the high prevalence of existing or emerging comorbidities in trials for these common diseases, long-term RWD linkage can enable the analyses needed to uncover unexpected drug effects in new disease areas earlier, identifying associations that were not originally the focus of the clinical trials.
For example, recent epidemiological studies suggest that GLP-1 receptor agonists, such as semaglutide, may be linked to a reduced risk of Alzheimer’s disease, sparking interest in their potential neuroprotective effects. If validated in larger studies, these findings could accelerate drug repurposing efforts, guiding clinical trials and expanding therapeutic applications beyond their original indications.
Trend 2: Expansion of Tokenization across All Clinical Trial Phases
Traditionally, tokenization was most common in late-stage trials (Phase III & IV), where large patient populations and regulatory requirements make real-world data integration a natural fit and often a necessity. However, adoption is now increasing in earlier-phase studies (Phase I & II), particularly for rare diseases and personalized medicine, where patients are enrolled early in the development cycle and every data point is critical for assessing treatment efficacy.
Figure 4. Distribution of clinical trial phases, as reported by sponsors or listed on ClinicalTrials.gov, for ~50 of the trials under Datavant contracts.
Phase I and II trials: Sponsors are using tokenization and RWD linkage to understand disease progression and validate real-world endpoints before pivotal trials begin. For innovative studies, particularly in cell and gene therapy, tokenizing patients at enrollment enables sponsors to follow patient journeys from early-phase through post-approval use, ensuring a continuous flow of data that supports commercial differentiation and regulatory decision-making.
Phase III and IV trials: Tokenization supports data enrichment, post-marketing studies, and label expansions by enabling RWD integration. Later-phase trials with large participant pools are most likely to yield a high proportion of RWD matches, enabling robust post-trial analysis. As payers and regulators continue to emphasize long-term safety and effectiveness, and increasingly recognize the utility of linked RWD as part of that evidence generation process, tokenization is becoming an ever more essential tool for understanding healthcare utilization, treatment adherence, and patient outcomes.
Takeaway: Pharma companies are beginning to invest in tokenization earlier in the drug development lifecycle to streamline future studies within drug programs and across therapeutic areas. By integrating insights early, sponsors set the stage for real-world data to be seamlessly incorporated into their integrated evidence plan, helping to better prepare for unexpected questions or evolving regulatory expectations, and improving study design.
Trend 3: Who is Tokenizing? Sponsor Adoption Trends
Figure 5. Sponsor types represented across all 250+ clinical trials contracted for tokenization by Datavant. Sponsor categories are defined as: Top 20 Pharma: Large, multinational pharmaceutical companies ranked among the top 20 globally by annual revenue. Top 21–50 Pharma: Mid-sized pharmaceutical companies ranked between 21st and 50th globally by annual revenue. Biotech: Companies primarily engaged in the research and development of innovative therapies—often biologics, gene therapies, or targeted treatments—using advanced biological or genetic technologies. Other: Organizations involved in the clinical research ecosystem but not primarily focused on developing therapeutic or diagnostic products.
Goals and use cases we observe for tokenization tend to vary by company size.
Top 20 pharmaceutical companies are scaling tokenization efforts across their portfolios, using linked datasets to optimize research costs and accelerate regulatory approval.
Mid-sized and early-stage biotechs leverage tokenization and data linkage to maximize insights from small patient populations, ensuring every participant’s data is fully utilized.
Diagnostics and screening companies rely on tokenization to validate test performance in real-world settings, meeting payer and regulatory requirements.
Looking Ahead: The Future of Trial Tokenization and RWD Linkage
Best Practices for Implementing Tokenization and RWD Linkage Successfully
Start early in trial design. Tokenization should be integrated into trial planning from the outset. By aligning tokenization strategies with protocol development, informed consent processes, and data collection, sponsors lay the groundwork for a smooth operational implementation and for future linkages to be feasible and impactful.
Ensure a privacy-first approach. A robust privacy framework is essential for ensuring regulatory compliance and patient trust. Sponsors should work with trusted tokenization partners that adhere to the highest standards of data security, governance, and de-identification.
Engage stakeholders across the organization. Tokenization is not just for clinical development teams. By involving key stakeholders early (including HEOR, commercial, and medical affairs teams), sponsors can expand the utility of their trial tokens to link data across the entire drug development lifecycle.
Tokenize across clinical development programs. Tokenizing multiple trials within a single clinical development program, or across several programs, builds a comprehensive internal resource, a linkable data infrastructure, that facilitates in-depth insights across data assets.
Keep the patient at the center. The ultimate goal of tokenization and RWD linkage is to improve patient outcomes. Tokenization strategies should be designed for transparency and to minimize patient burden, enhance long-term safety monitoring, and generate insights that translate into better treatments.
Partner with experts. A successful tokenization and RWD program requires specialized expertise, and ideally the ability to supplement real-world datasets with key clinical variables from medical records. Sponsors should collaborate with trusted tokenization and RWD providers to ensure that their strategy is fit-for-purpose, scalable, privacy-preserving, and aligned with industry best practices.
What’s next in trial tokenization?
Leading pharmaceutical companies are increasingly investing in tokenizing all new trials, with some creating linkable data infrastructures that can be leveraged across multiple research programs. This shift is positioning tokenization as a critical component of future clinical research and evidence generation.
Spotlight on AnalyticsIQ: Privacy Leadership in State De-Identification
AnalyticsIQ, a marketing data and analytics company, recently adopted Datavant’s state de-identification process to enhance the privacy of its SDOH datasets. By undergoing this privacy analysis prior to linking its data with other datasets, AnalyticsIQ has taken an extra step that could contribute to a more efficient Expert Determination (which is required when its data is linked with others in Datavant’s ecosystem).
AnalyticsIQ’s decision to adopt state de-identification standards underscores the importance of privacy in the data ecosystem. By addressing privacy challenges head-on, AnalyticsIQ and similar partners are poised to lead clinical research forward, providing datasets that are not only compliant with privacy requirements, but also ready for seamless integration into larger datasets.
“Stakeholders across the industry are seeking swift, secure access to high-quality, privacy-compliant SDOH data to drive efficiencies and improve patient outcomes,” says Christine Lee, head of health strategy and partnerships at AnalyticsIQ.
“By collaborating with Datavant to proactively perform state de-identification and Expert Determination on our consumer dataset, we help minimize potentially time-consuming steps upfront and enable partners to leverage actionable insights when they need them most. This approach underscores our commitment to supporting healthcare innovation while upholding the highest standards of privacy and compliance.”
Based on the trends we’ve observed, we expect adoption of trial tokenization to expand further in:
Early-phase trials – Rare diseases and personalized therapies will increasingly rely on tokenization for real-world evidence generation.
Metabolic disorders – Reflecting pipeline growth, more trials in diabetes, cardiovascular, and obesity are expected to adopt tokenization.
Enterprise-wide adoption – Top biopharma companies are moving toward tokenizing the majority of their clinical trials, setting the stage for richer long-term insights and commercial strategies.
Mid-sized & emerging biotechs – More companies will integrate tokenization into early-stage R&D decisions to maximize long-term data value.
Building Trust in Privacy-Preserving Data Ecosystems
As the regulatory landscape continues to evolve, Datavant’s state de-identification product offers an innovative tool for privacy officers and data custodians alike. By addressing both state-specific and HIPAA requirements, companies can stay ahead of regulatory demands and build trust across data partners and end-users. For life sciences organizations, this can lead to faster, more reliable access to the datasets they need to drive research and innovation while supporting high privacy standards.
As life sciences companies increasingly rely on SDOH data to drive insights, the need for privacy-preserving solutions grows. Data ecosystems like Datavant’s, which link real-world datasets while safeguarding privacy, are critical to driving innovation in healthcare. By integrating state de-identified SDOH data, life sciences organizations can gain a more comprehensive view of patient populations, uncover social factors that impact health outcomes, and ultimately guide clinical research that improves health.
The Power of SDOH Data with Providers and Payers to Close Gaps in Care
Both payers and providers are increasingly utilizing SDOH data to enhance care delivery and improve health equity. By incorporating SDOH data into their strategies, both groups aim to deliver more personalized care, address disparities, and better understand the social factors affecting patient outcomes.
Tailored Member Programs: Payers develop specialized initiatives like nutrition delivery services and transportation to and from medical appointments.
Identifying Care Gaps: SDOH data helps payers identify gaps in care for underserved communities, enabling strategic in-home assessments and interventions.
Future Risk Adjustment Models: The Centers for Medicare & Medicaid Services (CMS) plans to incorporate SDOH-related Z codes into risk adjustment models, recognizing the significance of SDOH data in assessing healthcare needs.
Payers’ consideration of SDOH underscores their commitment to improving health equity, delivering targeted care, and addressing disparities for vulnerable populations.
Example: CDPHP supports physical and mental wellbeing with non-medical assistance
By integrating SDOH data, CDPHP enhanced its services to deliver comprehensive care for its Medicare Advantage members.
Providers Optimize Value-Based Care Using SDOH Data
Value-based care organizations face challenges in fully understanding their patient panels. SDOH data helps providers address these challenges and improve patient care. Here are some examples:
Onboard Patients Into Care Programs: Providers use SDOH data to identify patients who require additional support and connect them with appropriate resources.
Stratify Patients by Risk: SDOH data combined with clinical information identifies high-risk patients, enabling targeted interventions and resource allocation.
Manage Transition of Care: SDOH data informs post-discharge plans, considering social factors to support smoother transitions and reduce readmissions.
By leveraging SDOH data, providers gain a more comprehensive understanding of their patient population, leading to more targeted and personalized care interventions.
While accessing SDOH data offers significant advantages, challenges can arise from:
Lack of Interoperability and Uniformity: Data exists in fragmented sources like electronic health records (EHRs), public health databases, social service systems, and proprietary databases. Integrating and securing data while ensuring data integrity and confidentiality can be complex, resource-intensive and risky.
Lag in Payer Claims Data: Payers can take weeks or months to release claims data. This delays informed decision-making, care improvement, analysis, and performance evaluation.
Incomplete Data Sets in Health Information Exchanges (HIEs): Not all healthcare providers or organizations participate in HIEs. This reduces the available data pool. Moreover, varying data sharing policies result in data gaps or inconsistencies.
SDOH data holds immense potential in transforming healthcare and addressing health disparities.
With Datavant, healthcare organizations are securely accessing SDOH data and further enhancing the efficiency of their datasets through state de-identification capabilities, empowering stakeholders across the industry to make data-driven decisions that drive care forward.
Key takeaway: As the volume of trials that Datavant tokenizes continues to grow, a key observation is that sponsors that integrate privacy-preserving linkage solutions early are the ones best-positioned to accelerate research, optimize commercial strategies, and ultimately advance patient care.
It’s Time to Leverage Tokenization and RWD Linkage as a Competitive Advantage
As trial tokenization scales across clinical development, it is evolving from a data privacy tool into a strategic asset that enhances trial design, regulatory and payer submissions, and long-term evidence generation. Sponsors that embed tokenization early in trial planning are better positioned to unlock deeper insights, drive innovation, and improve patient outcomes.