As part of the tragedy of COVID-19, tens of thousands of new cases are diagnosed in the US each day. In America, the data from these patients are captured across electronic medical records, medical claims, diagnostic tests, pharmacies, and mortality records.
Buried in this data are the answers to many of the questions vexing researchers: how the disease progresses, the impact of various co-morbidities, the safety and efficacy of various therapeutics, the genetic correlations of the disease, and the demographic disparities of the disease’s impact. Tens of thousands of researchers are hard at work trying to understand snapshots of the data to improve our understanding.
But the data is fraught with challenges: it is siloed across institutions, coded differently in different contexts, often incomplete, and not randomized. Data must be fit for purpose to answer the questions we face as a society. And patient privacy must be protected as researchers answer these questions.
This post is intended as a guide to the real-world data available within America’s health data infrastructure for the tens of thousands of researchers seeking to better understand COVID-19.
***
While the COVID-19 pandemic has resulted in enormous suffering and cost, it also has been a catalyst for changes that healthcare industry veterans, innovators, and patients have spent decades advocating for, and which are now happening in a matter of months.
Perhaps the most significant of these is the use of so-called “real-world data” (or “RWD”) – that is, any health data that is collected in the ordinary course of care rather than in the context of a controlled clinical trial. Researchers can use RWD to better understand disease burden, the effectiveness of drug and non-drug treatments, and the impact of both government and social support programs on disease outcomes.
Used correctly, real-world data can answer questions like:
Real-world data has enormous advantages, including scale (each year, across the U.S., people have billions of interactions with the healthcare system), timeliness (much of this data can be made available within weeks or days), and cost (compared to large randomized controlled trials that may cost north of $100 million to run, RWD studies are inexpensive to conduct). Of course, such data also has limitations – it is fragmented across the healthcare system, can be messy and incomplete, and studies need to be thoughtfully designed to prevent unjustified inferences.
Real-world data is a vast and still mostly untapped resource in U.S. healthcare, but we will have to rapidly improve our ability to make effective use of it. That means understanding what combinations of real-world data are “fit for purpose” to answer different research questions, and what data can be used to validate findings through parallel studies.
And importantly: it is critical that information collected as real-world data is collected ethically, that data is adequately de-identified, and that patient privacy is protected throughout the process.
We’re providing here a guide to (1) the value and limitations of major real-world data types and (2) the necessity of linking multiple RWD data sets at the patient level to create true “fit-for-purpose” data sets, filling gaps and creating more valuable data sets for a variety of pressing research questions.
As we’ve written about in the past, data in the US is distributed across a complex ecosystem of thousands of institutions, captured in this diagram:
Below is an overview of the major types of data that can be used for COVID-19 related studies, and the strengths and weaknesses of each data type for specific analyses.
Mortality data, despite being one of the most important endpoints in healthcare analysis, is surprisingly hard to come by in traditional healthcare data sets. Unless a patient dies during care (in a hospital or long-term care facility), the event is not recorded in EHRs or other standard data flows. Mortality data may include the date of death, cause of death, and other demographic and geographic information on the deceased.
Fit-For-Purpose Assessment: Given its absence from many traditional health data sets, mortality data must be compiled from a variety of sources, including government databases (death certificates are available from states, and both CDC and the Social Security Administration maintain death indices), obituary data, and/or life insurance data.
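As a rough illustration of compiling mortality data from several sources, the sketch below keeps one death record per patient and prefers the more authoritative source when two disagree. The records, the `patient_token` linkage key, and the priority order are all assumptions made for this example; a real project would choose its own rules.

```python
from datetime import date

# Hypothetical, already de-identified death records grouped by source type.
# "patient_token" is an assumed linkage key, not a field named in this post.
sources = {
    "death_certificate": [
        {"patient_token": "tok_a", "date_of_death": date(2020, 4, 2)},
    ],
    "death_index": [
        {"patient_token": "tok_b", "date_of_death": date(2020, 5, 17)},
    ],
    "obituary": [
        {"patient_token": "tok_a", "date_of_death": date(2020, 4, 3)},
        {"patient_token": "tok_c", "date_of_death": date(2020, 6, 9)},
    ],
}

# Assumed preference order when the same patient appears in several sources.
SOURCE_PRIORITY = ["death_certificate", "death_index", "obituary"]


def consolidate_mortality(sources, priority=SOURCE_PRIORITY):
    """Return one death record per patient token, keeping the first
    record found when walking the sources in priority order."""
    consolidated = {}
    for source_name in priority:
        for record in sources.get(source_name, []):
            token = record["patient_token"]
            if token not in consolidated:
                consolidated[token] = {**record, "source": source_name}
    return consolidated


deaths = consolidate_mortality(sources)
```

Here `tok_a` appears in both the death certificate and obituary sources, and the death certificate record is the one kept.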
Electronic health record (EHR) data are collected in the ordinary course of hospital and ambulatory care visits. The information is entered directly by the physician or nurse, and that information is then supplemented with lab, imaging, radiology, and genetic testing results as it is sent to the treating physician for record keeping.
Fit-For-Purpose Assessment: De-identified EHR (also called electronic medical record, or EMR) data is well suited to understanding the entire patient experience at a single provider, and covers symptoms, diagnostic testing, and medical treatments and procedures. EHR data is best used to understand how a physician arrived at a diagnosis and treatment decision, as the patient record should contain the information that was available to the physician in making their assessment.
However, numerous vendors provide EHR software, and therefore each provider facility may use a different system. This fragmentation means that as a patient is referred from a PCP to a specialist, or from an outpatient setting to in-patient or long-term care, that patient’s data will often be recorded in different systems. Therefore, EHR data are insufficient to understand the patient journey over a long period of time, whether assessing the patient’s path across providers to arrive at a final diagnosis, or trying to understand the long-term outcomes of a treatment decision.
Additionally, while EHR data will commonly contain drug prescription information, it is important to remember that this data only reflects what the physician has prescribed, and not what the patient has actually filled at a pharmacy. For that information you need to use pharmacy claims data (discussed below).
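The gap between what was prescribed and what was filled can be sketched as a simple comparison between the two data types. The records, field names, and the `patient_token` key below are hypothetical, and matching on patient and drug alone is a simplifying assumption (a real analysis would also consider dates, refills, and drug codes).

```python
# Hypothetical prescriptions recorded in a de-identified EHR extract.
prescriptions = [
    {"patient_token": "tok_a", "drug": "drug_x"},
    {"patient_token": "tok_b", "drug": "drug_x"},
    {"patient_token": "tok_c", "drug": "drug_y"},
]

# Hypothetical fills recorded in de-identified pharmacy claims.
pharmacy_fills = [
    {"patient_token": "tok_a", "drug": "drug_x"},
    {"patient_token": "tok_c", "drug": "drug_y"},
]


def unfilled_prescriptions(prescriptions, fills):
    """Return prescriptions with no matching pharmacy fill for the
    same patient token and drug."""
    filled = {(f["patient_token"], f["drug"]) for f in fills}
    return [
        p for p in prescriptions
        if (p["patient_token"], p["drug"]) not in filled
    ]


unfilled = unfilled_prescriptions(prescriptions, pharmacy_fills)
fill_rate = 1 - len(unfilled) / len(prescriptions)
```

In this toy data, one of the three prescriptions never shows up as a fill, which the EHR data alone would not reveal.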
To best understand which de-identified EHR sources are fit for your analytics, it is important to know the different types of EHR systems used by providers:
Medical claims data are created every time a patient receives a medical service that is billed to their insurer. An insurance claim is sent from the provider to the payer in a standardized format (the 837 EDI format) and includes information about the visit to be paid for (patient demographics, diagnosis and procedure codes, dates of service, etc.) and information required for payment, such as the doctor, treating facility, and patient’s insurance provider.
Fit-For-Purpose Assessment: Medical claims are captured at a number of points: by software used at the provider to generate the claim, at a claims clearinghouse that routes those claims to the proper payer, or at the payer itself. Because of their standard format and use across all specialties and provider types, claims are an excellent way to get large sample sizes for disease epidemiology and comorbidities, understanding the provider landscape, and following a patient’s journey across the health landscape. A medical claim can also contain drug treatment information for drugs that are administered by the doctor (e.g., infusions of drugs like Remicade), but will not contain retail pharmacy prescriptions.
However, the medical claim format does not include many important data fields necessary to understand why a diagnosis or treatment decision was made because it does not contain any diagnostic testing information, physician notes, or other details. For that, de-identified medical claims need to be linked with de-identified EHR data. Additionally, a medical claim is only filed for a reimbursed service, and therefore this data type is not useful for analyzing any service for which the patient pays out-of-pocket (e.g., over-the-counter drugs).
There are multiple sources of medical claims data, and each has specific strengths and weaknesses:
Pharmacy claims data are generated by pharmacies in order to be paid for the cost of a prescribed drug or medical supply that is dispensed to a patient (the patient pays the copay, and the remainder is billed to the payer).
Fit-For-Purpose Assessment: A pharmacy claim will include dosage information, drug strength, fill dates, financial information, and de-identified patient and prescriber codes, and it is the best source for understanding prescription patterns by providers. For this reason, this type of data is the backbone of incentive compensation calculations for pharmaceutical sales reps, and is closely tracked by brand teams (and financial services) to understand market share, competitive dynamics, and changes in prescribing behavior due to new drug launches, generic drug entry, and marketing campaigns. Pharmacy claims are also generated for mail order prescriptions, which are particularly important when studying chronic conditions.
Pharmacy claims by their nature do not capture transactions that are not billed to insurers, but are instead paid for out-of-pocket by the patient (e.g., over-the-counter medications). Likewise, pharmacy claims may miss capturing specialty drug treatments where the drug is injected or administered in a medical setting. Such drugs are covered as medical benefits and billed as medical claims.
Like a medical claim, a pharmacy claim can be captured at the point of generation (the pharmacy), at a claims clearinghouse, or at a payer. Each source has different advantages and disadvantages:
Claims remittance data are generated when a payer reviews a medical claim and determines how much they will pay the provider (a process called “adjudication”). Claims remittances are sent from the payer back to the provider using a common format (the 835 EDI).
Fit-For-Purpose Assessment: A claim remittance has important cost information for the services provided to a patient, and is a key data source for cost effectiveness and pharmacoeconomics studies. However, the claim remittance does not include the clinical information, and therefore must be paired with a medical claim (using a unique claim ID) to understand what the payment is for.
Claims remittance data may be captured through the same set of participants discussed above for medical claims, with generally the same advantages and disadvantages except that remittance data can have a long time lag given the long adjudication process at some payers.
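Pairing a remittance with its medical claim, as described above, amounts to a join on the claim ID. The sketch below is illustrative only: the field names are hypothetical rather than actual 837 or 835 element names, and the codes and amounts are made-up sample values.

```python
# Hypothetical medical claims (clinical detail, no payment information).
medical_claims = [
    {"claim_id": "C1", "patient_token": "tok_a", "diagnosis": "U07.1"},
    {"claim_id": "C2", "patient_token": "tok_b", "diagnosis": "U07.1"},
    {"claim_id": "C3", "patient_token": "tok_c", "diagnosis": "U07.1"},
]

# Hypothetical remittances (payment detail, no clinical information).
# C3 has no remittance yet, reflecting the adjudication time lag.
remittances = [
    {"claim_id": "C1", "paid_amount": 80.0},
    {"claim_id": "C2", "paid_amount": 100.0},
    {"claim_id": "C2", "paid_amount": 25.5},
]


def attach_payments(claims, remittances):
    """Add the total paid amount to each claim, matched on claim_id.
    Claims without any remittance get paid_amount=None."""
    paid_by_claim = {}
    for r in remittances:
        claim_id = r["claim_id"]
        paid_by_claim[claim_id] = paid_by_claim.get(claim_id, 0.0) + r["paid_amount"]
    return [
        {**c, "paid_amount": paid_by_claim.get(c["claim_id"])}
        for c in claims
    ]


costed_claims = attach_payments(medical_claims, remittances)
```

Leaving unmatched claims as `None` rather than zero keeps "not yet adjudicated" distinct from "paid nothing", which matters for cost studies.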
Chargemaster and group purchasing data is billing data available from the financial systems of providers and group purchasing organizations (GPOs), the entities that help groups of providers realize savings by aggregating purchase volume and negotiating discounts with vendors. Chargemaster data is at the patient level, and includes information on the charges they incur during inpatient care. GPO purchasing data is often at the facility level, and includes the bulk supplies they have ordered.
Fit-For-Purpose Assessment: Chargemaster and GPO data are good sources of information for the assessment of the overall cost of goods and supplies for cost effectiveness studies. However, critical details such as individual drugs are often not line-itemed in the chargemaster data and so detailed analytics such as pharmacoeconomics are not possible using this data.
Diagnostic lab testing data are collected at the testing facility. Most diagnostic lab tests are sent out to third-party lab testing services (such as Quest or LabCorp), though for inpatient care the hospital may perform the lab tests themselves on premise.
Fit-For-Purpose Assessment: While the ordering of a lab test is recorded in EHR and claims data sets, the results are often missing from EHRs and are not included in claims at all. Lab testing data is vital for understanding the severity of a disease (e.g., level of cholesterol). Such data is also critical to distinguish between suspected disease and actual disease. Just because a physician diagnoses someone with a condition (and uses the corresponding ICD code) does not mean that the physician’s diagnosis will be confirmed by the subsequent lab test.
While enormously valuable, lab data can be challenging to work with because test codes are not standardized across different lab data sources, and lab data is fragmented across a large number of small labs. Further, nearly half of lab testing is run at hospital-affiliated labs and individual physicians’ offices, leading to major data gaps.
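The difference between suspected and confirmed disease can be made concrete by checking each coded diagnosis against linked lab results. This is a minimal sketch under stated assumptions: the records and `patient_token` key are hypothetical, and the test label `sars_cov_2_pcr` is a made-up name, since (as noted above) test codes are not standardized across lab sources.

```python
# Hypothetical coded diagnoses from claims or EHR data.
diagnoses = [
    {"patient_token": "tok_a", "icd10": "U07.1"},
    {"patient_token": "tok_b", "icd10": "U07.1"},
    {"patient_token": "tok_c", "icd10": "U07.1"},
]

# Hypothetical lab results; tok_c has no linked result (a data gap).
lab_results = [
    {"patient_token": "tok_a", "test": "sars_cov_2_pcr", "result": "positive"},
    {"patient_token": "tok_b", "test": "sars_cov_2_pcr", "result": "negative"},
]


def classify_diagnoses(diagnoses, lab_results, icd10, test):
    """Label each patient with the given diagnosis code as confirmed,
    not_confirmed, or suspected (no linked lab result available)."""
    results = {
        r["patient_token"]: r["result"]
        for r in lab_results
        if r["test"] == test
    }
    status = {}
    for d in diagnoses:
        if d["icd10"] != icd10:
            continue
        result = results.get(d["patient_token"])
        if result == "positive":
            status[d["patient_token"]] = "confirmed"
        elif result == "negative":
            status[d["patient_token"]] = "not_confirmed"
        else:
            status[d["patient_token"]] = "suspected"
    return status


status = classify_diagnoses(diagnoses, lab_results, "U07.1", "sars_cov_2_pcr")
```

All three patients carry the same diagnosis code, but only one is confirmed by a lab result; without the lab data they would be indistinguishable.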
Genetic testing data are becoming more important as medicine becomes more personalized. Genetic testing labs specialize in identifying specific genetic variants that are known to be associated with diseases of interest.
Fit-For-Purpose Assessment: Genetic testing data is a vital element in most oncology analytics, as the results offer vital information about the cancer type and the treatment that is most likely to be effective. Any study that is looking at patient segmentation, treatment choice, and outcomes needs to include the genetic biomarkers of the cancer in their analysis. As more drugs come to market for niche populations, especially in rare and orphan indications, genetic test results will grow in importance.
Genetic testing data is spread across a large number of genetic testing companies, including companies like Foundation Medicine, Invitae, Myriad Genetics, and Qiagen. Each company in this space has developed its own genetic testing panel composed of a different mix of genetic variants that are assessed. This lack of standardization makes it difficult to aggregate results to inform analyses of the overall patient population, and the ideal fit-for-purpose testing source is likely to differ depending on the disease of interest. Further, the sensitive nature of the data forces many of these organizations to be extremely careful about when and to whom they make the data available and requires researchers to take additional measures to protect patient privacy.
Genomic (DNA) sequencing data are collected by specialized labs (e.g., Helix), as well as by consumer vendors (e.g., 23andMe and Ancestry.com) to educate interested customers on their ancestry and predisposition to various conditions.
Fit-For-Purpose Assessment: Genomic sequences are very important for uncovering potential genetic causes for undiagnosed diseases, and for identifying broader and/or novel biomarker associations with diseases of interest. For these purposes, these data are only effective when paired with a comprehensive medical history for the individual and their family members that offers the corresponding phenotypic data. Further, working with genomic sequencing data requires a high degree of specialization (more so than other real-world data types), and poses privacy challenges given the highly detailed information available in such data.
Patient registries are disease-specific collections of data of exceptional depth. Most registries are supported by medical or academic societies to support in-depth research of their patients, and are built from painstaking chart reviews, patient surveys, and collation of other sources.
Fit-For-Purpose Assessment: Patient registries are often the deepest single data set available to assess a disease state, with a strong collection of clinical data and medical history, often supplemented with patient-reported behaviors and outcomes. However, due to the time and expense of collecting this data, registries often resemble clinical trials with smaller sample sizes, and substantial time lags due to intermittent collection periods. And, like clinical trial data, they are tightly controlled by their owners.
Social determinants of health (SDOH) data may be captured to a limited degree in traditional health data sets, but more detailed information around race, behavior, and socioeconomic status is often more systematically collected in non-traditional datasets.
Fit-For-Purpose Assessment: Today, we are seeing wide variances in health outcomes by sex, race, and along community lines that can only be partially understood by analyzing traditional health data – demographic data is key to understanding these disparities. Renewed focus on SDOH has given rise to new health data companies like Socially Determined that are focused on providing useful SDOH data and analytics for healthcare companies. Much of this data is collected from non-traditional sources that have more often been used in consumer and marketing settings than in health settings.
Demographic data only becomes valuable when linked with traditional health data, and such links have to be done carefully and thoughtfully to ensure that patient privacy is protected. Given the amount of non-healthcare data available and the inability to include it all from a privacy perspective, it is important to be selective about which data type is best suited for each use case: