Industry Perspectives on Enterprise Data Management with Anthony Philippakis (Broad Institute)
In our Data & Analytics Thought Leader Series, Datavant’s Head of Data Strategy, Su Huang, interviews leaders who are responsible for defining the teams and processes for managing data and advancing data-driven use cases at their organization. Today’s interview is with Anthony Philippakis, Chief Data Officer at the Broad Institute.
Anthony is committed to bringing genome sequencing and data science into clinical practice. He started his career as a cardiologist at Brigham and Women’s Hospital. Motivated by a desire to build scalable change in healthcare, he moved into technology, first as a product manager at the Broad Institute and later becoming Chief Data Officer (CDO). As CDO, Anthony built the 250-person Data Sciences Platform team to manage data at scale and fuel the next wave of discoveries in the biomedical research community. Currently, Anthony co-directs the Eric and Wendy Schmidt Center (EWSC), which is focused on advancing research at the interface of machine learning and biomedicine. Anthony also builds and invests in new companies as a Venture Partner at GV.
Anthony Philippakis, M.D., Ph.D. Chief Data Officer, Broad Institute
Su: Anthony, thank you so much for joining us for this interview series with data and analytics leaders in healthcare. You have an amazing array of experiences, from being a physician by training to venture investing to building a software organization at Broad. Can you give us an overview of the Broad Institute and your remit there as Chief Data Officer?
Anthony: The Broad Institute of MIT and Harvard was formed in the wake of the human genome project, which was a time of significant change in biological research on two fronts. First, there was interest in large-scale, systematic approaches to biology. Second, there was appreciation that bringing people with diverse skillsets together could be transformative.
The Broad Institute embodies these two philosophies – (1) taking large-scale, systematic approaches to biology and (2) bringing together diverse research teams to execute on them.
My current focus as Chief Data Officer (CDO) is building the Eric and Wendy Schmidt Center (EWSC) in partnership with Caroline Uhler (a professor at MIT and leader of machine learning) and leading a team of data scientists to advance machine learning applications for medicine and biology. In addition to that, I am a member of the Institute’s Executive Leadership Team and continue to work closely with the Data Sciences Platform (DSP), currently led by Clare Bernard.
Su: Can you describe more about the DSP and the EWSC at Broad? What is the mission and vision for each?
Anthony: The Broad Institute recognizes that the greatest transformation occurs when you can tackle challenges that are both intellectually difficult and operationally difficult. I love that.
The life sciences are in the midst of a data revolution. It is time for new approaches to making biology a data science, as well as ways to operationalize and disseminate ideas at scale. At Broad, we focus on both goals.
The EWSC centers around research and innovation at the interface of biomedicine and machine learning (ML). This is a big field, and our focus is on taking the most important questions of biology and using them to drive the next generation of foundational advances in ML. This is different from the typical process of “bringing ML into biology.” There is reason to believe that biology can drive ML in new directions. For instance, in biology, we can conduct perturbations on a larger scale than in most fields. Similarly, we are less concerned with achieving state-of-the-art results on a benchmark dataset and more interested in mechanism. Both of these bring questions of causal inference to the forefront, and causal inference has not been as central to modern ML.
The DSP centers around building a scalable platform called Terra, in conjunction with Microsoft and Verily, that enables researchers to store, share and analyze genomic and clinical data at scale. The mission is to build a software platform that spans the lifecycle of biomedical data. As a modern cloud-based software platform, Terra brings different user personas together and creates value from their interactions. In particular, we are using Terra to:
Create software that is patient-facing to recruit patients for research studies
Conduct data engineering so that large genomic and clinical datasets are quality-checked, staged, and ready for research use
Build ML tools that enable researchers to analyze genomic and clinical datasets
Su: Very interesting – I’ll come back to Terra later. I would love to know how you think about “enterprise data management”, which is the theme of this interview series. To me, enterprise data management represents the processes that an organization undertakes to put data, which may sit across many silos and have disparate rules for use, into a unified infrastructure with a standard process to unlock insights that inform business decisions. Data includes both primary data generated directly by your organization as well as data that Broad has access to via partnerships or licensing. What else do you think about as it relates to “enterprise data management” in your role as data leader for the Broad Institute?
Anthony: That’s a great way to define the basic challenge. I’ll add that for what we do, the scope is much larger — we’re not just looking at data management for the Broad and immediate partners; we’re working to implement data management infrastructure and processes that will serve the needs of the global biomedical research community. Organizations themselves can act as data silos. In particular, many research organizations have data that would be more valuable if it could be federated across those organizations, so we are building solutions to do just that. This benefits the participating organizations since they will be able to gain more insights from their own data. This also benefits patients and study participants who donated the data to begin with.
Su: How does this level of federation help unlock more insights out of the data?
Anthony: Federated learning enables ML models to be trained on distributed datasets. Two primary use cases are:
Increasing statistical power: For many questions in biology, we need massive amounts of data to see effects. A classic example would be human genetics, where we need to pull together data across hundreds of thousands of individuals to find genetic variants that increase or decrease the risk of common diseases like diabetes.
Bringing together orthogonal datasets to make new discoveries: The classic example is understanding the way in which a mutation in the genome impacts risk of disease. What are the pathways that are involved? How do they roll up into physiologic functions? How do these functions go awry? Answering these questions involves pulling together many different datasets at many levels of human biology.
Federated learning allows such insights to be unlocked, even while the data remains distributed.
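To make the idea of learning from data that stays distributed more concrete, here is a minimal sketch of federated averaging, one common federated learning scheme. It is a generic illustration and is not a description of how Terra or the Broad Institute implements anything. The toy model (y = w * x), the three sites, the learning rate, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

# (x, y) pairs held privately by one organization (hypothetical toy data)
Site = List[Tuple[float, float]]


def local_update(w: float, data: Site, lr: float = 0.01, steps: int = 5) -> float:
    """Run a few gradient-descent steps on one site's data for the model y = w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(sites: List[Site], rounds: int = 50) -> float:
    """Federated averaging: only model weights leave each site, never raw records."""
    w = 0.0
    total = sum(len(s) for s in sites)
    for _ in range(rounds):
        # Each site trains locally, starting from the current shared weight
        local_weights = [local_update(w, s) for s in sites]
        # The coordinator combines the results, weighted by each site's sample count
        w = sum(lw * len(s) for lw, s in zip(local_weights, sites)) / total
    return w


# Three assumed sites whose data all follow y = 3x
sites = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
w_global = federated_average(sites)
print(f"shared weight learned without pooling data: {w_global:.4f}")
```

In this sketch the coordinator sees only one number per site per round, so the shared model benefits from all six records even though no site reveals its own.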
Su: How is Broad using federated learning in the Terra platform?
Anthony: We inverted the traditional model of data sharing: instead of having research organizations download copies of the data to their respective silos, you put the data on the cloud, with mechanisms for researchers to access and compute on it in-place. In partnership with Microsoft and Verily, we built the Terra platform to support secure data storage, data sharing and collaborative analyses with built-in tools and interfaces that are tailored for life sciences researchers.
Su: There are a lot of data platforms out there, so what differentiates Terra?
Anthony: We have three strategic pillars:
First, Terra is focused on life science use cases, which allows us to prioritize the needs of those users. We can focus, for instance, on making it easy to work in Terra with common life science data models such as OMOP. We have Domain Specific Languages (DSLs) that make it easy for life sciences researchers to build workflows that leverage common processing patterns seen in genomics.
Second, we focus on flagship scientific projects as drivers for building out Terra capabilities. These include All of Us, Human Cell Atlas, and working with public health departments around the world to perform COVID sequencing.
Third, we are building a federated data ecosystem, not a walled garden. This shows up in a few ways. We build things that are modular and not monolithic. We believe that software should be community-driven. We lean into standards such as GA4GH (Global Alliance for Genomics and Health), which Broad helped found. We are open source, have open APIs, and are committed to sharing data.
Su: How do you reconcile this drive for openness with the need for data security, privacy and compliance?
Anthony: At the platform level, we build to extremely high standards of information security – Terra is rated FedRAMP Moderate, which enables us to store sensitive data from federal projects, such as All of Us. This level of security is crucial for groups handling health data.
Beyond that, we seek to develop technologies that facilitate compliance. One example is an effort called “DUOS” (data use oversight system). Life sciences data within Terra has two axes of access control – one based on who you are and one based on intended use for the data. Who you are is computable, since it is based on IAM (identity and access management) systems that are widely utilized. However, intended use is not currently computable. We changed that by building an ontology that summarizes research purposes, so that they can be computed. When we first took this to our IRB, they said “you can’t automate my job!” We actually ran a trial that was published recently in Cell Genomics showing that the IRB’s opinions and DUOS match up very closely!
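The two axes of access control described above can be sketched in a few lines. This is a simplified illustration and not the DUOS implementation or its ontology: the class names, the fields (general research, allowed diseases, commercial use), and the matching rules are all assumptions chosen to show how an intended use becomes computable once it is expressed in structured terms.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataUseTerms:
    """Hypothetical, simplified consent terms attached to a dataset."""
    general_research: bool = False
    allowed_diseases: FrozenSet[str] = frozenset()
    commercial_use_allowed: bool = True


@dataclass(frozen=True)
class ResearchPurpose:
    """Hypothetical structured description of what a researcher intends to do."""
    disease: Optional[str] = None
    commercial: bool = False


def purpose_permitted(terms: DataUseTerms, purpose: ResearchPurpose) -> bool:
    """Axis 2: is the intended use compatible with the dataset's terms?"""
    if purpose.commercial and not terms.commercial_use_allowed:
        return False
    if terms.general_research:
        return True
    return purpose.disease is not None and purpose.disease in terms.allowed_diseases


def can_access(user_authorized: bool, terms: DataUseTerms, purpose: ResearchPurpose) -> bool:
    """Both axes must pass: who you are (e.g. from an IAM system) and intended use."""
    return user_authorized and purpose_permitted(terms, purpose)


# Assumed example: a dataset consented only for non-commercial diabetes research
diabetes_only = DataUseTerms(
    allowed_diseases=frozenset({"diabetes"}),
    commercial_use_allowed=False,
)
academic_diabetes = ResearchPurpose(disease="diabetes")
commercial_diabetes = ResearchPurpose(disease="diabetes", commercial=True)
academic_cancer = ResearchPurpose(disease="cancer")

print(can_access(True, diabetes_only, academic_diabetes))
print(can_access(True, diabetes_only, commercial_diabetes))
```

The design point is that once a research purpose is a structured record and not free text, a compatibility decision becomes a function call that can be audited and compared against human reviewers.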
Su: Terra is solving the problem of giving researchers access to data. What next? What’s holding the industry back from unlocking more insights from massive amounts of genomics data? Is it technology-based factors such as computing power? Is it people-based factors such as expertise with genomics?
Anthony: It’s a mix of both – scaling genomics analysis traditionally requires a lot of computing power and specialized engineering knowledge. With Terra, we can help with both. We use the cloud to make scalable computing power available to all, and we provide pre-built tools and interfaces to allow researchers to use the data without needing special training. However, there is still a lot of work to be done in terms of algorithm development to achieve next-level insights.
Su: Everything you describe here sounds extremely cross-disciplinary. In past interviews, you’ve spoken about your dream to see a health IT company where the CTO was previously the CTO of Angry Birds, the Chief Medical Officer was a practicing physician who also knows programming, and the CEO was from ad-tech – essentially this idea that Silicon Valley and the healthcare community need to cross-pollinate so that tech and healthcare can learn from each other. How have you tried to institutionalize this vision in the team you’ve built at Broad?
Anthony: That’s exactly what we’ve done with the Data Sciences Platform. We intentionally structured it like a software development organization, with deep connections to research teams at Broad and beyond, so that all relevant expertise is immediately available to cross-pollinate ideas. I hope that we can replicate this with the Eric and Wendy Schmidt Center to turn foundational discoveries in machine learning into production-grade tools.
Su: Last question – if you had unlimited resources and could accelerate one area of what you are overseeing at the Broad Institute, what would that one area of investment be?
Anthony: I am very passionate about improving the state of common disease drug development. Drug development is focused almost exclusively on rare diseases and cancer. There are actually few organizations seeking to develop drugs for the top 10 causes of death! For a long time that was because we lacked good targets, but human genetics has really changed that. The challenge now is the high cost of common disease clinical trials. Running a Phase 3 trial for a disease like coronary artery disease can cost more than $1b and take 5 years because of the high number of patients you need to recruit and the long time period needed to have events for the primary endpoint of the trial.
However, there is a new generation of ML tools based on genomic and clinical data that can predict both who will be more likely to have events and who will be more likely to respond to therapies. This would allow for smaller trials. I am excited about working to develop these ML instruments through the Eric and Wendy Schmidt Center, and potentially accelerate common disease drug trials.
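A rough back-of-the-envelope calculation shows why enrolling patients who are more likely to have events can shrink a trial. The sketch below uses Schoenfeld's standard approximation for the number of events an event-driven, 1:1 randomized trial needs, then divides by the probability that an enrolled patient has an event during follow-up. The hazard ratio of 0.8, the 10% and 20% event probabilities, and the function names are illustrative assumptions and are not figures from the interview. Treating the event probability as one average across both arms is a further simplification.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio: float, alpha: float = 0.05, power: float = 0.9) -> int:
    """Schoenfeld approximation for a two-arm, 1:1 randomized, event-driven trial."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(4 * z**2 / log(hazard_ratio) ** 2)


def required_patients(events: int, event_probability: float) -> int:
    """Patients to enroll so that the expected number of events reaches the target."""
    return ceil(events / event_probability)


events = required_events(hazard_ratio=0.8)
# Assumed: 10% of unselected patients have an event during follow-up
unselected = required_patients(events, event_probability=0.10)
# Assumed: an ML risk model selects patients with a 20% event probability
enriched = required_patients(events, event_probability=0.20)

print(f"events needed: {events}")
print(f"patients needed, unselected population: {unselected}")
print(f"patients needed, enriched population: {enriched}")
```

Under these assumptions the number of events needed is fixed by the effect size, so doubling the event probability roughly halves the number of patients to recruit.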
Su: That would be a worthy cause indeed. Anthony, thank you so much for sharing these insights!