Editorial Director Allison Proffitt of Bio-IT World recently interviewed John Quackenbush of the Harvard T.H. Chan School of Public Health and the Dana-Farber Cancer Institute. Dr. Quackenbush will deliver his keynote presentation, “Using Networks to Link Genotype to Phenotype,” at the Clinical Genomics conference taking place May 23-25, 2017, at the Bio-IT World Conference & Expo in Boston, MA.

BIOINFORMATICS AND DRUG RESEARCH: OPPORTUNITIES AND ROADBLOCKS

Q: What are the biggest opportunities and biggest roadblocks to bio-IT, research computing, drug discovery, and precision medicine in the next 15 years?

I typically open my presentations with the observation that revolutions in science are driven by one essential thing: data. Data is the raw material we use to build models, to verify or falsify them, and then to iterate and repeat the process. When I consider where we are in health and biomedical research today, I find myself very excited by the scope and scale of the data available to us. Whether it is the exponentially growing number of genome sequences and the increasing number of multi-omic studies, the expansion of electronic health records, the population studies that follow thousands or even hundreds of thousands of individuals, or the large pharmacogenomic screens now available, the size, complexity, and diversity of the individual datasets being generated [in the bio-IT space] are astounding.

As a Professor of Computational Biology at the Dana-Farber Cancer Institute and the Harvard T.H. Chan School of Public Health, my research group and I have taken advantage of these and other types of datasets that were inconceivable just a few years ago to develop new analytical methods that are providing unprecedented insights into the biology of health and disease.

We have made substantial advances in our use of genomic data. For example, we have developed a way to model gene regulatory networks and to compare network structures between healthy and disease states, which has helped us to identify new drug targets, to understand disease drivers, and to explore sexual dimorphism in healthy and disease states. We have extended this method to infer gene regulatory networks for each individual in a group, which has opened up the potential for applications in precision network medicine. My colleagues and I have used graphical methods to analyze the link between genetic variants and gene expression, and in the process we have discovered how small-effect genetic variants may work synergistically to effect changes in phenotype.
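To make the idea of comparing network structures concrete, here is a minimal sketch in Python. It is not the published method referred to above; it simply assumes that some network-inference tool has produced TF-by-gene edge-weight matrices for a healthy cohort and a disease cohort, and ranks genes by how much their regulatory "targeting" changes between the two states. All gene and TF names, and the random values, are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tfs = [f"TF{i}" for i in range(5)]        # hypothetical transcription factors
genes = [f"G{j}" for j in range(20)]      # hypothetical target genes

# Edge-weight matrices (TFs x genes) as a network-inference tool might produce;
# random values stand in for real inferred edge weights.
healthy = pd.DataFrame(rng.normal(size=(5, 20)), index=tfs, columns=genes)
disease = pd.DataFrame(rng.normal(size=(5, 20)), index=tfs, columns=genes)

# "Targeting" of a gene = sum of its incoming edge weights from all TFs.
# Genes whose targeting changes most between states are candidate disease drivers.
diff_targeting = (disease - healthy).sum(axis=0)
print(diff_targeting.abs().sort_values(ascending=False).head())
```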

Beyond using genomic data alone, my work with Hugo Aerts has shown that quantitative measures of tumor morphology in CT scans can, in some instances, predict the mutational status of tumors. And we are seeing evidence that even cellular morphology in healthy tissues is linked to an individual’s genotype.

These advances have all been possible thanks to access to multiple sources of independent data on large numbers of individuals, and we are indebted to those who have made them available. But unfortunately, our ability to make advances is often limited by incorrect, incomplete, or inadequate data.

For example, when we looked at gene expression data in the Gene Expression Omnibus (GEO), we discovered a surprisingly high rate of misidentification of the sex of research subjects, something that is easy to detect using expression data for Y chromosome genes. If the recorded sex is wrong, one has to wonder how many other metadata variables are incorrect.
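As an illustration of how such a check can work, here is a minimal sketch that flags samples whose annotated sex disagrees with Y-linked gene expression. It assumes an expression matrix indexed by gene symbol and a metadata table with a 'sex' column; the gene list and threshold are assumptions for illustration, not a description of the analysis used on GEO.

```python
import pandas as pd

# Y-linked genes whose expression is expected mainly in male samples.
Y_GENES = ["RPS4Y1", "DDX3Y", "KDM5D", "UTY"]

def flag_sex_mismatches(expr: pd.DataFrame, meta: pd.DataFrame,
                        threshold: float = 1.0) -> pd.DataFrame:
    """expr: genes x samples expression matrix; meta: per-sample table with a 'sex' column.
    The threshold is arbitrary and would depend on the platform and normalization."""
    y_signal = expr.reindex(Y_GENES).mean(axis=0)   # mean Y-gene expression per sample
    inferred = y_signal.gt(threshold).map({True: "male", False: "female"})
    annotated = meta.assign(inferred_sex=inferred)
    return annotated[annotated["sex"].str.lower() != annotated["inferred_sex"]]
```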

Incomplete or inadequate data can also present barriers to progress. Despite the massive quantity of data collected in cancer, my colleagues and I have struggled to find a dataset that includes gene-expression data, tumor grade and stage, outcome, and drug treatment. While incomplete datasets may seem like a minor annoyance, they can have important ramifications. For instance, not knowing what drugs patients were treated with could easily confound a survival analysis. Incompleteness may also factor into the poor reproducibility of many published studies.

Outside the field of academic research, we are seeing how the lack of data standards and inconsistent collection methods can hamper progress. A few years ago, my colleague Mick Correll and I launched a company called Genospace. In 2016, we unveiled an advanced clinical-trial matching application that uses patient information parsed from an EMR, along with trial inclusion criteria, to match patients with appropriate trials. Solving the problem was immensely challenging, since different implementations of the same EMR system can store data in different ways and in different formats. Trials are not consistently annotated, and so the roadblocks that lie between patient data and inclusion criteria include a myriad of ontologies, controlled vocabularies, data dictionaries, and missing or unparsable data. While the field often focuses on overcoming the challenges of managing big data, the real problem is that we have messy data.
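The matching logic itself is simple once the data have been cleaned; it is the normalization that is hard. The toy sketch below assumes that diagnosis codes, ages, and mutation calls have already been parsed into a consistent form; every field name, trial, and criterion is hypothetical and not drawn from the Genospace application.

```python
from dataclasses import dataclass, field

@dataclass
class Trial:
    name: str
    diagnoses: set                      # accepted diagnosis codes (already normalized)
    min_age: int = 0
    max_age: int = 120
    required_mutations: set = field(default_factory=set)

def matching_trials(patient: dict, trials: list) -> list:
    """Return names of trials whose inclusion criteria the patient record satisfies."""
    hits = []
    for t in trials:
        if patient["diagnosis"] not in t.diagnoses:
            continue
        if not (t.min_age <= patient["age"] <= t.max_age):
            continue
        if not t.required_mutations <= set(patient.get("mutations", [])):
            continue
        hits.append(t.name)
    return hits

patient = {"age": 54, "diagnosis": "NSCLC", "mutations": ["EGFR L858R"]}
trials = [Trial("TRIAL-A", {"NSCLC"}, 18, 75, {"EGFR L858R"}),
          Trial("TRIAL-B", {"CRC"}, 18, 80)]
print(matching_trials(patient, trials))   # -> ['TRIAL-A']
```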

Nevertheless, I believe that we should be optimistic about the future. More biologists and bioinformatics scientists are quickly learning how to deal with messy data, including mastering the art of data wrangling, to create large and increasingly useful data resources. I believe that these data resources will drive innovation that will advance our understanding of basic biology, help us to identify new drug targets, let us evaluate those drugs more thoroughly and more quickly, and open up opportunities to intelligently repurpose existing therapies. Ultimately, it is data that will help us move from the somewhat anecdotal way in which medicine is currently practiced to a new era in which information allows us to make more-informed decisions about patient care.

Having worked in bio-IT for many years, I find it exciting to watch as health and biomedical research evolves into an information science in which the scope of what we can do will be limited only by the quality and quantity of the data we can effectively access.

Speaker Information:

John Quackenbush, Ph.D., Professor, Biostatistics, Harvard T.H. Chan School of Public Health; Professor, Biostatistics and Computational Biology, Dana-Farber Cancer Institute

John Quackenbush received his Ph.D. in theoretical physics from UCLA in 1990. Following a postdoc in physics, he received an NIH fellowship to work on the Human Genome Project. He spent two years at the Salk Institute and two years at Stanford University before moving to The Institute for Genomic Research in 1997. John joined the Dana-Farber Cancer Institute and the Harvard T.H. Chan School of Public Health in 2005, where he uses computational and systems biology to better understand the complexities of human disease. John has received numerous awards, including recognition as a 2014 White House Open Science Champion of Change.

Read an edited version of Dr. Quackenbush’s comments along with others’ thoughts at Bio-IT World: Fifteen Years: Where Biotech Has Been And Where It Is Going

