Monday MAY 15 – Tuesday MAY 16

Bio-IT World is proud to bring together innovative data scientists and developers from across the industry to solve real-world data challenges using the principles of Open Source & FAIR Data. 

Over the years, the Bio-IT World Hackathon has delivered a new level of collaboration to the annual Bio-IT World Conference & Expo in Boston. Our 2022 FAIR Data Hackathon, co-facilitated with leaders from the NIH and NVIDIA, brought together 4 teams and over 75 participants to solve real data challenges.

Back again in 2023, the fifth annual Bio-IT Hackathon will continue in the tradition of uniting life science and IT teams to tackle actual bioinformatics projects with maximum impact potential. Projects at the event will feature either Open Source tools or some or all aspects of making data findable, accessible, interoperable, and reusable. All projects will be broadly applicable to the data science community.

The first projects have been announced and are detailed below. Stay tuned for additional projects!


Project 1: Gene Trends
Broad Institute of MIT and Harvard

About the Project: 
The popularity of specific genes over time tells the story of society's focus on biomedicine and genomics. We will extend and refine Python code to extract and transform longitudinal gene page view counts from Wikipedia and gene publication citation counts from PubMed. These count data will be hosted on GitHub and updated via GitHub Actions. Visualizations will leverage reusable web components (e.g., interactive plots and genome visualizations) and any other interfaces that participants build. We'll also explore this data in interactive notebooks in Terra. Let's surface this history of genes and make it easily accessible to all.
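As a rough illustration of the extraction step, the sketch below builds a request URL for the public Wikimedia Pageviews REST API and aggregates daily counts into monthly totals. It is not the project's actual code: the function names, the choice of "BRCA1" as an example article, and the sample values are assumptions made for illustration, and the sample payload is hand-written in the shape the API returns rather than fetched over the network.

```python
from collections import defaultdict
from urllib.parse import quote

# Base URL of the public Wikimedia Pageviews REST API (per-article endpoint).
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build a daily per-article pageviews URL; start and end are YYYYMMDD strings."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


def monthly_views(payload):
    """Sum daily view counts per YYYY-MM from a decoded pageviews JSON response."""
    totals = defaultdict(int)
    for item in payload.get("items", []):
        stamp = item["timestamp"]  # formatted as YYYYMMDDHH
        totals[f"{stamp[:4]}-{stamp[4:6]}"] += item["views"]
    return dict(totals)


# Hand-written sample in the shape the API returns (values are made up).
sample = {
    "items": [
        {"article": "BRCA1", "timestamp": "2023010100", "views": 100},
        {"article": "BRCA1", "timestamp": "2023010200", "views": 150},
        {"article": "BRCA1", "timestamp": "2023020100", "views": 90},
    ]
}

print(pageviews_url("BRCA1", "20230101", "20230201"))
print(monthly_views(sample))
```

In a scheduled job, the URL would be fetched with an HTTP client and the decoded JSON passed to `monthly_views`; keeping the parsing separate from the network call makes the transform easy to test offline.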

Why this Project is Applicable to Others in the Community: 
Use cases: 
  • Alert users to genes that have signals of breaking news 
  • Enable users to assess relative scholarly and popular interest in any gene 
  • Give journalists or science writers data to inform works on changes in society's focus on biomedicine and genomics 
  • Give software engineers in biomedicine and genomics a useful dimension and dataset to rank genes by priority 
  • Beyond that output data, this will also give team participants a venue to learn more about Python, JavaScript, GitHub Actions, interactive notebooks, and Terra. 
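One possible way to approach the first use case above, flagging genes with signals of breaking news, is to compare each gene's latest page view count with a trailing baseline. The sketch below is only an assumption about how such a signal might be computed: the median baseline, the 7-day window, the 3x threshold, and the gene names and counts are all hypothetical choices, not part of the project.

```python
from statistics import median


def spike_ratio(counts, window=7):
    """Ratio of the latest count to the median of the preceding `window` counts."""
    if len(counts) < window + 1:
        return None  # not enough history to form a baseline
    baseline = median(counts[-(window + 1):-1])
    if baseline == 0:
        return None  # a zero baseline would make the ratio undefined
    return counts[-1] / baseline


def trending_genes(views_by_gene, threshold=3.0, window=7):
    """Return (gene, ratio) pairs at or above the threshold, highest ratio first."""
    hits = []
    for gene, counts in views_by_gene.items():
        ratio = spike_ratio(counts, window)
        if ratio is not None and ratio >= threshold:
            hits.append((gene, ratio))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up daily view counts, oldest first.
views = {
    "GENE_A": [100, 110, 90, 105, 95, 100, 102, 400],
    "GENE_B": [50, 55, 52, 49, 51, 50, 53, 54],
}
print(trending_genes(views))
```

A median baseline is used here because it is less sensitive to an earlier one-day spike than a mean would be; other baselines may suit the real data better.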
How Project is Open Source and/or FAIR: 
Data and source code will be freely and openly accessible. Concrete examples will demonstrate reusability, and foster it in ways that directly add value for engineers and data scientists. The project itself is, essentially, to make data that is already public much more findable and accessible in ways that add value for researchers, funding agencies, historians, and anyone in the public who is interested in genes.

Project 2: Integrating the Pebblescout and Other Similar SRA (Sequence Read Archive) Search Indexes Into Computational Workflows

About the Project:
We will integrate Pebblescout and other similar SRA (Sequence Read Archive) search indexes into computational workflows that also leverage ElasticBLAST and cloud-hosted, precalculated data: results from the SRA Taxonomy Analysis Tool (STAT), preassembled contigs, and metadata associated with SRA samples.

Why this Project is Applicable to Others in the Community: 
The more than 40 petabytes – and growing – of data available from the NIH Sequence Read Archive (SRA) and other public resources provide a vast reservoir of information that can be tapped to accelerate biomedical discovery. Working with this data, participants will apply tools and resources to several large-scale biological questions including but not limited to: 
  • Establish biological relationships between specific genes or organism groups including pathogens and environmental attributes like geography, season, biome, etc. 
  • Utilize the environmental and temporal diversity of SRA samples to investigate the composition of pangenomes across organisms at strain, species, and genus levels. 
How Project is Open Source and/or FAIR: 

To fully realize its potential, SRA data must be findable and accessible. To this end, SRA has been distributed via the Amazon Web Services Open Data Sponsorship Program (AWS ODP) and the Google Cloud Platform (GCP), enabling efficient access to the entire sequence data corpus as well as the metadata describing the technical, biological, and environmental context under which individual sequence samples were obtained. The NIH and Department of Energy (DOE) are now collaborating to bring sequence-based search tools capable of scanning the entire petabyte-scale SRA to the scientific community.

Project 3: Using MYC Amplification as a Prior in Gene Expression Analysis in Congenital Heart Disease and Cancer in Kids First and INCLUDE Data Sets

About the Project:

MYC is an oncoprotein often associated with worse outcomes; however, it also seems to have a role in cardiovascular disease. Using open data from both the INCLUDE Data Hub and the Kids First Data Resource Portal, this Hackathon will:

The workflows and notebooks will be on GitHub, and all processes will be containerized. We will build the workflow and load it onto CAVATICA. We will also push the data from Kids First and INCLUDE onto CAVATICA.

What is the Gabriella Miller Kids First Pediatric Research Program? The Gabriella Miller Kids First Pediatric Research Program (Kids First) is a trans-NIH Common Fund program whose goal is to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders. To achieve this goal, the program has developed the Gabriella Miller Kids First Data Resource, a cloud-based platform which publicly shares genetic and clinical data from childhood cancer and structural birth defect cohorts, and includes a portal (https://portal.kidsfirstdrc.org/) and other tools to foster analysis and collaboration.

What is the INCLUDE Project? The NIH INCLUDE (INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE) Project seeks to accelerate the discovery of etiology and biologic pathways underlying conditions that co-occur with Down syndrome. The mission of the INCLUDE Data Hub is to connect and empower the Down syndrome stakeholder and research community on behalf of accelerated impact through scientific discoveries that will improve and enrich the lives of people with Down syndrome. The Data Hub is a collaborative, secure cloud-based resource that includes accessible, multimodal data, reproducible bioinformatic tools and pipelines, and workspaces supporting scalable computation and analysis.

Project 4: Building Knowledge Graphs to Capture Disease Sub-Typing and Subsequent Drug Efficacy Information

About the Project: 

Many researchers are interested in subtyping disease to guide coherent drug treatment. We have done some preliminary work demonstrating that disease subtyping can be predicted and validated using large biobank-style data. We'd like to demonstrate the capture of that information in knowledge graphs such that they can both be loaded into and validate cutting-edge data models.

How Project is Open Source and/or FAIR: 
The entirety of the project will be on GitHub and will pull from a variety of open-source datasets.

Project 5: kidSIDES
Regeneron Pharmaceuticals

About the Project:

kidSIDES (https://kidsides.nickg.bio/) is the largest open database of pediatric drug safety signals, with 500K signals. In it, ~20,000 significant pediatric drug safety signals were identified, and the association between montelukast and psychiatric disorders was corroborated. The resource allows generation of more hypotheses on topics such as adverse effect profiles, drug toxicity during childhood, and genetic susceptibility to pediatric adverse drug effects.

Why this Project is Applicable to Others in the Community: 
The R package kidsides, available on CRAN, downloads and caches the database on your machine. This SQLite database contains 17 tables with drug, adverse event, gene, effect class, and drug class identifiers that support analysis and the construction of knowledge graphs.
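Because the database is a SQLite file, it can also be inspected outside R once it has been downloaded. The sketch below lists each table and its row count using Python's standard `sqlite3` module. It runs against an in-memory stand-in with two hypothetical tables, since the real kidSIDES table names, columns, and cache location are not given here; the function name is likewise an assumption.

```python
import sqlite3


def table_row_counts(conn):
    """Map each user table name to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Stand-in database with hypothetical tables; replace ":memory:" with the
# path to the cached database file to inspect the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drug (drug_id TEXT, drug_name TEXT)")
conn.execute("CREATE TABLE event (event_id TEXT, event_name TEXT)")
conn.executemany("INSERT INTO drug VALUES (?, ?)", [("d1", "example drug")])
print(table_row_counts(conn))
conn.close()
```

Listing tables and counts first is a cheap way to confirm that the download completed and to see which identifier tables are available for joins.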

How Project is Open Source and/or FAIR: 
The project would be based on open data, from kidSIDES or elsewhere, such as OpenPBTA from ccdatalab.org. The analysis and code would be published on GitHub.

Sponsorship opportunities are available!

For partnering and sponsorship information, please contact:

Companies A-K
Rod Eymael
Business Development Manager
Cambridge Healthtech Institute
Phone: (+1) 781-247-6286
Email: reymael@healthtech.com


Companies L-Z
Aimee Croke
Business Development Manager
Cambridge Healthtech Institute
Phone: (+1) 781-972-5458
Email: acroke@cambridgeinnovationinstitute.com
