Deep Learning Augmented Genome Mining in the "omics" Era
Abstract
Bacterial specialized metabolite (SM) scaffolds are fundamental to many important
medicines, including antibiotics. Widespread dissemination of antimicrobial resistance
demands the isolation of mechanistically and structurally novel therapeutics to enable
lifesaving medical interventions. The meteoric growth of genomic sequencing data has
uncovered millions of biosynthetic gene clusters (BGCs) encoding SMs. However, much
of this chemical space remains unexplored due to technical limitations in BGC comparison
and limited strategies for BGC prioritization. In this thesis, I develop deep learning
algorithms that enable high-throughput comparison, structural rationalization,
bioactivity prediction, and defragmentation of BGCs, supporting large-scale BGC
prioritization for SM-based drug discovery efforts. Firstly, I develop Transformer-based
deep learning algorithms to identify and represent BGCs using highly scalable, vectorized
representations. These algorithms drastically outperform the current state of the art and
enable rapid comparison, grouping, and prioritization of BGCs at an immense (>1 million
BGC) scale. Secondly, I develop computational methods to biosynthetically link SMs to
candidate BGCs, increasing the dataset of potential SM-BGC relationships eight-fold
relative to current datasets. This method also enables prioritization of BGCs encoding
structurally novel SMs and streamlines the isolation of SMs in a rationalizable fashion, leading
to the isolation of a novel lipopeptide. Thirdly, I develop computational methods to identify
bioactive molecular and genetic signatures present in BGCs and use these methods to
streamline the isolation of a novel antitubercular peptide. Finally, I demonstrate a method
enabling BGC defragmentation with scalable BGC fragment representations, facilitating
the identification and comparison of discontiguous BGCs. Critically, the advances in this
thesis leverage highly scalable vectorized representations that can handle
the extreme dataset sizes generated in the era of “multi-omics” data. Together, this
work provides a means to leverage the immense wealth of genomic data to prioritize novel
BGCs for streamlined, targeted SM-based drug discovery.