Deep Learning Augmented Genome Mining in the "omics" Era

Spencer, Norman R.

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/32239

Title:	Deep Learning Augmented Genome Mining in the "omics" Era
Other Titles:	DEEP LEARNING AUGMENTED GENOME MINING IN THE “OMICS” ERA
Authors:	Spencer, Norman R.
Advisor:	Magarvey, Nathan A.
Department:	Biochemistry and Biomedical Sciences
Keywords:	Natural Products;Genomics;Artificial Intelligence;Transformer;Graphormer;Specialized Metabolism;Metabolism;Knowledge Graphs;Biosynthesis;Bacteria
Publication Date:	2025
Abstract:	Bacterial specialized metabolite (SM) scaffolds are fundamental to many important medicines, including antibiotics. Widespread dissemination of antimicrobial resistance demands the isolation of mechanistically and structurally novel therapeutics to enable lifesaving medical interventions. The meteoric growth of genomic sequencing data has uncovered millions of biosynthetic gene clusters (BGCs) encoding SMs. However, much of this chemical space remains unexplored due to technical limitations in BGC comparison and limited strategies for BGC prioritization. In this thesis, I develop deep learning algorithms that support high-throughput comparison, structural rationalization, bioactivity prediction, and defragmentation of BGCs, enabling large-scale BGC prioritization for SM-based drug discovery efforts. Firstly, I develop Transformer-based deep learning algorithms to identify and represent BGCs using highly scalable, vectorized representations. These algorithms drastically outperform the current state of the art and enable rapid comparison, grouping, and prioritization of BGCs at an immense (>1 million BGC) scale. Secondly, I develop computational methods to biosynthetically link SMs to candidate BGCs, increasing the dataset of potential SM-BGC relationships eight-fold relative to current datasets. This method also enables prioritization of BGCs encoding structural novelty and streamlines the isolation of SMs in a rationalizable fashion, leading to the isolation of a novel lipopeptide. Thirdly, I develop computational methods to identify bioactive molecular and genetic signatures present in BGCs and use these methods to streamline the isolation of a novel antitubercular peptide. Finally, I demonstrate a method enabling BGC defragmentation with scalable BGC fragment representations, facilitating the identification and comparison of discontiguous BGCs.
Critically, the advances in this thesis leverage highly scalable vectorized representations that can handle the extreme dataset sizes being generated in the era of “multi-omics” data. Together, this work provides a means to leverage the immense wealth of genomic data to prioritize novel BGCs for streamlined, targeted SM-based drug discovery.
URI:	http://hdl.handle.net/11375/32239
Appears in Collections:	Open Access Dissertations and Theses

Files in This Item:

File	Embargoed until	Size	Format
Spencer_Norman_R_202507_PhD.pdf	2026-08-06	12.08 MB	Adobe PDF
Appendix A.pdf	2026-08-06	10.03 MB	Adobe PDF
File_A1.txt	2026-08-06	111.84 kB	Text
File_A2.txt	2026-08-06	4.12 kB	Text
File_A3.txt	2026-08-06	26.23 kB	Text
Table_A1.xlsx	2026-08-06	286.94 kB	Microsoft Excel XML
Table_A2.xlsx	2026-08-06	143.75 kB	Microsoft Excel XML
Table_A3.xlsx	2026-08-06	255.44 kB	Microsoft Excel XML
Table_A4.xlsx	2026-08-06	2.62 MB	Microsoft Excel XML
Table_A5.xlsx	2026-08-06	21.54 kB	Microsoft Excel XML
Table_A6.xlsx	2026-08-06	3.28 MB	Microsoft Excel XML
Table_A7.xlsx	2026-08-06	353.33 kB	Microsoft Excel XML
Appendix B.pdf	2026-08-06	18.95 MB	Adobe PDF
Appendix C.pdf	2026-08-06	12.89 MB	Adobe PDF
TableC1.xlsx	2026-08-06	52.94 kB	Microsoft Excel XML
TableC2.xslx.xlsx	2026-08-06	12.07 kB	Microsoft Excel XML
