Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32256
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Mahyar, Hamidreza | - |
dc.contributor.author | Khodadad, Mohammad | - |
dc.date.accessioned | 2025-08-27T14:00:27Z | - |
dc.date.available | 2025-08-27T14:00:27Z | - |
dc.date.issued | 2025 | - |
dc.identifier.uri | http://hdl.handle.net/11375/32256 | - |
dc.description.abstract | Large language models (LLMs) and embedding techniques have transformed general-purpose NLP, but their performance degrades on specialized scientific texts. In this thesis, we make three contributions to bridge this gap. First, we introduce two large-scale benchmark suites: ChemTEB, comprising 35 tasks on chemical corpora drawn from PubChem, CoconutDB, Safety Data Sheets, and Wikipedia; and MedTEB, comprising 51 medical tasks spanning EHR notes, PubMed abstracts, and clinical question–answer sets. Both cover classification, clustering, pair classification, retrieval, and bitext mining. Second, we propose MedTE, a 768-dimensional embedding model fine-tuned via self-supervised contrastive learning on an extensive biomedical corpus, which achieves state-of-the-art performance on MedTEB. Third, we develop GraphRAG, an automated pipeline that constructs chemical knowledge graphs from ChemRxiv preprints and generates multi-hop questions to assess compositional reasoning. Through rigorous evaluation, we show that ChemTEB reveals critical weaknesses in current chemical embeddings and that even with perfect context, LLMs achieve under 50% accuracy on multi-hop chemistry question answering. We release all benchmarks, code, and models to foster further research in domain adaptation and compositional reasoning for specialized NLP applications. | en_US |
dc.language.iso | en | en_US |
dc.subject | Large Language Models | en_US |
dc.subject | Chemistry | en_US |
dc.subject | Biomedicine | en_US |
dc.subject | Medicine | en_US |
dc.title | DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computational Engineering and Science | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Master of Applied Science (MASc) | en_US |
dc.description.layabstract | Large language models often excel at general text but struggle with specialized scientific language. This thesis addresses this challenge with three main contributions. First, it introduces ChemTEB and MedTEB: two benchmark collections of 35 chemistry and 51 medical tasks, respectively, covering a range of text-analysis challenges. Second, it presents MedTE, a new 768-dimensional embedding model trained to better understand biomedical language, which achieves leading results on MedTEB. Third, it describes GraphRAG, an automated system that builds chemical knowledge graphs from research preprints and generates complex, multi-step questions to test reasoning. Our experiments reveal significant gaps in current models’ grasp of scientific text, with accuracy falling below 50% on multi-step chemistry questions. All benchmarks, code, and models are publicly released to advance research in specialized NLP. | en_US |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Mohammad_s_thesis (9).pdf | | 5.99 MB | Adobe PDF |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.