
DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE

dc.contributor.advisor: Mahyar, Hamidreza
dc.contributor.author: Khodadad, Mohammad
dc.contributor.department: Computational Engineering and Science
dc.date.accessioned: 2025-08-27T14:00:27Z
dc.date.available: 2025-08-27T14:00:27Z
dc.date.issued: 2025
dc.description.abstract: Large language models (LLMs) and embedding techniques have transformed general-purpose NLP, but their performance degrades on specialized scientific texts. In this thesis, we make three contributions to bridge this gap. First, we introduce two large-scale benchmark suites: ChemTEB, comprising 35 tasks on chemical corpora drawn from PubChem, CoconutDB, Safety Data Sheets, and Wikipedia; and MedTEB, comprising 51 medical tasks spanning EHR notes, PubMed abstracts, and clinical question–answer sets. Both cover classification, clustering, pair classification, retrieval, and bitext mining. Second, we propose MedTE, a 768-dimensional embedding model fine-tuned via self-supervised contrastive learning on an extensive biomedical corpus, which achieves state-of-the-art performance on MedTEB. Third, we develop GraphRAG, an automated pipeline that constructs chemical knowledge graphs from ChemRxiv preprints and generates multi-hop questions to assess compositional reasoning. Through rigorous evaluation, we show that ChemTEB reveals critical weaknesses in current chemical embeddings and that even with perfect context, LLMs achieve under 50% accuracy on multi-hop chemistry question answering. We release all benchmarks, code, and models to foster further research in domain adaptation and compositional reasoning for specialized NLP applications.
dc.description.degree: Master of Applied Science (MASc)
dc.description.degreetype: Thesis
dc.description.layabstract: Large language models often excel at general text but struggle with specialized scientific language. This thesis addresses this challenge with three main contributions. First, it introduces ChemTEB and MedTEB: two benchmark collections of 35 chemistry and 51 medical tasks, respectively, covering a range of text-analysis challenges. Second, it presents MedTE, a new 768-dimensional embedding model trained to better understand biomedical language, which achieves leading results on MedTEB. Third, it describes GraphRAG, an automated system that builds chemical knowledge graphs from research preprints and generates complex, multi-step questions to test reasoning. Our experiments reveal significant gaps in current models’ grasp of scientific text, with accuracy falling below 50% on multi-step chemistry questions. All benchmarks, code, and models are publicly released to advance research in specialized NLP.
dc.identifier.uri: http://hdl.handle.net/11375/32256
dc.language.iso: en
dc.subject: Large Language Models
dc.subject: Chemistry
dc.subject: Biomedicine
dc.subject: Medicine
dc.title: DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE
dc.type: Thesis

Files

Original bundle

Name: Mohammad_s_thesis (9).pdf
Size: 5.85 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission