Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32256
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Mahyar, Hamidreza | - |
dc.contributor.author | Khodadad, Mohammad | - |
dc.date.accessioned | 2025-08-27T14:00:27Z | - |
dc.date.available | 2025-08-27T14:00:27Z | - |
dc.date.issued | 2025 | - |
dc.identifier.uri | http://hdl.handle.net/11375/32256 | - |
dc.description.abstract | Large language models (LLMs) and embedding techniques have transformed general-purpose NLP, but their performance degrades on specialized scientific texts. In this thesis, we make three contributions to bridge this gap. First, we introduce two large-scale benchmark suites: ChemTEB, comprising 35 tasks on chemical corpora drawn from PubChem, CoconutDB, Safety Data Sheets, and Wikipedia; and MedTEB, comprising 51 medical tasks spanning EHR notes, PubMed abstracts, and clinical question–answer sets. Both cover classification, clustering, pair classification, retrieval, and bitext mining. Second, we propose MedTE, a 768-dimensional embedding model fine-tuned via self-supervised contrastive learning on an extensive biomedical corpus, which achieves state-of-the-art performance on MedTEB. Third, we develop GraphRAG, an automated pipeline that constructs chemical knowledge graphs from ChemRxiv preprints and generates multi-hop questions to assess compositional reasoning. Through rigorous evaluation, we show that ChemTEB reveals critical weaknesses in current chemical embeddings and that even with perfect context, LLMs achieve under 50% accuracy on multi-hop chemistry question answering. We release all benchmarks, code, and models to foster further research in domain adaptation and compositional reasoning for specialized NLP applications. | en_US |
dc.language.iso | en | en_US |
dc.subject | Large Language Models | en_US |
dc.subject | Chemistry | en_US |
dc.subject | Biomedicine | en_US |
dc.subject | Medicine | en_US |
dc.title | DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computational Engineering and Science | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Master of Applied Science (MASc) | en_US |
dc.description.layabstract | Large language models often excel at general text but struggle with specialized scientific language. This thesis addresses this challenge with three main contributions. First, it introduces ChemTEB and MedTEB: two benchmark collections of 35 chemistry and 51 medical tasks, respectively, covering a range of text-analysis challenges. Second, it presents MedTE, a new 768-dimensional embedding model trained to better understand biomedical language, which achieves leading results on MedTEB. Third, it describes GraphRAG, an automated system that builds chemical knowledge graphs from research preprints and generates complex, multi-step questions to test reasoning. Our experiments reveal significant gaps in current models’ grasp of scientific text, with accuracy falling below 50% on multi-step chemistry questions. All benchmarks, code, and models are publicly released to advance research in specialized NLP. | en_US |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Mohammad_s_thesis (9).pdf | | 5.99 MB | Adobe PDF |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.