
DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE

dc.contributor.advisor: Mahyar, Hamidreza
dc.contributor.author: Khodadad, Mohammad
dc.contributor.department: Computational Engineering and Science
dc.date.accessioned: 2025-08-27T14:00:27Z
dc.date.available: 2025-08-27T14:00:27Z
dc.date.issued: 2025
dc.description.abstract: Large language models (LLMs) and embedding techniques have transformed general-purpose NLP, but their performance degrades on specialized scientific texts. In this thesis, we make three contributions to bridge this gap. First, we introduce two large-scale benchmark suites: ChemTEB, comprising 35 tasks on chemical corpora drawn from PubChem, CoconutDB, Safety Data Sheets, and Wikipedia; and MedTEB, comprising 51 medical tasks spanning EHR notes, PubMed abstracts, and clinical question–answer sets. Both cover classification, clustering, pair classification, retrieval, and bitext mining. Second, we propose MedTE, a 768-dimensional embedding model fine-tuned via self-supervised contrastive learning on an extensive biomedical corpus, which achieves state-of-the-art performance on MedTEB. Third, we develop GraphRAG, an automated pipeline that constructs chemical knowledge graphs from ChemRxiv preprints and generates multi-hop questions to assess compositional reasoning. Through rigorous evaluation, we show that ChemTEB reveals critical weaknesses in current chemical embeddings and that even with perfect context, LLMs achieve under 50% accuracy on multi-hop chemistry question answering. We release all benchmarks, code, and models to foster further research in domain adaptation and compositional reasoning for specialized NLP applications.
dc.description.degree: Master of Applied Science (MASc)
dc.description.degreetype: Thesis
dc.description.layabstract: Large language models often excel at general text but struggle with specialized scientific language. This thesis addresses this challenge with three main contributions. First, it introduces ChemTEB and MedTEB: two benchmark collections of 35 chemistry and 51 medical tasks, respectively, covering a range of text-analysis challenges. Second, it presents MedTE, a new 768-dimensional embedding model trained to better understand biomedical language, which achieves leading results on MedTEB. Third, it describes GraphRAG, an automated system that builds chemical knowledge graphs from research preprints and generates complex, multi-step questions to test reasoning. Our experiments reveal significant gaps in current models’ grasp of scientific text, with accuracy falling below 50% on multi-step chemistry questions. All benchmarks, code, and models are publicly released to advance research in specialized NLP.
dc.identifier.uri: http://hdl.handle.net/11375/32256
dc.language.iso: en
dc.subject: Large Language Models
dc.subject: Chemistry
dc.subject: Biomedicine
dc.subject: Medicine
dc.title: DOMAIN-SPECIFIC ADAPTATION AND MULTI-HOP REASONING IN CHEMISTRY AND BIOMEDICINE
dc.type: Thesis

Files

Original bundle

Name: Mohammad_s_thesis (9).pdf
Size: 5.85 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission