Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32255
Title: Domain-Specific Text Embedding Models for Information Retrieval
Authors: Shiraee Kasmaee, Ali
Advisor: Mahyar, Hamidreza
Department: Computational Engineering and Science
Keywords: Deep Learning; Natural Language Processing; Information Retrieval
Publication Date: 2025
Abstract: Large Language Models (LLMs) have shown advanced capabilities across many fields. However, using these models out of the box, especially in specialized domains such as chemistry, often leads to issues such as context-length limitations, hallucinations, difficulty updating parametric knowledge, and unclear provenance for generated responses. To tackle these challenges, Retrieval-Augmented Generation (RAG) lets language models consult external knowledge sources during inference, improving factual accuracy and enabling dynamic knowledge retrieval without costly retraining. A critical component of any RAG system is the text embedding model, which retrieves the most relevant documents from a knowledge base for a given query. However, standard embedding models trained on general-purpose datasets perform poorly in chemistry because of the field's unique vocabulary, specialized terminology, and complex semantics. This thesis introduces the Chemical Text Embedding Benchmark (ChemTEB), designed specifically to evaluate embedding models on chemical tasks. ChemTEB systematically measures model performance, clearly identifying strengths and weaknesses in handling chemistry-related text. Using insights from ChemTEB, we developed ChEmbed, a family of text embedding models fine-tuned specifically for chemistry. To achieve this, we gathered domain-specific text from chemistry research articles and public resources, then generated synthetic queries for these texts using LLMs, creating realistic query-passage pairs for training. This approach addresses the limited availability of training data and improves the model's ability to represent chemistry-specific language. Additionally, we introduced a domain-specific tokenizer that efficiently integrates chemical terms into an existing pretrained model, improving the accuracy of its text representations.
Together, ChemTEB and ChEmbed offer the first domain-adapted solution for chemical text retrieval, overcoming the performance limitations of general embedding models and contributing to more accurate and interpretable AI-based chemical research and discovery. Although the focus here is chemistry, the methods can serve as a practical framework for adapting embedding models to other specialized fields.
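The retrieval step the abstract describes — embedding a query and ranking passages by similarity — can be sketched as follows. This is a minimal illustration only: the hashing bag-of-words embedder is a stand-in for a trained model such as ChEmbed, and all function names and the example passages are hypothetical, not from the thesis.

```python
import math

def embed(text, dim=64):
    # Toy hashed bag-of-words embedder; a real RAG system would call a
    # trained embedding model here (this stand-in is an assumption).
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # L2-normalized vector

def cosine(a, b):
    # Vectors are already unit-length, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, passages, k=1):
    # Rank passages by similarity to the query embedding; return the top k.
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

passages = [
    "Benzene is an aromatic hydrocarbon with formula C6H6.",
    "The stock market closed higher on Friday.",
]
print(retrieve("aromatic hydrocarbon benzene", passages))
```

In a production pipeline, passage embeddings would be precomputed and stored in a vector index rather than recomputed per query; the ranking logic itself is unchanged.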
Appears in Collections: Open Access Dissertations and Theses
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---
Shiraee Kasmaee_Ali_202508_masc.pdf | | 2.63 MB | Adobe PDF | View/Open
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.