Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32255
Title: Domain-Specific Text Embedding Models for Information Retrieval
Authors: Shiraee Kasmaee, Ali
Advisor: Mahyar, Hamidreza
Department: Computational Engineering and Science
Keywords: Deep Learning; Natural Language Processing; Information Retrieval
Publication Date: 2025
Abstract: Large Language Models (LLMs) have shown advanced capabilities across many fields. However, using these models out of the box, especially in specialized domains such as chemistry, often leads to issues such as context-length limitations, hallucinations, difficulty updating parametric knowledge, and unclear provenance for generated responses. To tackle these challenges, Retrieval-Augmented Generation (RAG) lets language models consult external knowledge sources during inference, improving factual accuracy and enabling dynamic knowledge retrieval without costly retraining. A critical component of any RAG system is the text embedding model, which retrieves the most relevant documents from a knowledge base for a given query. However, standard embedding models trained on general-purpose datasets perform poorly in chemistry because of the field's unique vocabulary, specialized terminology, and complex semantics. This thesis introduces the Chemical Text Embedding Benchmark (ChemTEB), designed specifically to evaluate embedding models on chemical tasks. ChemTEB systematically measures model performance, clearly identifying strengths and weaknesses in handling chemistry-related text. Using insights from ChemTEB, we developed ChEmbed, a family of text embedding models fine-tuned specifically for chemistry. To achieve this, we gathered domain-specific text from chemistry research articles and public resources, then generated synthetic queries for these texts using LLMs, creating realistic query-passage pairs for training. This approach addresses the limited availability of training data and improves the model's ability to represent chemistry-specific language. Additionally, we introduced a domain-specific tokenizer that efficiently integrates chemical terms into an existing pretrained model, improving the accuracy of its text representations.
Together, ChemTEB and ChEmbed offer the first domain-adapted solution for chemical text retrieval, overcoming the performance limitations of general embedding models and contributing to more accurate and interpretable AI-based chemical research and discovery. Although the focus here is chemistry, the methods can serve as a practical framework for adapting embedding models to other specialized fields.
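The retrieval step the abstract describes — embedding a query and ranking passages by similarity — can be sketched as follows. This is a minimal illustration only: the hashing bag-of-words embedder is a stand-in for a trained model such as ChEmbed, and all function names and the example passages are hypothetical, not from the thesis.

```python
import math

def embed(text, dim=64):
    # Toy hashed bag-of-words embedder; a real RAG system would call a
    # trained embedding model here (this stand-in is an assumption).
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # L2-normalized vector

def cosine(a, b):
    # Vectors are already unit-length, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, passages, k=1):
    # Rank passages by similarity to the query embedding; return the top k.
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

passages = [
    "Benzene is an aromatic hydrocarbon with formula C6H6.",
    "The stock market closed higher on Friday.",
]
print(retrieve("aromatic hydrocarbon benzene", passages))
```

In a production pipeline, passage embeddings would be precomputed and stored in a vector index rather than recomputed per query; the ranking logic itself is unchanged.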
Appears in Collections: Open Access Dissertations and Theses
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---
Shiraee Kasmaee_Ali_202508_masc.pdf | | 2.63 MB | Adobe PDF | View/Open
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.