Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/31596
Title: | SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents |
Authors: | Vasantharajan, Charangan |
Advisor: | Thia, Kirubarajan |
Department: | Electrical and Computer Engineering |
Keywords: | Retrieval-Augmented Generation (RAG); Scientific Document Processing; Large Language Models (LLMs); Domain Adaptation; Scientific Text Understanding; LaTeX Handling |
Publication Date: | 2025 |
Abstract: | Large Language Models (LLMs) have achieved remarkable success in general-purpose natural language understanding and generation. However, their effectiveness diminishes in scientific and technical domains, where documents contain dense mathematical notation, complex layouts, and specialized terminology. These characteristics pose significant challenges for traditional LLM pipelines, often resulting in hallucinated outputs, misinterpretation of formulas, and failures in retrieving relevant context. This thesis introduces SciRAG, a retrieval-focused fine-tuning strategy designed specifically for scientific documents. SciRAG combines structure-preserving document parsing, context-aware chunking, and domain-adapted fine-tuning using Low-Rank Adaptation (LoRA) to enhance an LLM's ability to understand and generate scientifically accurate content. The system incorporates a custom Retrieval-Augmented Generation (RAG) framework that supports semantic alignment of mathematical expressions and technical language across large corpora. Experimental evaluations demonstrate that SciRAG achieves strong performance in scientific question answering and mathematical reasoning. Notably, the model attains 70% accuracy on the GSM8K benchmark, alongside high retrieval and generation quality, achieving a Context Recall score of 0.85, Factual Correctness of 0.45, Faithfulness of 0.45, and Semantic Similarity of 0.94. These results underscore SciRAG’s effectiveness in bridging the gap between general-purpose LLMs and domain-specific, mathematically grounded language understanding. |
URI: | http://hdl.handle.net/11375/31596 |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Vasantharajan_Charangan_202504_MASc.pdf | | 6.05 MB | Adobe PDF | View/Open |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.