SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents

Vasantharajan, Charangan

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/31596

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Thia, Kirubarajan	-
dc.contributor.author	Vasantharajan, Charangan	-
dc.date.accessioned	2025-05-01T10:44:39Z	-
dc.date.available	2025-05-01T10:44:39Z	-
dc.date.issued	2025	-
dc.identifier.uri	http://hdl.handle.net/11375/31596	-
dc.description.abstract	Large Language Models (LLMs) have achieved remarkable success in general-purpose natural language understanding and generation. However, their effectiveness diminishes in scientific and technical domains, where documents contain dense mathematical notation, complex layouts, and specialized terminology. These characteristics pose significant challenges for traditional LLM pipelines, often resulting in hallucinated outputs, misinterpretation of formulas, and failures in retrieving relevant context. This thesis introduces SciRAG, a Retrieval-Focused Fine-Tuning Strategy designed specifically for scientific documents. SciRAG combines structure-preserving document parsing, context-aware chunking, and domain-adapted fine-tuning using Low-Rank Adaptation (LoRA) to enhance an LLM’s ability to understand and generate scientifically accurate content. The system incorporates a custom Retrieval-Augmented Generation (RAG) framework that supports semantic alignment of mathematical expressions and technical language across large corpora. Experimental evaluations demonstrate that SciRAG achieves strong performance in scientific question answering and mathematical reasoning. Notably, the model attains 70% accuracy on the GSM8K benchmark, alongside high retrieval and generation quality, achieving a Context Recall score of 0.85, Factual Correctness of 0.45, Faithfulness of 0.45, and Semantic Similarity of 0.94. These results underscore SciRAG’s effectiveness in bridging the gap between general-purpose LLMs and domain-specific, mathematically grounded language understanding.	en_US
dc.language.iso	en	en_US
dc.subject	Retrieval-Augmented Generation (RAG)	en_US
dc.subject	Scientific Document Processing	en_US
dc.subject	Large Language Models (LLMs)	en_US
dc.subject	Domain Adaptation	en_US
dc.subject	Scientific Text Understanding	en_US
dc.subject	LaTeX Handling	en_US
dc.title	SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents	en_US
dc.type	Thesis	en_US
dc.contributor.department	Electrical and Computer Engineering	en_US
dc.description.degreetype	Thesis	en_US
dc.description.degree	Master of Applied Science (MASc)	en_US
Appears in Collections:	Open Access Dissertations and Theses

Files in This Item:

File	Description	Size	Format
Vasantharajan_Charangan_202504_MASc.pdf Open Access		6.05 MB	Adobe PDF
