
SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents

dc.contributor.advisor | Thia, Kirubarajan
dc.contributor.author | Vasantharajan, Charangan
dc.contributor.department | Electrical and Computer Engineering | en_US
dc.date.accessioned | 2025-05-01T10:44:39Z
dc.date.available | 2025-05-01T10:44:39Z
dc.date.issued | 2025
dc.description.abstract | Large Language Models (LLMs) have achieved remarkable success in general-purpose natural language understanding and generation. However, their effectiveness diminishes in scientific and technical domains, where documents contain dense mathematical notation, complex layouts, and specialized terminology. These characteristics pose significant challenges for traditional LLM pipelines, often resulting in hallucinated outputs, misinterpretation of formulas, and failures in retrieving relevant context. This thesis introduces SciRAG, a Retrieval-Focused Fine-Tuning Strategy designed specifically for scientific documents. SciRAG combines structure-preserving document parsing, context-aware chunking, and domain-adapted fine-tuning using Low-Rank Adaptation (LoRA) to enhance an LLM's ability to understand and generate scientifically accurate content. The system incorporates a custom Retrieval-Augmented Generation (RAG) framework that supports semantic alignment of mathematical expressions and technical language across large corpora. Experimental evaluations demonstrate that SciRAG achieves strong performance in scientific question answering and mathematical reasoning. Notably, the model attains 70% accuracy on the GSM8k benchmark, alongside high retrieval and generation quality, achieving a Context Recall score of 0.85, Factual Correctness of 0.45, Faithfulness of 0.45, and Semantic Similarity of 0.94. These results underscore SciRAG’s effectiveness in bridging the gap between general-purpose LLMs and domain-specific, mathematically grounded language understanding. | en_US
dc.description.degree | Master of Applied Science (MASc) | en_US
dc.description.degreetype | Thesis | en_US
dc.identifier.uri | http://hdl.handle.net/11375/31596
dc.language.iso | en | en_US
dc.subject | Retrieval-Augmented Generation (RAG) | en_US
dc.subject | Scientific Document Processing | en_US
dc.subject | Large Language Models (LLMs) | en_US
dc.subject | Domain Adaptation | en_US
dc.subject | Scientific Text Understanding | en_US
dc.subject | LaTeX Handling | en_US
dc.title | SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents | en_US
dc.type | Thesis | en_US

Files

Original bundle

Name: Vasantharajan_Charangan_202504_MASc.pdf
Size: 5.91 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission