SciRAG: A Retrieval-Focused Fine-Tuning Strategy for Scientific Documents

Abstract

Large Language Models (LLMs) have achieved remarkable success in general-purpose natural language understanding and generation. However, their effectiveness diminishes in scientific and technical domains, where documents contain dense mathematical notation, complex layouts, and specialized terminology. These characteristics pose significant challenges for traditional LLM pipelines, often resulting in hallucinated outputs, misinterpreted formulas, and failures to retrieve relevant context. This thesis introduces SciRAG, a Retrieval-Focused Fine-Tuning Strategy designed specifically for scientific documents. SciRAG combines structure-preserving document parsing, context-aware chunking, and domain-adapted fine-tuning using Low-Rank Adaptation (LoRA) to enhance an LLM's ability to understand and generate scientifically accurate content. The system incorporates a custom Retrieval-Augmented Generation (RAG) framework that supports semantic alignment of mathematical expressions and technical language across large corpora. Experimental evaluations demonstrate that SciRAG performs strongly on scientific question answering and mathematical reasoning. Notably, the model attains 70% accuracy on the GSM8k benchmark and shows strong retrieval quality and semantic alignment, with a Context Recall score of 0.85 and a Semantic Similarity of 0.94, alongside Factual Correctness and Faithfulness scores of 0.45. These results underscore SciRAG's effectiveness in bridging the gap between general-purpose LLMs and domain-specific, mathematically grounded language understanding.
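
As a rough illustration of the LoRA adaptation step the abstract describes, the sketch below uses the Hugging Face transformers and peft libraries. The base checkpoint name and the hyperparameter values (r, lora_alpha, target_modules) are illustrative assumptions, not the configuration reported in the thesis.

    # Minimal LoRA fine-tuning setup (sketch; checkpoint and
    # hyperparameters are assumptions, not the thesis's actual choices).
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base_model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint
    )

    lora_config = LoraConfig(
        r=16,                                 # rank of the low-rank update matrices
        lora_alpha=32,                        # scaling factor applied to the update
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only the small adapter matrices train

Because LoRA freezes the base model's weights and trains only small rank-decomposition matrices added to selected projections, domain adaptation of this kind requires a fraction of the memory of full fine-tuning, which is what makes adapting a general-purpose LLM to a scientific corpus practical.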
