Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/32396
Title: USE OF LARGE LANGUAGE AND REASONING MODELS IN THE GENERATION AND EVALUATION OF EXPERT- AND PATIENT-ORIENTED SUMMARIES OF RADIOLOGY REPORT FINDINGS
Authors: Tasneem, Nanziba
Advisor: Saha, Ashirbani
Department: eHealth
Keywords: Artificial Intelligence;Large Language Models;Radiology;Natural Language Processing;Health Communication
Publication Date: 2025
Abstract: Background: Expert- and patient-facing clinical summaries are important to the patient care journey, but creating them is time-consuming for healthcare providers. Large Language Models (LLMs) can be used for clinical text summarization; however, comprehensive evaluations are necessary prior to implementation. The objectives of this thesis were to (1) evaluate five LLMs [GPT-4, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Llama 3.1] for impression generation (expert-facing), (2) evaluate the same models for lay summary generation (patient-facing) using a mixed-method evaluation framework comparing layperson and Large Reasoning Model (LRM) evaluations, and (3) assess the reliability of the LRM (Gemini 2.5 Pro) as an evaluator.

Methods: 100 radiology reports were sampled from the "BioNLP 2023 report summarization" dataset. Each LLM generated impressions (Chapter 2) and lay summaries (Chapters 3 and 4) using optimized prompts and hyperparameters. Impressions were evaluated by experts, the LRM, and similarity metrics; lay summaries were evaluated by experts, laypersons, the LRM, and readability metrics. Performance rankings were based on agreement percentages, with statistical analyses including the Friedman test, post-hoc Nemenyi tests, the Kruskal-Wallis test, and the Mann-Whitney U test.

Results: For impression generation, Gemini 1.5 Pro outperformed GPT-4 in coherence, comprehensiveness, and reduced medical harmfulness, despite its lower cost. LRM and human evaluations showed complete disagreement on only 2.15% of cases. For lay summaries, Gemini 1.5 Flash and Pro were top-rated for producing actionable, readable summaries requiring minimal supervision (P < 9.03×10⁻²¹). GPT-4 had the highest expert-rated accuracy (98%), while Gemini 1.5 Pro had the best readability score. Laypersons reported the highest understanding of GPT-4o mini and Gemini 1.5 Pro summaries. LRM-layperson agreement varied by category and model.

Conclusion: Gemini 1.5 Flash and Pro consistently ranked among the top performers for both impression and lay summary generation. GPT-4o mini also showed strong patient-facing characteristics. These findings highlight LLMs' potential to improve clinical communication and the value of LRM-based evaluation frameworks.
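The abstract's methods describe ranking models with a Friedman test over per-report evaluation scores. As a minimal illustrative sketch, the test statistic can be computed from rank sums, assuming one score per (report, model) pair; the scores below are synthetic placeholders, not data from the thesis, and a real analysis would typically call scipy.stats.friedmanchisquare instead.

```python
def rank_row(row):
    """Rank one report's scores across models (1 = lowest), ties get the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """Friedman chi-square statistic; scores is a list of rows (reports x models)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(rank_row(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(R * R for R in rank_sums) - 3 * n * (k + 1)

# Three hypothetical reports, each scored for three models:
scores = [[3, 5, 4], [2, 5, 4], [3, 4, 5]]
print(round(friedman_statistic(scores), 2))  # → 4.67
```

A significant statistic would then motivate the post-hoc Nemenyi comparisons the abstract mentions, which localize which model pairs actually differ.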
URI: http://hdl.handle.net/11375/32396
Appears in Collections:Open Access Dissertations and Theses

Files in This Item:
File: Tasneem_Nanziba_finalsubmission2025September_eHealth.pdf
Size/Format: 3.14 MB, Adobe PDF
Embargoed until: 2026-09-10


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
