Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/32396
Full metadata record (DC field: value)
dc.contributor.advisor: Saha, Ashirbani
dc.contributor.author: Tasneem, Nanziba
dc.date.accessioned: 2025-09-24T18:42:36Z
dc.date.available: 2025-09-24T18:42:36Z
dc.date.issued: 2025
dc.identifier.uri: http://hdl.handle.net/11375/32396
dc.description.abstract: Background: Expert- and patient-facing clinical summaries are important in patient care, but creating them is time-consuming for healthcare providers. Large Language Models (LLMs) can be used for clinical text summarization; however, comprehensive evaluations are necessary before implementation. The objectives of this thesis were to (1) evaluate five LLMs (GPT-4, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Llama 3.1) for impression generation (expert-facing), (2) evaluate the same LLMs for lay summary generation (patient-facing), using a mixed-method evaluation framework to compare laypersons and a Large Reasoning Model (LRM), and (3) assess the reliability of the LRM (Gemini 2.5 Pro) as an evaluator. Methods: 100 radiology reports were sampled from the "BioNLP 2023 report summarization" dataset. Each LLM generated impressions (Chapter 2) and lay summaries (Chapters 3 and 4) using optimized prompts and hyperparameters. Impressions were evaluated by experts, the LRM, and similarity metrics; lay summaries were evaluated by experts, laypersons, the LRM, and readability metrics. Performance rankings were based on agreement percentages, with statistical analyses including the Friedman test, post-hoc Nemenyi tests, the Kruskal-Wallis test, and the Mann-Whitney U test. Results: For impression generation, Gemini 1.5 Pro outperformed GPT-4 in coherence, comprehensiveness, and reduced medical harmfulness, despite lower cost. LRM and human evaluations showed complete disagreement on only 2.15% of items. For lay summaries, Gemini 1.5 Flash and Gemini 1.5 Pro were top-rated for actionable, readable summaries requiring minimal supervision (P < 9.03×10⁻²¹). GPT-4 had the highest expert-rated accuracy (98%), while Gemini 1.5 Pro had the best readability score. Laypersons reported the highest understanding of GPT-4o mini and Gemini 1.5 Pro summaries. LRM-layperson agreement varied by category and model.
Conclusion: Gemini 1.5 Flash and Gemini 1.5 Pro consistently ranked among the top performers for impression and lay summary generation. GPT-4o mini also showed strong patient-facing characteristics. These findings highlight LLMs' potential to improve clinical communication and the value of LRM-based evaluation frameworks.
dc.language.iso: en
dc.subject: Artificial Intelligence
dc.subject: Large Language Models
dc.subject: Radiology
dc.subject: Natural Language Processing
dc.subject: Health Communication
dc.title: USE OF LARGE LANGUAGE AND REASONING MODELS IN THE GENERATION AND EVALUATION OF EXPERT- AND PATIENT-ORIENTED SUMMARIES OF RADIOLOGY REPORT FINDINGS
dc.type: Thesis
dc.contributor.department: eHealth
dc.description.degreetype: Thesis
dc.description.degree: Master of Science (MSc)
dc.description.layabstract: Creating clear medical summaries for both doctors and patients is important, but it adds extra work for healthcare providers. This study explores how Artificial Intelligence (AI) models can help generate and evaluate summaries from radiology reports. Five models (Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o mini, GPT-4, and Llama 3.1) were prompted to generate two types of summaries: expert summaries for clinicians and lay summaries for patients. These summaries were then evaluated by experts, laypersons (for the lay summaries), Gemini 2.5 Pro (an AI model), and quantitative metrics. The results show that Gemini 1.5 Pro and GPT-4 generated coherent and accurate impressions, while Gemini 1.5 Flash and Gemini 1.5 Pro produced lay summaries without inaccuracies and with increased readability. Laypersons reported higher understanding and confidence with GPT-4o mini and Gemini 1.5 Pro summaries. These findings show the potential of using AI to support and evaluate clinical text summarization.
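The methods describe ranking the five LLMs from repeated ratings on the same 100 reports using a Friedman test followed by post-hoc Nemenyi comparisons. As a minimal sketch of that kind of analysis (with entirely made-up ratings, not the study's data), the Friedman step might look like:

```python
# Hypothetical sketch of the ranking analysis described in the abstract:
# a Friedman test comparing five models rated on the same set of reports.
# All ratings below are synthetic illustrative data, not the thesis results.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
models = ["GPT-4", "GPT-4o mini", "Gemini 1.5 Pro", "Gemini 1.5 Flash", "Llama 3.1"]

# Rows = 100 reports, columns = models; each cell is an evaluator rating (1-5).
ratings = rng.integers(1, 6, size=(100, len(models)))

# The Friedman test treats each report as a "block" and checks whether the
# models' rating distributions differ across blocks.
stat, p = friedmanchisquare(*(ratings[:, j] for j in range(len(models))))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# If p falls below the chosen significance level, a post-hoc Nemenyi test
# (e.g. scikit-posthocs' posthoc_nemenyi_friedman) would identify which
# model pairs differ.
```

This is only an illustration of the named statistical procedure; the thesis itself specifies the actual rating scales, agreement percentages, and follow-up tests.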
Appears in Collections:Open Access Dissertations and Theses

Files in This Item:
File: Tasneem_Nanziba_finalsubmission2025September_eHealth.pdf
Embargoed until: 2026-09-10
Size: 3.14 MB
Format: Adobe PDF


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
