Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32396
Title: USE OF LARGE LANGUAGE AND REASONING MODELS IN THE GENERATION AND EVALUATION OF EXPERT- AND PATIENT-ORIENTED SUMMARIES OF RADIOLOGY REPORT FINDINGS
Authors: Tasneem, Nanziba
Advisor: Saha, Ashirbani
Department: eHealth
Keywords: Artificial Intelligence; Large Language Models; Radiology; Natural Language Processing; Health Communication
Publication Date: 2025
Abstract: Background: Expert- and patient-facing clinical summaries are important to patient care journeys, but creating them is time-consuming for healthcare providers. Large Language Models (LLMs) can be used for clinical text summarization; however, comprehensive evaluations are necessary prior to implementation. The objectives of this thesis were to (1) evaluate five LLMs [GPT-4, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Llama 3.1] for impression generation (expert-facing), (2) evaluate the same models for lay summary generation (patient-facing) using a mixed-method evaluation framework comparing laypersons and a Large Reasoning Model (LRM), and (3) assess the reliability of the LRM (Gemini 2.5 Pro) as an evaluator.
Methods: 100 radiology reports were sampled from the "BioNLP 2023 report summarization" dataset. Each LLM generated impressions (Chapter 2) and lay summaries (Chapters 3 and 4) using optimized prompts and hyperparameters. Impressions were evaluated by experts, the LRM, and similarity metrics; lay summaries were evaluated by experts, laypersons, the LRM, and readability metrics. Performance rankings were based on agreement percentages, with statistical analyses including the Friedman test, post-hoc Nemenyi tests, the Kruskal-Wallis test, and the Mann-Whitney U test.
Results: For impression generation, Gemini 1.5 Pro outperformed GPT-4 in coherence, comprehensiveness, and reduced medical harmfulness, despite its lower cost. LRM and human evaluations disagreed completely on 2.15% of cases. For lay summaries, Gemini 1.5 Flash and Pro were rated highest for actionable, readable summaries requiring minimal supervision (P < 9.03×10⁻²¹). GPT-4 had the highest expert-rated accuracy (98%), while Gemini 1.5 Pro had the best readability score. Laypersons understood GPT-4o mini and Gemini 1.5 Pro summaries best. LRM-layperson agreement varied by category and model.
Conclusion: Gemini 1.5 Flash and Pro consistently ranked among the top performers for impression and lay summary generation. GPT-4o mini also showed strong patient-facing characteristics. These findings highlight LLMs' potential to improve clinical communication and the value of LRM-based evaluation frameworks.
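For readers unfamiliar with the ranking procedure named in the Methods, the following is a minimal sketch of a Friedman test with a pointer to post-hoc Nemenyi comparisons. It uses hypothetical placeholder ratings, not the thesis's evaluation data; only the model names are taken from the abstract, and everything else is assumed for illustration.

```python
# Illustrative sketch only: placeholder ratings, not the thesis's data.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
models = ["GPT-4", "GPT-4o mini", "Gemini 1.5 Pro", "Gemini 1.5 Flash", "Llama 3.1"]

# Hypothetical Likert-style scores: 100 reports, one rating per model per report.
ratings = rng.integers(1, 6, size=(100, len(models)))

# Friedman test: do the five models' ratings differ across the same reports?
stat, p = friedmanchisquare(*ratings.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3g}")

# If significant, post-hoc pairwise Nemenyi comparisons (e.g. scikit-posthocs'
# posthoc_nemenyi_friedman) identify which pairs of models differ.
```

The Friedman test fits this design because each of the 100 reports acts as its own block, with all five models rated on the same items.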
URI: | http://hdl.handle.net/11375/32396 |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Tasneem_Nanziba_finalsubmission2025September_eHealth.pdf | | 3.14 MB | Adobe PDF
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.