Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/32396
Title: USE OF LARGE LANGUAGE AND REASONING MODELS IN THE GENERATION AND EVALUATION OF EXPERT- AND PATIENT-ORIENTED SUMMARIES OF RADIOLOGY REPORT FINDINGS
Authors: Tasneem, Nanziba
Advisor: Saha, Ashirbani
Department: eHealth
Keywords: Artificial Intelligence; Large Language Models; Radiology; Natural Language Processing; Health Communication
Publication Date: 2025
Abstract: Background: Expert- and patient-facing clinical summaries are important to patient care journeys, but creating them is time-consuming for healthcare providers. Large Language Models (LLMs) can be used for clinical text summarization; however, comprehensive evaluations are necessary prior to implementation. The objectives of this thesis were to (1) evaluate five LLMs [GPT-4, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Llama 3.1] for impression generation (expert-facing), (2) evaluate the same models for lay summary generation (patient-facing) using a mixed-method evaluation framework comparing laypersons and a Large Reasoning Model (LRM), and (3) assess the reliability of the LRM (Gemini 2.5 Pro) as an evaluator.
Methods: 100 radiology reports were sampled from the "BioNLP 2023 report summarization" dataset. Each LLM generated impressions (Chapter 2) and lay summaries (Chapters 3 and 4) using optimized prompts and hyperparameters. Impressions were evaluated by experts, the LRM, and similarity metrics; lay summaries were evaluated by experts, laypersons, the LRM, and readability metrics. Performance rankings were based on agreement percentages, with statistical analyses including the Friedman test, post-hoc Nemenyi tests, the Kruskal-Wallis test, and the Mann-Whitney U test.
Results: For impression generation, Gemini 1.5 Pro outperformed GPT-4 in coherence, comprehensiveness, and reduced medical harmfulness, despite its lower cost. LRM and human evaluations disagreed completely on 2.15% of cases. For lay summaries, Gemini 1.5 Flash and Pro were rated highest for actionable, readable summaries requiring minimal supervision (P < 9.03×10⁻²¹). GPT-4 had the highest expert-rated accuracy (98%), while Gemini 1.5 Pro had the best readability score. Laypersons understood GPT-4o mini and Gemini 1.5 Pro summaries best. LRM-layperson agreement varied by category and model.
Conclusion: Gemini 1.5 Flash and Pro consistently ranked among the top performers for impression and lay summary generation. GPT-4o mini also showed strong patient-facing characteristics. These findings highlight LLMs' potential to improve clinical communication and the value of LRM-based evaluation frameworks.
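For readers unfamiliar with the ranking procedure named in the Methods, the following is a minimal sketch of a Friedman test with a pointer to post-hoc Nemenyi comparisons. It uses hypothetical placeholder ratings, not the thesis's evaluation data; only the model names are taken from the abstract, and everything else is assumed for illustration.

```python
# Illustrative sketch only: placeholder ratings, not the thesis's data.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
models = ["GPT-4", "GPT-4o mini", "Gemini 1.5 Pro", "Gemini 1.5 Flash", "Llama 3.1"]

# Hypothetical Likert-style scores: 100 reports, one rating per model per report.
ratings = rng.integers(1, 6, size=(100, len(models)))

# Friedman test: do the five models' ratings differ across the same reports?
stat, p = friedmanchisquare(*ratings.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3g}")

# If significant, post-hoc pairwise Nemenyi comparisons (e.g. scikit-posthocs'
# posthoc_nemenyi_friedman) identify which pairs of models differ.
```

The Friedman test fits this design because each of the 100 reports acts as its own block, with all five models rated on the same items.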
URI: | http://hdl.handle.net/11375/32396 |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Tasneem_Nanziba_finalsubmission2025September_eHealth.pdf | | 3.14 MB | Adobe PDF
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.