Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/32396
Full metadata record (DC field: value)
dc.contributor.advisor: Saha, Ashirbani
dc.contributor.author: Tasneem, Nanziba
dc.date.accessioned: 2025-09-24T18:42:36Z
dc.date.available: 2025-09-24T18:42:36Z
dc.date.issued: 2025
dc.identifier.uri: http://hdl.handle.net/11375/32396
dc.description.abstract: Background: Expert- and patient-facing clinical summaries are important in patient care, but creating them is time-consuming for healthcare providers. Large Language Models (LLMs) can be used for clinical text summarization; however, comprehensive evaluations are necessary before implementation. The objectives of this thesis were to (1) evaluate five LLMs (GPT-4, GPT-4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, and Llama 3.1) for impression generation (expert-facing), (2) evaluate the same LLMs for lay summary generation (patient-facing), using a mixed-method evaluation framework to compare laypersons and a Large Reasoning Model (LRM), and (3) assess the reliability of the LRM (Gemini 2.5 Pro) as an evaluator. Methods: 100 radiology reports were sampled from the "BioNLP 2023 report summarization" dataset. Each LLM generated impressions (Chapter 2) and lay summaries (Chapters 3 and 4) using optimized prompts and hyperparameters. Impressions were evaluated by experts, the LRM, and similarity metrics; lay summaries were evaluated by experts, laypersons, the LRM, and readability metrics. Performance rankings were based on agreement percentages, with statistical analyses including the Friedman test, post-hoc Nemenyi tests, the Kruskal-Wallis test, and the Mann-Whitney U test. Results: For impression generation, Gemini 1.5 Pro outperformed GPT-4 in coherence, comprehensiveness, and reduced medical harmfulness, despite lower cost. LRM and human evaluations showed complete disagreement on only 2.15% of items. For lay summaries, Gemini 1.5 Flash and Gemini 1.5 Pro were top-rated for actionable, readable summaries requiring minimal supervision (P < 9.03×10⁻²¹). GPT-4 had the highest expert-rated accuracy (98%), while Gemini 1.5 Pro had the best readability score. Laypersons reported the highest understanding of GPT-4o mini and Gemini 1.5 Pro summaries. LRM-layperson agreement varied by category and model.
Conclusion: Gemini 1.5 Flash and Gemini 1.5 Pro consistently ranked among the top performers for impression and lay summary generation. GPT-4o mini also showed strong patient-facing characteristics. These findings highlight LLMs' potential to improve clinical communication and the value of LRM-based evaluation frameworks.
dc.language.iso: en
dc.subject: Artificial Intelligence
dc.subject: Large Language Models
dc.subject: Radiology
dc.subject: Natural Language Processing
dc.subject: Health Communication
dc.title: USE OF LARGE LANGUAGE AND REASONING MODELS IN THE GENERATION AND EVALUATION OF EXPERT- AND PATIENT-ORIENTED SUMMARIES OF RADIOLOGY REPORT FINDINGS
dc.type: Thesis
dc.contributor.department: eHealth
dc.description.degreetype: Thesis
dc.description.degree: Master of Science (MSc)
dc.description.layabstract: Creating clear medical summaries for both doctors and patients is important, but it adds extra work for healthcare providers. This study explores how Artificial Intelligence (AI) models can help generate and evaluate summaries from radiology reports. Five models (Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o mini, GPT-4, and Llama 3.1) were prompted to generate two types of summaries: expert summaries for clinicians and lay summaries for patients. These summaries were then evaluated by experts, laypersons (for the lay summaries), Gemini 2.5 Pro (an AI model), and quantitative metrics. The results show that Gemini 1.5 Pro and GPT-4 generated coherent and accurate impressions, while Gemini 1.5 Flash and Gemini 1.5 Pro produced lay summaries without inaccuracies and with increased readability. Laypersons reported higher understanding and confidence with GPT-4o mini and Gemini 1.5 Pro summaries. These findings show the potential of using AI to support and evaluate clinical text summarization.
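The methods describe ranking the five LLMs from repeated ratings on the same 100 reports using a Friedman test followed by post-hoc Nemenyi comparisons. As a minimal sketch of that kind of analysis (with entirely made-up ratings, not the study's data), the Friedman step might look like:

```python
# Hypothetical sketch of the ranking analysis described in the abstract:
# a Friedman test comparing five models rated on the same set of reports.
# All ratings below are synthetic illustrative data, not the thesis results.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
models = ["GPT-4", "GPT-4o mini", "Gemini 1.5 Pro", "Gemini 1.5 Flash", "Llama 3.1"]

# Rows = 100 reports, columns = models; each cell is an evaluator rating (1-5).
ratings = rng.integers(1, 6, size=(100, len(models)))

# The Friedman test treats each report as a "block" and checks whether the
# models' rating distributions differ across blocks.
stat, p = friedmanchisquare(*(ratings[:, j] for j in range(len(models))))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# If p falls below the chosen significance level, a post-hoc Nemenyi test
# (e.g. scikit-posthocs' posthoc_nemenyi_friedman) would identify which
# model pairs differ.
```

This is only an illustration of the named statistical procedure; the thesis itself specifies the actual rating scales, agreement percentages, and follow-up tests.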
Appears in Collections:Open Access Dissertations and Theses

Files in This Item:
File: Tasneem_Nanziba_finalsubmission2025September_eHealth.pdf
Embargoed until: 2026-09-10
Size: 3.14 MB
Format: Adobe PDF


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
