Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/31871
Title: | GENERATIVE LARGE LANGUAGE MODELS FOR TRANSPARENT ARTIFICIAL INTELLIGENCE IN CLINICAL RESEARCH: ENHANCING INTERPRETABILITY THROUGH APPRAISAL AND EXPLANATION
Other Titles: | GENERATIVE LARGE LANGUAGE MODELS FOR TRANSPARENT ARTIFICIAL INTELLIGENCE IN CLINICAL RESEARCH
Authors: | Zhou, Fangwen |
Advisor: | Lokker, Cynthia |
Department: | eHealth |
Keywords: | Artificial intelligence;Natural language processing;Explainable AI;Large language models;Machine learning;Deep learning;Transformers;GPT;Evidence-based Medicine;Knowledge Translation |
Publication Date: | 2025 |
Abstract: | Background: The rapid growth of medical literature necessitates effective, transparent automation tools for classification. Generative large language models (LLMs), including the Generative Pre-trained Transformer (GPT), have the potential to provide transparent classification and to explain other black-box models.
Objective: This sandwich thesis evaluates the performance of GPT in (1) classifying biomedical literature compared with a fine-tuned BioLinkBERT model, and (2) explaining the decisions of encoder-only models with feature attributions, compared with traditional eXplainable AI (XAI) frameworks such as SHapley Additive exPlanations (SHAP) and integrated gradients (IG).
Methods: Randomly sampled, manually annotated clinical research articles from the Health Information Research Unit (HIRU) were used along with a top-performing BioLinkBERT classifier. In Chapter 2, GPT-4o and GPT-o3-mini were used, either alone or with BioLinkBERT's predictions in the prompt, to classify article methodological rigour against HIRU's criteria; either the title and abstract or the full text was provided to GPT. Performance was compared with the BioLinkBERT model and assessed primarily using the Matthews correlation coefficient (MCC). In Chapter 3, GPT-4o was used to generate feature attributions for the BioLinkBERT model through masking perturbations and was compared with SHAP and IG using a modified area over the perturbation curve (AOPC) metric, which quantifies how faithfully an attribution reflects the model's behaviour.
Results: GPT-4o alone, using full text (MCC 0.429), achieved classification performance comparable to BioLinkBERT (MCC 0.466); performance was worse with other models and inputs. As a perturbation explainer, GPT-4o performed poorly (AOPC 0.029), significantly underperforming SHAP (AOPC 0.222) and IG (AOPC 0.225). The important tokens identified by GPT did not align with the manual appraisal criteria.
Conclusion: GPT has potential in appraising biomedical literature, even without explicit training, and its transparency through textual explanations improves interpretability. Its poor performance in generating faithful feature attributions warrants future research. The inherent variability and stochasticity of GPT outputs necessitate careful prompting and reproducibility measures.
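For reference, the Matthews correlation coefficient (MCC) reported above has the standard binary-classification form (a textbook definition, not a formula quoted from the thesis):

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\]

MCC ranges from -1 to +1, with 0 indicating chance-level agreement, so the reported values of 0.429 and 0.466 reflect moderate agreement with the manual labels.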
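The Chapter 3 comparison rests on masking perturbations: mask input tokens and measure how the classifier's output changes. Below is a minimal sketch of a classical single-token occlusion explainer for a BioLinkBERT-style classifier, assuming a Hugging Face Transformers setup; the checkpoint name and helper functions are illustrative assumptions, and the thesis's actual approach, in which GPT-4o itself produces the attributions, is not reproduced here.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: the base model, not the thesis's fine-tuned classifier.
MODEL = "michiyasunaga/BioLinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def class_prob(input_ids: torch.Tensor, attention_mask: torch.Tensor, target: int) -> float:
    """Probability assigned to the target class for one encoded input."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return torch.softmax(logits, dim=-1)[0, target].item()

def occlusion_attributions(text: str, target: int = 1) -> list[tuple[str, float]]:
    """Score each token by the drop in class probability when it is masked."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    base = class_prob(enc["input_ids"], enc["attention_mask"], target)
    scores = []
    for i in range(enc["input_ids"].shape[1]):
        masked = enc["input_ids"].clone()
        masked[0, i] = tokenizer.mask_token_id  # replace one token with [MASK]
        # Attribution = how much the class probability falls without token i.
        scores.append(base - class_prob(masked, enc["attention_mask"], target))
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return sorted(zip(tokens, scores), key=lambda t: -t[1])  # most influential first
```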
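Attributions produced this way are then scored with the area over the perturbation curve (AOPC), introduced by Samek et al. (2017). The thesis uses a modified variant, so its exact formula may differ; the standard form is:

\[
\mathrm{AOPC} = \frac{1}{K+1} \sum_{k=0}^{K} \left( f_c\!\left(x^{(0)}\right) - f_c\!\left(x^{(k)}\right) \right)
\]

where \(x^{(k)}\) is the input with the \(k\) tokens the explainer ranks as most important masked out, and \(f_c\) is the model's probability for the predicted class \(c\). A faithful explainer masks genuinely influential tokens first, producing a steep probability drop and a high AOPC, which is consistent with SHAP (0.222) and IG (0.225) outscoring GPT-4o (0.029) above.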
URI: | http://hdl.handle.net/11375/31871 |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Zhou_Fangwen_2025Jun_MSc.pdf | | 1.99 MB | Adobe PDF