Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/32532
Title: Enabling Human–Autonomy Teaming in Medical Imaging Through Transformers
Authors: Zhu, Calvin
Advisor: Noseworthy, Michael; Doyle, Thomas
Department: Biomedical Engineering
Keywords: Deep Learning; Human Autonomy Teaming; Transformers; Convolutional Neural Networks; Machine Learning
Publication Date: 2025
Abstract: Significant progress in deep learning has put highly capable, fully automated medical imaging tools on the near horizon. However, translation from research into clinical practice has been slower than the rapid pace of technological advancement. A more feasible alternative leverages the strengths of both the human operator and machine learning based tools, with a decision support system augmenting human ability. Termed human-autonomy teaming (HAT), this approach has human operators and autonomous machines work together on required tasks. Adoption of HATs has unfortunately also been slow. Transformer-based deep learning architectures bring two unique properties, the attention mechanism and generative pre-training, that apply across a range of medical imaging tasks, and investigating them may provide insights for better adoption of HATs.

Conceptually, the attention mechanism tells the model which portions of the input to focus on. Using attention as a means to examine model decisions, the attention heads of a Vision Transformer (ViT) and the important pixels of a ResNet50 model were compared against radiologist annotations to determine how appropriately each model focused its attention when classifying two datasets. Both models showed high agreement with the radiologist annotations: 88.07% and 94.85% for the ResNet50 across the two datasets, and 94.72% and 96.96% for the ViT. The Vision Transformer performed better in both test accuracy and agreement.

Generative pre-training is a method for training model weights without additional labelling: during training, portions of each sample are masked before it is given to an autoencoder, which tries to recover the original input. This training method has applications in areas of medical imaging such as reconstruction and classification on imbalanced data. Adapting generative pre-training to reconstruct undersampled k-space data, a 25% masking ratio produced the most faithful reconstructions, with an average MSE of 6.763e-4 and an SSIM of 0.917. Interestingly, this result aligns more closely with the optimal masking ratio reported for language models, between 15% and 25%, than with the 75% masking ratio demonstrated in an imaging application, reinforcing the idea that further study is required to properly adapt broader deep learning advances from computer vision to medical imaging. Generative pre-training through masked autoencoders (MAEs) was also investigated as a method of training on imbalanced data, achieving a high single-modality neuroimaging classification performance of 95.24%.

Just as convolutional neural networks (CNNs) revolutionized the discipline, transformers have established themselves as a defining breakthrough. They have pushed the state-of-the-art and opened many avenues for exploration, and deep learning models now see use all around the world. Beyond the performance gains that transformers brought, there is another lesson: whatever the next ground-breaking architecture may be, researchers should consider delving into all of its aspects and applications rather than blindly hoping for performance improvements and fully automated tools.
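
To make the attention mechanism described in the abstract concrete, the following is a minimal sketch of scaled dot-product attention, the operation underlying the ViT attention heads examined in the thesis. It is illustrative only and is not the thesis code; the array shapes and names are assumptions.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (num_patches, d) query, key, and value matrices.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # patch-to-patch similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        # Each row of `weights` is the attention one patch pays to all patches;
        # maps like these are what get compared against radiologist annotations.
        return weights @ V, weights

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    out, attn = scaled_dot_product_attention(Q, K, V)
    print(attn.shape)  # (4, 4)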
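Similarly, below is a hedged sketch of the masking step in MAE-style generative pre-training, assuming patchified inputs; the masking ratio is the hyperparameter the abstract compares at 25% versus 75%. Function and variable names are illustrative, not taken from the thesis.

    import numpy as np

    def mask_patches(patches, mask_ratio, rng):
        # patches: (num_patches, patch_dim). Returns a corrupted copy plus
        # the indices of the masked patches.
        n = patches.shape[0]
        masked_idx = rng.choice(n, size=int(round(mask_ratio * n)), replace=False)
        corrupted = patches.copy()
        corrupted[masked_idx] = 0.0   # zero out the masked patches
        return corrupted, masked_idx

    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 32))   # e.g. 16 patches of an image or k-space sample
    corrupted, idx = mask_patches(patches, mask_ratio=0.25, rng=rng)
    # An autoencoder would take `corrupted` and be trained to recover `patches`,
    # typically with an MSE loss computed on the masked positions only.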
URI: http://hdl.handle.net/11375/32532
Appears in Collections:Open Access Dissertations and Theses

Files in This Item:
File: Zhu_Calvin_finalsubmission2025Oct_PhD.pdf
Embargoed until: 2026-10-10
Size: 3.2 MB
Format: Adobe PDF