Enabling Human–Autonomy Teaming in Medical Imaging Through Transformers
Abstract
Significant progress in deep learning has positioned highly capable, fully automated
medical imaging tools on the near horizon. However, translation from research into
clinical practice has been slower than the rapid pace of technological advancement. A more
feasible alternative leverages the strengths of both a human operator and
machine-learning-based tools, with a decision support system augmenting
human ability. Termed human–autonomy teaming (HAT), the goal is for teams of
human operators and autonomous machines to work together on required tasks. Unfortunately,
the adoption of HATs has also been slow. Interestingly, transformer-based deep
learning architectures bring two unique properties, the attention mechanism and generative
pre-training, that may have applications across different medical imaging tasks,
and further investigation may provide insights for better adoption of HATs.
Conceptually, the attention mechanism tells the model which portions of the input
to focus on. Using this as a means of examining model decisions, the attention
heads of a Vision Transformer (ViT) and the important pixels of a ResNet50 model were
compared against radiologist annotations to determine the appropriateness of each model's
attention in classifying two datasets. Both models showed high agreement with the
radiologist annotations, at 88.07% and 94.85% for the ResNet50 and 94.72% and 96.96%
for the ViT across the two datasets; however, the Vision Transformer performed better
in both test accuracy and agreement.
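
As a rough illustration of this kind of comparison, the sketch below shows how an attention map taken from the [CLS] token can be upsampled to pixel resolution and scored against a radiologist's annotation mask. It is a minimal example assuming a 224x224 input with 16x16 patches (a 14x14 grid); the `agreement` helper and its threshold are hypothetical stand-ins, not the thesis's actual protocol.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, the core operation
    # that lets the model weight certain portions of its input.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # one attention row per query token
    return weights @ v, weights

def cls_attention_map(weights, grid=14, patch=16):
    # Attention from the [CLS] token (row 0) to each image patch, reshaped
    # to the patch grid and upsampled to pixels (14x14 -> 224x224).
    cls_to_patches = weights[0, 1:]  # drop attention to [CLS] itself
    amap = cls_to_patches.reshape(1, 1, grid, grid)
    return F.interpolate(amap, scale_factor=patch, mode="bilinear")

def agreement(attn_map, annotation, threshold=0.5):
    # Hypothetical metric: normalize the map to [0, 1], threshold it, and
    # report the fraction of pixels matching a binary radiologist mask.
    pred = (attn_map / attn_map.max() > threshold).float()
    return (pred == annotation).float().mean().item()
```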
Generative pre-training is a method for training model weights without extra labelling.
During training, portions of each sample are masked before it is given to an autoencoder,
which tries to recover the original input. This training method has applications in
areas of medical imaging such as reconstruction and classification on imbalanced data.
Adapting generative pre-training methods to reconstruct undersampled k-space data,
a 25% masking ratio produced the most faithful reconstructions, with an average MSE
of 6.763e-4 and an SSIM of 0.917. Interestingly, our results align more closely with the
optimal masking ratio of language models, which is ideally between
15% and 25%, than with the 75% masking ratio demonstrated in an imaging application. This
reinforces the idea that further study is required to properly transfer broader deep
learning advances from computer vision to medical imaging. On the other hand, generative
pre-training through masked autoencoders (MAEs) was investigated as a method of
training on imbalanced data, achieving high classification performance of 95.24% on a
single neuroimaging modality.
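
A minimal sketch of the masking step at the heart of this pre-training scheme is shown below, assuming patch tokens as input and hypothetical `encoder`/`decoder` callables; as in standard masked autoencoder training, the loss is taken only over the masked patches, here with the 25% ratio that performed best in this work.

```python
import torch
import torch.nn.functional as F

def random_mask(patches, mask_ratio=0.25):
    # Randomly hide a fraction of the patch tokens; 25% was the
    # best-performing ratio for k-space reconstruction in this work.
    n_tokens = patches.shape[0]
    n_keep = int(n_tokens * (1 - mask_ratio))
    perm = torch.randperm(n_tokens)
    return perm[:n_keep], perm[n_keep:]  # visible / masked token indices

def mae_step(encoder, decoder, patches, mask_ratio=0.25):
    # One generative pre-training step: encode only the visible patches,
    # decode the full sequence, and penalize reconstruction error on the
    # masked patches. The encoder/decoder signatures are hypothetical.
    keep_idx, mask_idx = random_mask(patches, mask_ratio)
    latent = encoder(patches[keep_idx], keep_idx)
    recon = decoder(latent, keep_idx, mask_idx)  # predicts all tokens
    return F.mse_loss(recon[mask_idx], patches[mask_idx])
```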
Just as convolutional neural networks (CNNs) revolutionized the discipline, transformers
have likewise established themselves as a defining breakthrough. Bringing many
interesting avenues for exploration with them, transformers have pushed the state of
the art, and deep learning models now see use all around the world. Beyond the performance
increases that transformers brought about, there is another lesson to be learned: whatever
the next ground-breaking architecture may be, researchers should consider delving into all
of its aspects and applications rather than blindly hoping for performance improvements
and fully automated tools.