Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Naseem, Usman, Khushi, Matloob, and Kim, Jinman (2022) Vision-Language Transformer for Interpretable Pathology Visual Question Answering. IEEE Journal of Biomedical and Health Informatics, 27 (4). pp. 1681-1690.

PDF (Published Version) - Restricted to Repository staff only

View at Publisher Website: https://doi.org/10.1109/JBHI.2022.316375...


Abstract

Pathology visual question answering (PathVQA) attempts to answer a medical question posed about pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interactions between the image (vision) and the question (language) to generate an answer. Existing methods treated vision and language features independently and were therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offered no capability to interpret the retrieved answers, which remain obscure to humans; the models' interpretability to justify the retrieved answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (images) and language (questions) features for an interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), where we embed the transformer's encoder layers with vision and language features extracted using a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer is then embedded to upsample the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed the state-of-the-art comparative methods on the public PathVQA dataset. Our experiments also validated the robustness of our model on another medical VQA dataset, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reason for a retrieved answer in PathVQA.
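
As a rough illustration of the architecture described in the abstract, below is a minimal PyTorch sketch of an encoder-decoder vision-language VQA model: CNN image features and domain-specific LM question features are projected into a shared space, fused through transformer encoder and decoder layers, and classified over an answer vocabulary. This is not the authors' implementation; the class name, backbone choices (ResNet-50 and BioBERT as stand-ins), dimensions, and the mean-pooled classification head are illustrative assumptions.

# Minimal sketch of an encoder-decoder vision-language VQA model in the spirit
# of TraP-VQA. NOT the authors' implementation: names, backbones, dimensions,
# and the classification head are assumptions made for illustration only.
import torch
import torch.nn as nn
import torchvision.models as tvm
from transformers import AutoModel, AutoTokenizer


class VisionLanguageVQASketch(nn.Module):
    def __init__(self, num_answers: int, d_model: int = 256,
                 lm_name: str = "dmis-lab/biobert-base-cased-v1.1"):
        super().__init__()
        # Vision branch: pre-trained CNN, spatial feature map kept as a token grid.
        cnn = tvm.resnet50(weights=tvm.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])   # (B, 2048, H', W')
        self.vis_proj = nn.Linear(2048, d_model)

        # Language branch: domain-specific LM (BioBERT here as a stand-in).
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)
        self.txt_proj = nn.Linear(self.lm.config.hidden_size, d_model)

        # Transformer encoder over image tokens; decoder fuses question tokens
        # with the encoded image via cross-attention.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, images: torch.Tensor, questions: list) -> torch.Tensor:
        # Image tokens: flatten the CNN feature map into a sequence.
        feat = self.cnn(images)                                      # (B, 2048, H', W')
        vis_tokens = self.vis_proj(feat.flatten(2).transpose(1, 2))  # (B, H'*W', d)

        # Question tokens from the language model.
        enc = self.tokenizer(questions, return_tensors="pt", padding=True, truncation=True)
        txt_tokens = self.txt_proj(self.lm(**enc).last_hidden_state)  # (B, T, d)

        # Encode vision tokens, then let question tokens attend to them.
        memory = self.encoder(vis_tokens)
        fused = self.decoder(txt_tokens, memory)                     # (B, T, d)

        # Pool the fused sequence and classify over the answer vocabulary.
        return self.classifier(fused.mean(dim=1))                    # (B, num_answers)


# Usage (hypothetical answer-vocabulary size):
# model = VisionLanguageVQASketch(num_answers=4092)
# logits = model(images, ["what does this image show?"])

Attention weights from the decoder's cross-attention layers are what would be visualized to explain which image regions and question words drove a retrieved answer, which is the interpretability aspect the abstract highlights.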

Item ID: 79232
Item Type: Article (Research - C1)
ISSN: 2168-2208
Copyright Information: © 2022 IEEE.
Funders: Australian Research Council (ARC)
Date Deposited: 05 Jul 2023 22:48
FoR Codes: 46 INFORMATION AND COMPUTING SCIENCES > 4602 Artificial intelligence > 460208 Natural language processing @ 70%
46 INFORMATION AND COMPUTING SCIENCES > 4603 Computer vision and multimedia computation > 460307 Multimodal analysis and synthesis @ 30%
SEO Codes: 22 INFORMATION AND COMMUNICATION SERVICES > 2204 Information systems, technologies and services > 220403 Artificial intelligence @ 100%
