eX-ViT: A novel explainable vision transformer for weakly supervised semantic segmentation

Yu, Lu, Xiang, Wei, Fang, Juan, Chen, Yi-Ping Phoebe, and Chi, Lianhua (2023) eX-ViT: A novel explainable vision transformer for weakly supervised semantic segmentation. Pattern Recognition, 142, 109666.

View at Publisher Website: https://doi.org/10.1016/j.patcog.2023.10...
 
Abstract

Recently, vision transformer models have become prominent for a multitude of vision tasks. These models, however, are usually opaque with weak feature interpretability, making their predictions inaccessible to users. While there has been a surge of interest in post-hoc solutions that explain model decisions, these methods cannot be applied broadly across transformer architectures, as the rules for interpretability must change with the heterogeneity of data and model structures. Moreover, no existing method builds an intrinsically interpretable transformer that can explain its own reasoning process and provide faithful explanations. To close these crucial gaps, we propose a novel vision transformer dubbed the eXplainable Vision Transformer (eX-ViT), an intrinsically interpretable transformer model that jointly discovers robust interpretable features and performs prediction. Specifically, eX-ViT is composed of an Explainable Multi-Head Attention (E-MHA) module and an Attribute-guided Explainer (AttE) module, trained with a self-supervised attribute-guided loss. E-MHA tailors explainable attention weights that learn semantically interpretable representations from tokens with respect to model decisions and with robustness to noise. Meanwhile, AttE encodes discriminative attribute features for the target object through diverse attribute discovery, which constitutes faithful evidence for the model's predictions. Additionally, the self-supervised attribute-guided loss developed for the eX-ViT architecture combines an attribute discriminability mechanism and an attribute diversity mechanism to enhance the quality of the learned representations. As a result, the proposed eX-ViT model produces faithful and robust interpretations over a variety of learned attributes. To verify and evaluate our method, we apply eX-ViT to several weakly supervised semantic segmentation (WSSS) tasks, since these tasks typically rely on accurate visual explanations to extract object localization maps. In particular, the explanation results obtained via eX-ViT are used as pseudo segmentation labels to train WSSS models. Comprehensive experimental results show that the proposed eX-ViT achieves performance comparable to supervised baselines while surpassing state-of-the-art black-box methods in both accuracy and interpretability using only image-level labels.
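The abstract describes the self-supervised attribute-guided loss only at a high level, as the combination of an attribute discriminability mechanism and an attribute diversity mechanism. The following is a minimal PyTorch sketch of one plausible instantiation of such a loss; it is not the paper's formulation. The function name attribute_guided_loss, the tensor shapes, and the weighting hyperparameter lam are all illustrative assumptions.

import torch
import torch.nn.functional as F

def attribute_guided_loss(attr_feats, logits, labels, lam=1.0):
    # attr_feats: (B, K, D) tensor of K attribute vectors per image,
    #             assumed to come from an AttE-like attribute module (K >= 2).
    # logits:     (B, C) classification logits from the model head.
    # labels:     (B,) image-level class labels, the only supervision
    #             available in the WSSS setting described in the abstract.
    # lam:        hypothetical weight balancing the diversity term.

    # Discriminability term: with only image-level labels, the learned
    # attribute features must at least support a correct class prediction.
    disc = F.cross_entropy(logits, labels)

    # Diversity term: penalise pairwise cosine similarity between the
    # K attribute vectors so they capture different object attributes.
    a = F.normalize(attr_feats, dim=-1)            # (B, K, D) unit vectors
    sim = torch.bmm(a, a.transpose(1, 2))          # (B, K, K) cosine matrix
    k = sim.size(1)
    off_diag = sim - torch.eye(k, device=sim.device)
    div = off_diag.pow(2).sum(dim=(1, 2)).mean() / (k * (k - 1))

    return disc + lam * div

# Example usage with random tensors (hypothetical sizes):
B, K, D, C = 4, 8, 192, 20
loss = attribute_guided_loss(
    torch.randn(B, K, D), torch.randn(B, C), torch.randint(0, C, (B,))
)

The design choice here mirrors the abstract's description: the cross-entropy term pushes attributes to be discriminative for the image-level label, while the decorrelation term pushes them to be mutually diverse, which is what would let the resulting attention maps serve as pseudo segmentation labels.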

Item ID: 78905
Item Type: Article (Research - C1)
ISSN: 1873-5142
Keywords: Attention map, Explainable, Transformer, Weakly supervised
Copyright Information: © 2023 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Funders: Australian Research Council (ARC)
Projects and Grants: ARC DP220101634
Date Deposited: 01 Nov 2023 00:38
FoR Codes: 40 ENGINEERING > 4003 Biomedical engineering > 400306 Computational physiology @ 100%
SEO Codes: 28 EXPANDING KNOWLEDGE > 2801 Expanding knowledge > 280112 Expanding knowledge in the health sciences @ 100%
Downloads: Total: 187
Last 12 Months: 115
