Global context information is essential for semantic segmentation of remote sensing (RS) images. Owing to their ability to capture global context and model long-range dependencies, vision transformers have demonstrated strong performance on semantic segmentation. However, their high computational complexity impedes broad application to RS image segmentation in resource-constrained environments. To address this challenge, we propose multi-faceted adaptive token pruning (MATP), which reduces computational cost while maintaining relatively high accuracy. MATP is designed to prune tokens that are already well learned and are not closely related to other tokens. To quantify these two properties, MATP employs multi-faceted scores: entropy, to evaluate the learning progression of tokens, and attention weight, to assess token correlations. Specifically, MATP uses an adaptive criterion for each score that is automatically adjusted based on the features of the specific input. A token is pruned only when both criteria are satisfied. Overall, MATP facilitates the use of vision transformers in resource-constrained environments. Experiments conducted on three widely used datasets show that MATP reduces computational cost by about 67–70% with about 3–6% accuracy degradation, achieving a superior trade-off between accuracy and computational cost compared with the state of the art.
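The pruning rule described above (prune a token only when both the entropy criterion and the attention criterion hold) can be illustrated with a minimal sketch. This is not the authors' implementation. Several details are assumptions made for illustration: entropy is computed over a per-token class distribution, low entropy is read as "well learned", attention received by a token is averaged over all queries, and each adaptive threshold is simply the mean of that score over the current input. The function names `token_entropy`, `received_attention`, and `keep_mask` are hypothetical.

```python
import math


def token_entropy(logits):
    """Shannon entropy of the softmax distribution for one token (assumed score)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def received_attention(attn):
    """Mean attention each token receives; attn[i][j] is query i's weight on key j."""
    n = len(attn)
    return [sum(row[j] for row in attn) / n for j in range(n)]


def keep_mask(token_logits, attn):
    """Return True for tokens to keep, False for tokens to prune."""
    ent = [token_entropy(l) for l in token_logits]
    att = received_attention(attn)
    # Assumption: the adaptive thresholds are the per-input means of each score.
    ent_thr = sum(ent) / len(ent)
    att_thr = sum(att) / len(att)
    # Prune only when BOTH criteria hold: low entropy AND low received attention.
    return [not (e < ent_thr and a < att_thr) for e, a in zip(ent, att)]


# Toy input: 4 tokens, 3 classes.
example_logits = [
    [8.0, 0.0, 0.0],  # confident, receives little attention -> pruned
    [0.1, 0.0, 0.0],  # uncertain -> kept
    [8.0, 0.0, 0.0],  # confident but heavily attended -> kept
    [0.0, 0.1, 0.0],  # uncertain -> kept
]
example_attn = [
    [0.1, 0.3, 0.5, 0.1],
    [0.1, 0.2, 0.5, 0.2],
    [0.1, 0.2, 0.5, 0.2],
    [0.1, 0.3, 0.4, 0.2],
]
example_mask = keep_mask(example_logits, example_attn)
print(example_mask)  # [False, True, True, True]
```

In this toy input, tokens 0 and 2 are equally confident, but only token 0 is pruned because token 2 receives strong attention from the other tokens. Requiring both criteria makes the rule more conservative than thresholding either score alone.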
Zhang et al. (Fri,) studied this question.