Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction on multi-modal inputs, including visible, depth, and thermal images, using only a few annotated samples. However, some existing methods treat the three modalities equally and do not account for the inherent differences among them. Moreover, object sizes vary greatly, and cutting-edge matching paradigms fail to establish an effective support-query connection. We therefore propose a novel scale-invariant feature matching network (SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, for visible-depth-thermal (V-D-T) few-shot semantic segmentation. First, in the encoder, after extracting multi-level initial features, we fuse the RGB and thermal features at each level, yielding the support features and the query features. Second, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module explores the correlation between the support feature and the query feature at each level; within it, pixel-to-patch pooling (PTP-pool) units are designed to build scale-invariant relationships, generating a coarse mask for the query image. Third, in the feature elevation block, we employ a prior-related fusion (PF) module to integrate the depth image with the coarse mask via cross-attention, yielding an enhanced coarse prediction, which is further aggregated in a bottom-up manner. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object internal regions and spatial details, and generate the final segmentation results via conventional convolution layers. Extensive experiments on the VDT-2048-5i dataset show that our model outperforms state-of-the-art methods by a large margin.
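To make the pixel-to-patch matching idea more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "pixel-to-patch" means each query pixel attends to support patches obtained by average pooling at several window sizes, and that the pooled support mask serves as the attention values to produce a coarse query mask. The function names (`avg_pool`, `pixel_to_patch_attention`), the scales `(1, 2, 4)`, and the use of the mask as values are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling over an (H, W, C) array."""
    h, w, c = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_to_patch_attention(query_feat, support_feat, support_mask,
                             scales=(1, 2, 4)):
    """Hypothetical pixel-to-patch cross-attention.

    query_feat:   (H, W, C) query features (one level)
    support_feat: (H, W, C) support features (same level)
    support_mask: (H, W) support foreground mask with values in [0, 1]
    Returns an (H, W) coarse mask estimate for the query image.
    """
    h, w, c = query_feat.shape
    q = query_feat.reshape(h * w, c)  # one attention query per pixel

    keys, vals = [], []
    for k in scales:
        # Assumption: support patches at several sizes act as keys, so a
        # query pixel can match objects of different scales.
        keys.append(avg_pool(support_feat, k).reshape(-1, c))
        # Assumption: the pooled mask (foreground ratio per patch) acts as
        # the value carried over to the query.
        vals.append(avg_pool(support_mask[..., None], k).reshape(-1))
    keys = np.concatenate(keys, axis=0)  # (num_patches, C)
    vals = np.concatenate(vals, axis=0)  # (num_patches,)

    # Scaled dot-product attention from query pixels to support patches.
    attn = softmax(q @ keys.T / np.sqrt(c), axis=-1)
    return (attn @ vals).reshape(h, w)

# Example with random features standing in for fused RGB-thermal features.
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 8, 16))
support = rng.standard_normal((8, 8, 16))
mask = (rng.random((8, 8)) > 0.5).astype(np.float64)
coarse_mask = pixel_to_patch_attention(query, support, mask)
```

Because the attention weights for each query pixel sum to one and the pooled mask values lie in [0, 1], the resulting coarse mask also lies in [0, 1]. In the pipeline described above, such a mask would then be refined by the feature elevation block and the decoder.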
Zhou et al. studied this question.