What question did this study set out to answer?

The aim is to improve few-shot semantic segmentation using visible, depth, and thermal images by incorporating scale-invariant features.

February 22, 2026

Scale-invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation

Key Points

Developed a scale-invariant feature matching network (SFM-Net) consisting of an encoder, a feature matching block, a feature elevation block, and a decoder.
Employed pixel-to-patch cross-attention (PTPCA) with pixel-to-patch pooling (PTP-pool) units to build scale-invariant relations between support and query features.
Used a prior-related fusion (PF) module to integrate the depth image with the coarse mask via cross-attention, enhancing the coarse prediction.
Used a reverse attention (RA) unit in the decoder to refine the segmentation output.
The proposed method outperforms state-of-the-art few-shot semantic segmentation methods by a large margin on the VDT-2048-5i dataset.
Achieved improved accuracy in generating segmentation masks for multi-modal datasets.
Fused the RGB and thermal features at each encoder level to form the support and query features.
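
To make the pixel-to-patch idea more concrete, here is a minimal NumPy sketch in which every query pixel attends to support tokens pooled at several patch sizes. It is an illustration under stated assumptions, not the authors' PTPCA module: the average pooling, the patch sizes, the absence of learned projections and of the support mask, and the names `avg_pool_patches` and `pixel_to_patch_attention` are all hypothetical choices made for this example.

```python
import numpy as np


def avg_pool_patches(feat, patch):
    """Average-pool an (H, W, C) feature map into non-overlapping patch tokens."""
    h, w, c = feat.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    # Split each spatial axis into (blocks, patch) and average inside each block.
    pooled = feat.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c)  # (num_patches, C)


def pixel_to_patch_attention(query_feat, support_feat, patch_sizes=(1, 2, 4)):
    """Each query pixel attends to support patch tokens pooled at several scales.

    Assumption: plain scaled dot-product attention with no learned weights.
    """
    h, w, c = query_feat.shape
    q = query_feat.reshape(-1, c)  # (H*W, C) query pixels
    # Support tokens from every patch size are concatenated into one token set.
    tokens = np.concatenate(
        [avg_pool_patches(support_feat, p) for p in patch_sizes], axis=0
    )
    scores = q @ tokens.T / np.sqrt(c)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over tokens
    out = weights @ tokens  # (H*W, C) support information gathered per pixel
    return out.reshape(h, w, c), weights
```

Because tokens pooled at different patch sizes sit in the same attention set, a query pixel can match a support object whether that object covers a few pixels or a large region, which is the intuition behind a scale-invariant relation.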

Abstract

Multi-modal few-shot semantic segmentation (FSS) aims to perform dense prediction from images of multiple modalities, including visible, depth, and thermal images, using only a few annotated samples. However, some existing efforts treat the three modalities equally and do not account for the inherent differences among them. Moreover, objects vary greatly in size, and cutting-edge matching paradigms fail to establish an effective support-query connection. Therefore, we propose a novel scale-invariant feature matching network (SFM-Net), which consists of an encoder, a feature matching block, a feature elevation block, and a decoder, to conduct visible-depth-thermal (V-D-T) few-shot semantic segmentation. First, in the encoder, after extracting multi-level initial features, we fuse the RGB feature and the thermal feature at each level, yielding the support features and the query features. Second, in the feature matching block, a pixel-to-patch cross-attention (PTPCA) module explores the correlation between the support feature and the query feature at each level; within it, pixel-to-patch pooling (PTP-pool) units build scale-invariant relationships, generating the coarse mask for the query image. Third, in the feature elevation block, we employ the prior-related fusion (PF) module to integrate the depth image with the coarse mask via a cross-attention mechanism, yielding an enhanced coarse prediction, which is further aggregated in a bottom-up way. Finally, in the decoder, we deploy a reverse attention (RA) unit to gradually explore the complementarity between object internal regions and spatial details, and generate the final segmentation results via conventional convolution layers. Extensive experiments on the VDT-2048-5i dataset show that our model outperforms state-of-the-art methods by a large margin.
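
The abstract does not spell out how the RA unit is computed. As a rough illustration, the sketch below uses one common formulation of reverse attention, in which decoder features are re-weighted by one minus the sigmoid of the coarse prediction so that later layers focus on regions the coarse mask has not yet captured. Whether SFM-Net uses exactly this form is an assumption, and the function name `reverse_attention` and the tensor layout are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def reverse_attention(features, coarse_logits):
    """Re-weight (H, W, C) features by the complement of a coarse (H, W) mask.

    Assumption: weight = 1 - sigmoid(coarse_logits). Pixels the coarse mask is
    already confident about are suppressed; uncertain or missed pixels
    (often boundaries and fine details) are emphasized.
    """
    reverse_weight = 1.0 - sigmoid(coarse_logits)  # (H, W), values in (0, 1)
    return features * reverse_weight[..., None]    # broadcast over channels
```

In a decoder built this way, the re-weighted features would typically pass through convolution layers to predict a residual that refines the coarse mask; that step is omitted here to keep the sketch short.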
