August 27, 2024 | Open Access

Applying ViT in Generalized Few-shot Semantic Segmentation

Abstract

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-5ⁱ, substantially outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model paired with a large-scale ViT decoder overfits easily.
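
To make the "ViT backbone plus linear classifier" structure concrete, here is a minimal, dependency-free sketch of a linear segmentation head applied to patch tokens. It is not the authors' implementation, which the abstract does not describe. The names `linear_head` and `to_pixel_mask`, the toy weights, and the three-class layout (background, one base class, one novel class) are all assumptions made for illustration. Nearest-neighbour upsampling is used here only for simplicity; real pipelines more commonly interpolate the logits.

```python
from typing import List

Vector = List[float]


def linear_head(tokens: List[Vector], weight: List[Vector], bias: Vector) -> List[int]:
    """Classify each patch token independently with argmax(W @ x + b).

    tokens: one feature vector per patch, as a frozen backbone might emit.
    weight: one row per class (hypothetical: background, base, novel).
    """
    labels = []
    for x in tokens:
        logits = [
            sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weight, bias)
        ]
        # Index of the largest logit is the predicted class for this patch.
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels


def to_pixel_mask(labels: List[int], grid_h: int, grid_w: int, patch: int) -> List[List[int]]:
    """Nearest-neighbour upsampling of patch labels to a pixel-level mask."""
    mask = []
    for row in range(grid_h * patch):
        r = row // patch
        mask.append([labels[r * grid_w + col // patch] for col in range(grid_w * patch)])
    return mask


# Toy input: a 2x2 grid of 2-dimensional patch tokens (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weight = [[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]  # rows: background, base, novel
bias = [0.5, 0.0, 0.0]

labels = linear_head(tokens, weight, bias)
mask = to_pixel_mask(labels, grid_h=2, grid_w=2, patch=2)
```

In this sketch a single weight matrix scores base and novel classes together, matching the generalized setting in which both kinds of class are predicted at test time. The head has few parameters compared with the backbone, which fits the abstract's caveat that larger ViT decoders overfit more easily.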

Cite This Study

Geng et al. (2024) studied this question.

synapsesocial.com/papers/68e5adbeb6db64358754702a https://doi.org/10.48550/arxiv.2408.14957
