What question did this study set out to answer?

This research aims to enhance image segmentation accuracy by leveraging text guidance and cross-modal feature fusion.

March 21, 2026 | Open Access

Interactive text-guided image segmentation via vision Mamba and large language models

Key Points

Proposes an H-type Bidirectional Alignment Network for text-guided image segmentation.
Uses a 12-layer, four-stage Vision Mamba as the visual encoder, adding local convolution and residual structures to preserve boundary and detail information.
Bases the text encoder on the Qwen model, freezing the bottom layers and fine-tuning the upper layers to adapt to referring expressions.
Uses Q-Former in the cross-modal alignment module to construct learnable query vectors and apply bidirectional supervision.
Employs a multi-scale decoder that refines segmentation results step by step through an interactive iterative mechanism.
Reports performance improvements over existing approaches on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
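
As a rough illustration of the learnable-query idea in the alignment module, the sketch below lets a small set of query vectors attend over image features with scaled dot-product cross-attention. It is not the paper's implementation: the function name `cross_attend`, the toy dimensions, and the fixed query values (which would be trained parameters in practice) are all assumptions, and a full Q-Former also contains self-attention and feed-forward layers that are omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query attends over all feature vectors (keys = values = features).

    Returns (outputs, weights): one output vector and one attention
    distribution per query. Hypothetical helper, for illustration only.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot product between this query and every feature vector.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        attn = softmax(scores)
        # Attention-weighted average of the feature vectors.
        out = [sum(a * f[j] for a, f in zip(attn, features)) for j in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights

# Toy inputs (assumed): 2 queries, 3 image patch features, dimension 4.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
features = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
outputs, weights = cross_attend(queries, features)
```

Each row of `weights` is a probability distribution over image patches. A distribution of this kind is what a backward path could be trained to reconstruct, although the exact loss used in the paper is not stated here.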

Abstract

This paper proposes an H-type Bidirectional Alignment Network for text-guided image segmentation, designed for efficient and accurate cross-modal feature fusion. The visual encoder is a 12-layer Vision Mamba with four stages. It captures long-range dependencies with a Selective State Space Model (SSM) and uses local convolution and residual structures to enhance boundary and detail information. This design reduces computational complexity when processing high-resolution images. The text encoder is based on the Qwen model and freezes the bottom layers while fine-tuning the upper layers, with the aim of obtaining semantic representations and adapting to referring expressions. The cross-modal alignment module uses Q-Former to construct learnable query vectors: the forward path performs the text-to-image segmentation task, while the backward path reconstructs the attention distribution of the image over the text. Together, the two paths form a bidirectional supervision mechanism that constrains cross-modal consistency. The multi-scale decoder fuses visual and aligned features and lets the model refine segmentation results step by step through an interactive iterative mechanism. Experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets verify the effectiveness of the proposed method, which shows performance improvements over existing approaches.
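
To make the complexity remark concrete, here is a minimal sketch of an input-dependent (selective) linear recurrence with a scalar hidden state, in the spirit of a selective state space model. It is a simplification under stated assumptions, not the paper's Vision Mamba: the function `selective_scan`, the sigmoid gating, and the toy weights are hypothetical. The point is that each token updates the state once, so the cost grows linearly with sequence length, whereas full self-attention compares every pair of tokens.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def selective_scan(xs, w_gate=1.0, w_in=1.0, w_out=1.0):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * w_in * x_t, y_t = w_out * h_t.

    The decay a_t depends on the input x_t (the "selective" part), so the
    recurrence decides per token how much past state to keep.
    All weights are illustrative scalars, not learned parameters.
    """
    h = 0.0
    ys = []
    for x in xs:                      # single pass: work is linear in len(xs)
        a = sigmoid(-w_gate * x)      # large input => small a => state is overwritten
        h = a * h + (1.0 - a) * w_in * x
        ys.append(w_out * h)
    return ys

# Toy sequence (assumed): a single strong token surrounded by zeros.
ys = selective_scan([0.0, 0.0, 5.0, 0.0])
```

In this toy run the strong token at position 2 almost fully overwrites the state, and the following zero token halves it, which shows how the gate controls what is carried forward.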

Authors

Yao Meng

Haochen Sun

Wei Jiang

Journals

Scientific Reports

Institutions

North China University of Water Resources and Electric Power
