This paper proposes an H-type Bidirectional Alignment Network for text-guided image segmentation that enables efficient and accurate cross-modal feature fusion. The visual encoder is a 12-layer Vision Mamba organized into four stages. It captures long-range dependencies with a Selective State Space Model (SSM), which reduces computational complexity on high-resolution images, and uses local convolution and residual structures to enhance boundary and detail information. The text encoder is based on the Qwen model; its bottom layers are frozen and its upper layers are fine-tuned, so that it keeps its semantic representations while adapting to referring expressions. The cross-modal alignment module uses a Q-Former to construct learnable query vectors: the forward path performs the text-to-image segmentation task, while the backward path reconstructs the attention distribution of the image over the text. Together, the two paths provide bidirectional supervision that constrains cross-modal consistency. The multi-scale decoder fuses the visual and aligned features and lets the model refine segmentation results step by step through an interactive iterative mechanism. Experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets verify the effectiveness of the proposed method, which improves on existing approaches.
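The bidirectional supervision described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the mean-pooling of queries, the mask-weighted pooling of the backward attention, and the KL-divergence consistency term are all assumptions chosen for illustration, and the names `cross_attention` and `bidirectional_alignment` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention; returns (output, weights)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights


def bidirectional_alignment(visual, text, queries, eps=1e-8):
    """visual: (N, d) pixel features, text: (T, d) token features,
    queries: (K, d) learnable query vectors (random here, learned in practice)."""
    # Forward path (assumed form): queries read the text, then score every pixel.
    cond_queries, q2t = cross_attention(queries, text, text)  # (K, d), (K, T)
    mask_logits = visual @ cond_queries.mean(axis=0)          # (N,)
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))            # (N,)

    # Backward path (assumed form): every pixel attends over the text tokens.
    _, v2t = cross_attention(visual, text, text)              # (N, T)

    # Pool the image-over-text attention, weighting pixels by the predicted mask.
    pixel_w = mask_prob / (mask_prob.sum() + eps)
    backward_dist = pixel_w @ v2t                             # (T,)
    forward_dist = q2t.mean(axis=0)                           # (T,)

    # Consistency term (assumption): KL(forward || backward) over text tokens.
    kl = np.sum(forward_dist * (np.log(forward_dist + eps) - np.log(backward_dist + eps)))
    return mask_prob, forward_dist, backward_dist, kl


rng = np.random.default_rng(0)
N, T, K, d = 64, 7, 4, 16  # pixels, text tokens, queries, feature width
visual = rng.standard_normal((N, d))
text = rng.standard_normal((T, d))
queries = rng.standard_normal((K, d))
mask_prob, fwd, bwd, kl = bidirectional_alignment(visual, text, queries)
print(mask_prob.shape, fwd.shape, bwd.shape)
```

In a trained model the consistency term would be added to the segmentation loss, so that the text tokens the queries rely on in the forward path agree with the tokens the segmented region attends to in the backward path.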
Yao Meng
Haochen Sun
Wei Jiang
Scientific Reports
North China University of Water Resources and Electric Power
Meng et al. studied this question.
www.synapsesocial.com/papers/69be38b56e48c4981c679480 — DOI: https://doi.org/10.1038/s41598-026-43841-w