Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. However, existing segmentation frameworks are often computationally expensive and fail to retain fine-grained structural details, especially along intricate anatomical boundaries such as blood vessels and tumor margins. To overcome these limitations, we propose VMMedSAM-X, a computationally efficient medical image segmentation framework that incorporates structured state space modeling into the Medical Segment Anything Model (MedSAM) architecture. The proposed method adopts a state-enhanced encoder that combines extended long short-term memory (xLSTM) with two-dimensional selective scanning (SS2D) and a dual-path cross-attention mechanism to enhance long-range dependency modeling while maintaining linear computational complexity. Experiments conducted on the 1024×1024 ACDC cardiac MRI dataset show that the proposed encoder reduces floating-point operations from 369.44 G to 17.36 G (roughly a 21× reduction) and achieves a 2.4× improvement in inference speed compared with the Vision Transformer (ViT)-based encoder. Additional evaluations on the SegTHOR and MSD-Lung datasets show consistent improvements in Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) over the MedSAM and Vision Mamba U-Net (VM-UNet) baselines. These results indicate that the proposed framework provides an effective and computationally efficient solution for high-resolution medical image segmentation.
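For readers less familiar with the two reported metrics, the following is a minimal sketch of how DSC and IoU are commonly computed for a pair of binary masks, together with the arithmetic behind the reported FLOP reduction. It is not the evaluation code used in this work: the function name `dice_and_iou`, the nested-list mask format, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Flatten both masks and coerce every pixel to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_sum = sum(flat_pred)
    target_sum = sum(flat_target)
    union = pred_sum + target_sum - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 foreground pixels each, 2 of them overlapping.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)

# Encoder FLOP figures quoted in the abstract (in GFLOPs).
vit_gflops = 369.44
proposed_gflops = 17.36
flop_reduction_factor = vit_gflops / proposed_gflops
```

On the toy masks the intersection is 2 pixels and the union is 4, so IoU is 0.5 and DSC is 2/3, consistent with the identity DSC = 2·IoU / (1 + IoU). The quoted FLOP figures correspond to a reduction factor of about 21.3.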
Zhang et al. (Fri,) studied this question.