What question did this study set out to answer?

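This study set out to make promptable medical image segmentation more computationally efficient while better preserving fine structural detail along anatomical boundaries such as blood vessels and tumor margins.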

May 7, 2026 | Open Access

VMMedSAM-X: A State-Enhanced Dual-Branch Encoder for Efficient Promptable Medical Image Segmentation

Key Points

Introduces the VMMedSAM-X framework, which incorporates structured state space modeling into the MedSAM architecture.
Combines extended long short-term memory (xLSTM) with two-dimensional selective scanning (SS2D) in a state-enhanced encoder.
Adds a dual-path cross-attention mechanism to strengthen long-range dependency modeling while keeping computational complexity linear.
Reduces encoder floating-point operations from 369.44 G to 17.36 G on the 1024×1024 ACDC cardiac MRI dataset.
Achieves a 2.4× improvement in inference speed over the Vision Transformer (ViT)-based encoder.
Improves Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) over the MedSAM and Vision Mamba U-Net (VM-UNet) baselines on the SegTHOR and MSD-Lung datasets.
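
To put the reported operation counts in proportion, the short Python sketch below derives the reduction factor and the percentage saved from the two figures quoted above (369.44 G and 17.36 G). The function name `reduction_summary` is a hypothetical helper introduced only for this illustration and is not part of the paper.

```python
def reduction_summary(baseline_gflops: float, proposed_gflops: float) -> tuple[float, float]:
    """Return (reduction factor, fraction of operations saved)."""
    ratio = baseline_gflops / proposed_gflops
    saved = 1.0 - proposed_gflops / baseline_gflops
    return ratio, saved


# Figures quoted in the summary: ViT-based encoder vs. the proposed encoder.
ratio, saved = reduction_summary(369.44, 17.36)
print(f"{ratio:.1f}x fewer FLOPs, {saved:.1%} reduction")
# prints "21.3x fewer FLOPs, 95.3% reduction"
```

The reported 2.4× speedup is much smaller than the roughly 21× drop in operations. A gap of this kind is common, since measured latency also depends on memory traffic and kernel efficiency, not only on operation counts.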

Abstract

Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. However, existing segmentation frameworks frequently exhibit high computational complexity and often fail to retain fine-grained structural details, especially along intricate anatomical boundaries such as blood vessels and tumor margins. To overcome these limitations, we propose VMMedSAM-X, a computationally efficient medical image segmentation framework that incorporates structured state space modeling into the Medical Segment Anything Model (MedSAM) architecture. The proposed method adopts a state-enhanced encoder that combines extended long short-term memory (xLSTM) with two-dimensional selective scanning (SS2D) and a dual-path cross-attention mechanism to enhance long-range dependency modeling while maintaining linear computational complexity. Experiments conducted on the 1024×1024 ACDC cardiac MRI dataset show that the proposed encoder reduces floating-point operations from 369.44 G to 17.36 G and achieves a 2.4× improvement in inference speed compared with the Vision Transformer (ViT)-based encoder. Additional evaluations on the SegTHOR and MSD-Lung datasets demonstrate consistent improvements in Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) over the MedSAM and Vision Mamba U-Net (VM-UNet) baselines. These results indicate that the proposed framework provides an effective and computationally efficient solution for high-resolution medical image segmentation.
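
For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal Python sketch of their standard definitions for binary masks: DSC = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. The function `dice_and_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration. The paper's own evaluation code may differ, for example in how it averages over classes or volumes.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two binary masks given as equally sized nested lists of 0/1."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: treated as perfect agreement (an assumed convention).
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 3x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 1],
          [0, 0, 0]]

dice, iou = dice_and_iou(pred, target)
print(f"DSC = {dice:.3f}, IoU = {iou:.3f}")
# prints "DSC = 0.667, IoU = 0.500"
```

The two metrics are monotonically related for a single mask pair (DSC = 2·IoU / (1 + IoU)), so they rank individual predictions identically. Averages over a dataset can still differ, which is one reason both are commonly reported.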
