What question did this study set out to answer?

April 20, 2026 | Open Access

Learning Conceptual Text Prompts from Visual Regions of Interest for Medical Image Segmentation

Key Points

The research aims to develop a method for generating conceptual text prompts from visual regions of interest to enhance medical image segmentation.
Proposed a method to learn text prompts directly from visual ROIs.
Utilized a large multimodal model to extract textual conceptual attributes.
Developed a text latent space transformation module for generating fine-grained pseudo-text prompts.
Implemented self-adding noise knowledge distillation to improve text-guided inference.
Achieved superior segmentation accuracy compared to state-of-the-art vision-language segmentation models.
Demonstrated effective reduction of error accumulation during the segmentation process.

Abstract

Vision–language segmentation models (VLSMs) are effective in medical image segmentation tasks. However, a major limitation of these models is their dependence on manually crafted textual inputs. Studies have used visual question answering to semiautomatically generate textual information. However, these methods encounter challenges such as error accumulation. Herein, we propose a method to learn conceptual text prompts directly from visual regions of interest (ROIs) for facilitating medical image segmentation. We extracted textual conceptual attributes from ROIs using a large multimodal model to derive coarse real-text prompts. A text latent space transformation module accepted the ROI images as input for generating fine-grained pseudo-text prompts to compensate for the lack of image detail perception in the abovementioned real-text prompts. These prompts were encoded into a unified text embedding. Thereafter, we applied a self-adding noise knowledge distillation method to transfer the knowledge from text embedding to the class token of the image encoder, enabling direct text-guided inference during testing while reducing error accumulation. Our approach minimized the need for manual prompt design by leveraging explicit discrete and implicit continuous text prompts to effectively guide visual segmentation. Extensive evaluation across 13 medical image segmentation datasets demonstrated that our model outperformed the state-of-the-art VLSMs and vision-based segmentation models, exhibiting superior segmentation accuracy.
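
The distillation step described in the abstract can be illustrated with a small sketch. The abstract does not specify where the noise is injected, how it is scaled, or which loss is used, so everything here is an assumption for illustration only: Gaussian noise is added to the text embedding (the teacher), and the image encoder's class token (the student) is pulled toward it with a cosine-distance loss. The function names `add_gaussian_noise`, `cosine_similarity`, and `distillation_loss`, and the parameter `sigma`, are hypothetical and do not come from the paper.

```python
import math
import random


def add_gaussian_noise(vec, sigma, rng):
    """Return a copy of vec with zero-mean Gaussian noise (std sigma) added."""
    return [v + rng.gauss(0.0, sigma) for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(class_token, text_embedding, sigma=0.1, rng=None):
    """Cosine distance between the student class token and a noised teacher.

    Assumption: noise is applied to the teacher text embedding. The paper's
    actual noise placement and loss function may differ.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    noisy_teacher = add_gaussian_noise(text_embedding, sigma, rng)
    return 1.0 - cosine_similarity(class_token, noisy_teacher)


if __name__ == "__main__":
    # Toy vectors standing in for a unified text embedding and a class token.
    text_embedding = [0.2, -0.5, 0.9, 0.1]
    class_token = [0.1, -0.4, 1.0, 0.0]
    print(distillation_loss(class_token, text_embedding, sigma=0.05))
```

In a training loop, a loss of this kind would be minimized with respect to the image encoder so that, at test time, the class token can stand in for the text embedding and no text prompt is needed. That matches the abstract's stated goal of text-guided inference without manual prompts, though the details above remain a guess.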

Cite This Study

He et al. studied this question.

synapsesocial.com/papers/69e5c22d03c29399140289aa https://doi.org/10.1016/j.eng.2026.04.006