The safety of autonomous driving systems depends on their ability to perceive the surrounding environment accurately, and instance segmentation, which identifies individual objects, is a core enabling technology for that perception. Recent studies using vision language models (VLMs), such as LISA (Large Language Instructed Segmentation Assistant), which segments target objects based on natural language instructions, have demonstrated the potential to overcome the limitation of conventional methods that recognize only a fixed set of classes. However, there remains room for improvement in how effectively these models exploit the semantic understanding capabilities of VLMs for segmentation. Motivated by this observation, this study proposes a method that enhances VLM training by incorporating an auxiliary caption loss into the LISA architecture, with the aim of improving instance segmentation performance in complex scenarios such as autonomous driving. The proposed approach encourages the model to learn from both segmentation instructions and captions that capture the image's global context, enabling the VLM to establish deeper associations between visual features and linguistic semantics. The effectiveness of the proposed method is validated experimentally.
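The idea of adding an auxiliary caption loss to a segmentation objective can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the mask loss is taken to be binary cross-entropy plus Dice, the instruction and caption losses are taken to be token-level cross-entropy, and the function names and weights (`w_text`, `w_mask`, `w_caption`) are hypothetical rather than the authors' settings.

```python
import math

def token_cross_entropy(logits, targets):
    """Mean cross-entropy over a token sequence.
    logits: list of per-token score lists; targets: list of target token ids."""
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)

def bce_loss(probs, mask, eps=1e-7):
    """Mean binary cross-entropy between predicted mask probabilities and a 0/1 mask."""
    total = 0.0
    for p, y in zip(probs, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(mask)

def dice_loss(probs, mask, eps=1e-6):
    """Dice loss: 1 minus the soft overlap between prediction and ground truth."""
    intersection = sum(p * y for p, y in zip(probs, mask))
    denominator = sum(probs) + sum(mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def total_loss(instr_logits, instr_targets, cap_logits, cap_targets,
               mask_probs, mask_gt,
               w_text=1.0, w_mask=1.0, w_caption=0.5):
    """Combine the segmentation objective with an auxiliary caption term.
    All weights are hypothetical placeholders, not values from the paper."""
    text = token_cross_entropy(instr_logits, instr_targets)   # instruction/answer tokens
    mask = bce_loss(mask_probs, mask_gt) + dice_loss(mask_probs, mask_gt)
    caption = token_cross_entropy(cap_logits, cap_targets)    # auxiliary caption tokens
    total = w_text * text + w_mask * mask + w_caption * caption
    return {"text": text, "mask": mask, "caption": caption, "total": total}

# Toy inputs: 2 instruction tokens, 2 caption tokens, a 4-pixel flattened mask.
losses = total_loss(
    instr_logits=[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]], instr_targets=[0, 1],
    cap_logits=[[0.5, 0.5, 0.5], [0.1, 0.2, 2.0]], cap_targets=[1, 2],
    mask_probs=[0.9, 0.8, 0.2, 0.1], mask_gt=[1, 1, 0, 0],
)
print(losses)
```

Setting `w_caption=0.0` in this sketch recovers the objective without the auxiliary term, which is one simple way to compare training with and without caption supervision.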
Choi et al. studied this question.