What question did this study set out to answer?

The study aims to enhance instance segmentation performance in autonomous driving using vision language models.

June 11, 2026

Driving Environment Instance Segmentation Using a Vision Language Model (VLM)

Key Points

Aims to enhance instance segmentation performance in autonomous driving using vision language models.
Proposes an auxiliary caption loss for the LISA architecture.
Focuses on improving model training for language-guided image segmentation.
Validates the proposed method through experimental studies.
Targets improved instance segmentation accuracy in complex driving scenarios.
Encourages deeper associations between visual features and linguistic semantics.

Abstract

The safety of autonomous driving systems depends on their ability to perceive the surrounding environment accurately. Instance segmentation, which identifies individual objects, is a core enabling technology for this perception. Recently, studies using vision language models (VLMs), such as language-guided image segmentation (LISA), which segments target objects based on natural language instructions, have demonstrated the potential to overcome the limitations of conventional methods that recognize only fixed classes. However, there remains room for improvement in how effectively these models exploit the semantic understanding capabilities of VLMs for segmentation tasks. Motivated by this observation, this study proposes a method to enhance VLM training by incorporating an auxiliary caption loss into the LISA architecture, with the aim of improving instance segmentation performance in complex scenarios such as autonomous driving. The proposed approach encourages the model to learn from both segmentation instructions and caption information that captures the image's global context, enabling the VLM to establish deeper associations between visual features and linguistic semantics. The effectiveness of the proposed method is validated experimentally.
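
The abstract does not give the exact form of the auxiliary caption loss, so the following is only a minimal sketch of the general idea: a caption term added to an existing segmentation objective. It assumes the caption loss is a mean token-level cross-entropy and that it is combined with the segmentation loss through a single weight. The names `token_cross_entropy`, `total_loss`, and `caption_weight`, and the default value 0.5, are hypothetical and are not taken from the study.

```python
import math


def token_cross_entropy(logits, target_ids):
    """Mean cross-entropy over caption tokens.

    logits: list of per-token score lists (one list of vocabulary scores per token).
    target_ids: list of ground-truth token indices, same length as logits.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary scores.
        max_score = max(scores)
        log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        # Negative log-probability of the correct token.
        total += log_z - scores[target]
    return total / len(target_ids)


def total_loss(seg_loss, caption_logits, caption_ids, caption_weight=0.5):
    """Segmentation loss plus a weighted auxiliary caption loss.

    seg_loss: scalar segmentation objective, computed elsewhere (assumed given).
    caption_weight: hypothetical weighting factor, not a value from the study.
    """
    caption_loss = token_cross_entropy(caption_logits, caption_ids)
    return seg_loss + caption_weight * caption_loss


# Toy usage: one caption token over a four-word vocabulary with uniform scores.
example = total_loss(1.0, [[0.0, 0.0, 0.0, 0.0]], [2])
```

In a sketch like this, the weight controls how strongly the global-context caption signal influences training relative to the segmentation objective. Setting it to zero recovers the segmentation loss alone.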
