What question did this study set out to answer?

The aim is to improve the visual feature extraction process within visual-language-action models using semantic supervision.

February 2, 2026 · Open Access

Semantically Supervised SeDINO Encoder for Visual–Language–Action Model

Key Points

Developed a semantically supervised visual encoder called SeDINO.
Fused DINO's element localization capabilities with CLIP's semantic information.
Used a multi-layer perceptron (MLP) for aligning semantic and image feature vectors.
Validated SeDINO on six mainstream image datasets and incorporated it into the VLA framework.
SeDINO achieved superior segmentation performance compared to leading models.
Improved alignment of semantic vectors and image features led to better recognition in complex scenes.

Abstract

With the rapid development of multi-modal large models, the Visual–Language–Action (VLA) model has gradually become a new paradigm for autonomous robot operations. The VLA model encodes experimental images and text instructions separately using an image encoder and a text encoder. The encoded multi-modal vector information is then fed into a large language model (LLM) to generate the next action. While they inherit the generalization capabilities of large language models, VLA models often struggle to ensure accuracy and reliability in complex scenes. Some studies have attempted to improve VLA performance by enhancing the fine-tuning process or introducing staged operations; however, these improvements often overlook the stable extraction of important visual features, which are crucial for VLA models. In typical VLA tasks, the instruction text inherently contains semantic information related to image elements. Research has shown that leveraging text supervision for visual feature extraction can enhance feature quality. In this paper, we propose a semantically supervised visual encoder called SeDINO (Semantically Supervised DINO), which efficiently fuses DINO’s element localization capabilities with CLIP’s semantic information. We further employ an MLP (Multi-Layer Perceptron) network to align the semantic vectors output by the CLIP text encoder with the image feature vectors derived from DINO, fully leveraging DINO’s element localization and CLIP’s semantic interaction capabilities. We validate SeDINO on six mainstream image datasets, and it demonstrates superior segmentation performance compared to current leading models. Additionally, we incorporate the proposed SeDINO into the VLA framework, using OpenVLA-7B and DINOv2-base as backbone models, and evaluate it on the LIBERO dataset and real-world scenarios.
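The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-layer bias-free MLP, the ReLU activation, the cosine-similarity scoring, the toy dimensions, and all function names (`make_mlp`, `mlp_forward`, `similarity_map`) are assumptions made for illustration, and random vectors stand in for real CLIP text embeddings and DINO patch features.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build a hypothetical two-layer MLP with random weights and no biases."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        return [[rng.uniform(-scale, scale) for _ in range(n_in)]
                for _ in range(n_out)]

    return {"w1": layer(in_dim, hidden_dim), "w2": layer(hidden_dim, out_dim)}


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mlp_forward(mlp, x):
    """Project a text vector into the image-feature space."""
    hidden = [max(0.0, h) for h in matvec(mlp["w1"], x)]  # ReLU (assumed)
    return matvec(mlp["w2"], hidden)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def similarity_map(text_vec, patch_feats, mlp):
    """Score each image patch against the projected text vector."""
    projected = mlp_forward(mlp, text_vec)
    return [cosine(projected, patch) for patch in patch_feats]


if __name__ == "__main__":
    TEXT_DIM, HIDDEN_DIM, PATCH_DIM, NUM_PATCHES = 8, 16, 6, 4  # toy sizes
    rng = random.Random(42)
    # Stand-ins for a CLIP text embedding and DINO patch features.
    text_vec = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
    patch_feats = [[rng.uniform(-1, 1) for _ in range(PATCH_DIM)]
                   for _ in range(NUM_PATCHES)]
    mlp = make_mlp(TEXT_DIM, HIDDEN_DIM, PATCH_DIM)
    for idx, score in enumerate(similarity_map(text_vec, patch_feats, mlp)):
        print(f"patch {idx}: similarity {score:+.3f}")
```

In a trained system the MLP weights would be learned so that patches depicting the object named in the instruction score highest. Here the weights are random, so the scores only demonstrate the data flow from text vector to per-patch similarity.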
