With the rapid development of multi-modal large models, the Visual–Language–Action (VLA) model has become a new paradigm for autonomous robot operation. A VLA model encodes the input images and the text instruction separately, using an image encoder and a text encoder, and feeds the resulting multi-modal embeddings into a large language model (LLM) to generate the next action. Although VLA models inherit the generalization capabilities of LLMs, they often struggle to remain accurate and reliable in complex scenes. Some studies have tried to improve VLA performance by refining the fine-tuning process or introducing staged operations; however, these improvements often overlook the stable extraction of important visual features, which is crucial for VLA models. In typical VLA tasks, the instruction text inherently carries semantic information about elements in the image, and prior research has shown that text supervision can improve the quality of extracted visual features. In this paper, we propose SeDINO (Semantically Supervised DINO), a semantically supervised visual encoder that efficiently fuses DINO’s element localization capabilities with CLIP’s semantic information. Specifically, an MLP (Multi-Layer Perceptron) network aligns the semantic vectors output by the CLIP text encoder with the image feature vectors derived from DINO. We validate SeDINO on six mainstream image datasets, where it demonstrates superior segmentation performance compared to current leading models. We also incorporate SeDINO into the VLA framework, using OpenVLA-7B and DINOv2-base as backbone models, and evaluate it on the LIBERO dataset and in real-world scenarios.
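To make the alignment step concrete, the following is a minimal sketch of projecting a text embedding into the image-feature space with an MLP and scoring each image patch against it. It is an illustration under stated assumptions, not the SeDINO implementation: random arrays stand in for the CLIP text encoder and DINO outputs, the weights are untrained, and the function names (`mlp_project`, `patch_text_similarity`), the two-layer ReLU design, the dimensions, and the use of cosine similarity are all hypothetical choices made for this example.

```python
import numpy as np


def mlp_project(text_vec, w1, b1, w2, b2):
    """Two-layer MLP mapping a text embedding into the image-feature space."""
    hidden = np.maximum(text_vec @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2


def patch_text_similarity(patch_feats, text_in_image_space):
    """Cosine similarity between each patch feature and the projected text vector."""
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_in_image_space / np.linalg.norm(text_in_image_space)
    return patches @ text


# Placeholder sizes (assumptions for this sketch, not values from the paper).
TEXT_DIM, HIDDEN_DIM, IMAGE_DIM, NUM_PATCHES = 512, 1024, 768, 256

rng = np.random.default_rng(0)

# Stand-ins for real encoder outputs: one text vector, one feature per image patch.
text_vec = rng.standard_normal(TEXT_DIM)
patch_feats = rng.standard_normal((NUM_PATCHES, IMAGE_DIM))

# Untrained MLP parameters; in practice these would be learned.
w1 = rng.standard_normal((TEXT_DIM, HIDDEN_DIM)) * 0.02
b1 = np.zeros(HIDDEN_DIM)
w2 = rng.standard_normal((HIDDEN_DIM, IMAGE_DIM)) * 0.02
b2 = np.zeros(IMAGE_DIM)

projected = mlp_project(text_vec, w1, b1, w2, b2)
similarity = patch_text_similarity(patch_feats, projected)

print(projected.shape, similarity.shape)
```

After projection, the text vector lives in the same space as the patch features, so a per-patch score indicates which image regions are most related to the instruction. How such scores would be used for supervision or fusion is not specified in the text above and is left out of the sketch.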
Tian et al. (Sat,) studied this question.