What question did this study set out to answer?

To develop a framework that improves the generalization of vision-language-action models in robotic manipulation tasks.

April 3, 2026

Open Access

Robust Vision-Language-Action Models via Object-Centric Learning and Distance-Based Chunk Alignment

Key Points

Proposed an object-centric learning framework for enhancing VLA models.
Utilized an instance segmentation model to identify relevant objects for tasks.
Trained a policy using original RGB images and object-focused representations: masked images and object-only crops.
Implemented a distance-based chunk alignment mechanism for smooth transitions between consecutive predicted action segments.
Demonstrated robust performance in both simulation and real hardware settings.
Achieved stable trajectories across various manipulation tasks.
Validated the framework's efficiency in training object-aware robot behaviors.
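
The two object-focused representations named in the key points can be illustrated with a minimal sketch. It assumes an instance mask for the task-relevant object is already available from a segmentation model. The function name `object_focused_views`, the background dimming factor `dim`, and the bounding-box margin `pad` are hypothetical choices made for illustration; the summary does not say how the masked image or the crop are actually constructed.

```python
import numpy as np


def object_focused_views(rgb, mask, dim=0.2, pad=2):
    """Return (masked_image, object_crop) for one camera frame.

    rgb  : (H, W, 3) uint8 image
    mask : (H, W) boolean instance mask of the task-relevant object
    dim  : hypothetical factor used to darken background pixels
    pad  : hypothetical margin, in pixels, around the object's bounding box
    """
    if not mask.any():
        raise ValueError("mask contains no object pixels")

    # Masked image (assumption): keep object pixels, darken everything else.
    weights = np.where(mask[..., None], 1.0, dim)
    masked = (rgb.astype(np.float32) * weights).astype(np.uint8)

    # Object crop (assumption): padded bounding box of the mask, clipped to the frame.
    rows, cols = np.nonzero(mask)
    top = max(int(rows.min()) - pad, 0)
    bottom = min(int(rows.max()) + pad + 1, rgb.shape[0])
    left = max(int(cols.min()) - pad, 0)
    right = min(int(cols.max()) + pad + 1, rgb.shape[1])
    crop = rgb[top:bottom, left:right].copy()

    return masked, crop
```

Under this reading, one demonstration frame yields three training inputs (the original image, `masked`, and `crop`), all paired with the same action label.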

Abstract

Vision–language–action (VLA) models have shown strong potential for enabling robots to interpret goals and perform complex manipulation tasks by integrating perception, language, and control. However, existing VLAs rely heavily on large-scale, diverse demonstration datasets, which are difficult and expensive to collect. When trained with limited data, they often overfit to irrelevant visual cues such as background, lighting, or viewpoint, resulting in weak generalization. To overcome this limitation, we propose a simple yet effective object-centric learning framework for VLA. For each sub-task, the framework leverages an instance segmentation foundation model to identify and track task-relevant objects, and trains the policy on both the original RGB scene and two object-focused representations: (i) a masked image emphasizing the target object and (ii) an object-only crop. These multiple visual inputs share the same action supervision, encouraging the policy to attend to the manipulated object rather than the surrounding context. Furthermore, a distance-based chunk alignment mechanism ensures smooth control transitions between consecutive predicted action segments. Experiments conducted in both simulation and real hardware demonstrate that the proposed method achieves robust performance and stable trajectories across various manipulation tasks, validating its practicality and efficiency in training object-aware robotic behaviors.
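
The abstract does not spell out how the distance-based chunk alignment works, so the following is only one plausible reading, sketched under stated assumptions: when a new action chunk is predicted, execution resumes from the action in that chunk that lies closest (in Euclidean distance) to the last executed action, which avoids a jump at the chunk boundary. The function name `align_chunk` and the choice of Euclidean distance are hypothetical.

```python
import math


def align_chunk(last_action, new_chunk):
    """Trim new_chunk so it starts at the action nearest to last_action.

    last_action : sequence of floats, the most recently executed action
    new_chunk   : list of actions (each a sequence of floats) just predicted
    Assumption: "distance" is the Euclidean distance in action space.
    """
    if not new_chunk:
        raise ValueError("new_chunk is empty")

    # Distance from the last executed action to every candidate in the new chunk.
    distances = [math.dist(last_action, action) for action in new_chunk]

    # Index of the closest candidate; ties resolve to the earliest index.
    start = min(range(len(new_chunk)), key=distances.__getitem__)

    # Skip the leading actions that would cause a discontinuity.
    return new_chunk[start:]
```

For example, with `last_action = [0.5, 0.0]` and a new chunk whose actions move along the first axis from `0.0` to `1.2` in steps of `0.4`, the sketch would resume from the action at `0.4` rather than restarting at `0.0`.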
