Vision–language–action (VLA) models have shown strong potential for enabling robots to interpret goals and perform complex manipulation tasks by integrating perception, language, and control. However, existing VLAs rely heavily on large-scale, diverse demonstration datasets, which are difficult and expensive to collect. When trained with limited data, they often overfit to irrelevant visual cues such as background, lighting, or viewpoint, resulting in weak generalization. To overcome this limitation, we propose a simple yet effective object-centric learning framework for VLA. For each sub-task, the framework leverages an instance segmentation foundation model to identify and track task-relevant objects, and trains the policy on both the original RGB scene and two object-focused representations: (i) a masked image emphasizing the target object and (ii) an object-only crop. These multiple visual inputs share the same action supervision, encouraging the policy to attend to the manipulated object rather than the surrounding context. Furthermore, a distance-based chunk alignment mechanism ensures smooth control transitions between consecutive predicted action segments. Experiments conducted in simulation and on real hardware demonstrate that the proposed method achieves robust performance and stable trajectories across various manipulation tasks, validating its practicality and efficiency in training object-aware robotic behaviors.
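The two mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every function name here is hypothetical. It assumes that the "masked image" zeroes out (or dims) pixels outside the segmentation mask, that the "object-only crop" is the mask's bounding box, and that "distance-based chunk alignment" means starting a newly predicted action chunk at the action closest (in Euclidean distance) to the last executed action. The abstract does not specify these details, so each of them is an assumption.

```python
import math


def mask_bbox(mask):
    """Return (top, left, bottom, right) of the nonzero mask region, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def object_views(image, mask, dim=0.0):
    """Build (masked_image, object_crop) from an RGB image and a binary mask.

    Assumption: background pixels are scaled by `dim` (0.0 zeroes them out).
    """
    masked = [
        [px if m else tuple(int(ch * dim) for ch in px) for px, m in zip(img_row, m_row)]
        for img_row, m_row in zip(image, mask)
    ]
    box = mask_bbox(mask)
    if box is None:
        return masked, None
    top, left, bottom, right = box
    crop = [row[left:right] for row in masked[top:bottom]]
    return masked, crop


def training_samples(image, mask, actions):
    """Pair every available view with the SAME action labels (shared supervision)."""
    masked, crop = object_views(image, mask)
    views = [image, masked] + ([crop] if crop is not None else [])
    return [(view, actions) for view in views]


def align_chunk(last_action, new_chunk):
    """Assumed alignment rule: skip ahead to the action nearest the last executed one."""
    dists = [math.dist(last_action, a) for a in new_chunk]
    start = dists.index(min(dists))
    return new_chunk[start:]
```

In this sketch, `training_samples` yields up to three (view, actions) pairs per frame, so the policy sees the full scene, the masked scene, and the crop under identical action targets. `align_chunk` trims the head of a new chunk so that execution resumes near the robot's current commanded state rather than jumping back to the chunk's first action.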
Sung-Gil Park
Yong-Geon Kim
Seuk-Woo Ryu
Applied Sciences
Korea University
Inha University
LG (United States)
Park et al. studied this question.
www.synapsesocial.com/papers/69cf5e865a333a821460cec7 — DOI: https://doi.org/10.3390/app16073376