January 17, 2026 · Open Access

Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models

Key Points

The aim is to improve dexterous human-robot collaboration using a Vision-Language-Action (VLA) model that needs only minimal language prompting.
The pre-trained Open-VLA model is adapted for task-aware perception with FiLM conditioning on the visual backbones, plus an auxiliary intent head that predicts collaborator hand pose and target cues.
Action-space post-processing has the model predict compact position/rotation deltas and PCA-reduced finger joints, which are then mapped to full commands.
The work uses a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses.
Delta actions are well-behaved on this dataset.
Four principal components explain approximately 96% of hand-joint variance.
Ablations identify action post-processing as the primary performance driver; the auxiliary intent head helps, FiLM gives mixed results, and a directional motion loss reduces performance.
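
The PCA reduction of finger joints mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the joint count (16), the synthetic data, and the helper names `fit_pca`, `encode`, and `decode` are all assumptions made for illustration. The idea is that a policy could predict a 4-dimensional code, which `decode` maps back to full joint commands.

```python
import numpy as np


def fit_pca(joints, n_components=4):
    """Fit PCA on an (N, J) array of joint angles using SVD.

    Returns the per-joint mean, the top principal directions (n_components, J),
    and the fraction of variance those directions explain.
    """
    mean = joints.mean(axis=0)
    centered = joints - mean
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], float(explained[:n_components].sum())


def encode(joints, mean, components):
    """Project full joint vectors onto the low-dimensional PCA code."""
    return (joints - mean) @ components.T


def decode(codes, mean, components):
    """Map low-dimensional codes back to full joint commands."""
    return codes @ components + mean


# Synthetic stand-in for recorded hand data (hypothetical: 500 frames, 16 joints)
# generated from 4 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 16))
joints = latent @ mixing + 0.01 * rng.normal(size=(500, 16))

mean, components, ratio = fit_pca(joints, n_components=4)
codes = encode(joints, mean, components)
recon = decode(codes, mean, components)
print(f"explained variance ratio with 4 components: {ratio:.3f}")
```

On real teleoperation data the explained-variance ratio depends on how correlated the finger joints are; the synthetic data here is built to be nearly rank 4, so it says nothing about the reported ~96% figure.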

Abstract

We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
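
As a rough illustration of the FiLM conditioning named in item (i), the sketch below applies a per-channel scale and shift to a visual feature map, with both computed from a conditioning vector by linear projections. The shapes, the function name `film`, and the identity-style initialization are assumptions for this example; the abstract does not specify where or how the modulation is inserted into the backbones.

```python
import numpy as np


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation.

    features: (C, H, W) visual feature map.
    cond:     (D,) conditioning vector (for example, a task embedding).
    w_gamma, w_beta: (C, D) projection matrices; b_gamma, b_beta: (C,) biases.
    Returns gamma * features + beta, broadcast over the spatial dimensions.
    """
    gamma = w_gamma @ cond + b_gamma  # per-channel scale, shape (C,)
    beta = w_beta @ cond + b_beta     # per-channel shift, shape (C,)
    return gamma[:, None, None] * features + beta[:, None, None]


# Hypothetical sizes: 8 channels, 4x4 spatial grid, 16-dim conditioning vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))
cond = rng.normal(size=16)

# Identity-style initialization (an assumption): gamma = 1 and beta = 0,
# so the modulated features initially equal the input features.
w_gamma = np.zeros((8, 16))
b_gamma = np.ones(8)
w_beta = np.zeros((8, 16))
b_beta = np.zeros(8)

out = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
```

Starting from the identity is one common way to add conditioning to a pre-trained backbone without disturbing its features at the start of fine-tuning; whether the paper does this is not stated.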
