We adapt a pre-trained Vision-Language-Action (VLA) model, OpenVLA, for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to the visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts the collaborator's hand pose and target cues, and (iii) action-space post-processing, in which the model predicts compact position/rotation deltas and PCA-reduced finger joints that are then mapped to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we show that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; the auxiliary intent head helps, FiLM gives mixed results, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We identify "trainer overfitting" to specific demonstrators as the key limitation.
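As a rough illustration of the delta-action idea mentioned in the abstract, the sketch below converts absolute end-effector positions into per-step deltas and integrates them back. It is only one plausible reading: the function names `to_deltas` and `from_deltas` and the toy trajectory are hypothetical, rotation deltas and the PCA step for finger joints are omitted, and nothing here is taken from the authors' implementation.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def to_deltas(positions: Sequence[Vec3]) -> List[List[float]]:
    """Convert absolute positions into per-step position deltas (hypothetical helper)."""
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(positions, positions[1:])
    ]


def from_deltas(start: Vec3, deltas: Sequence[Vec3]) -> List[List[float]]:
    """Integrate deltas from a start position back into absolute positions."""
    out = [list(start)]
    for d in deltas:
        out.append([p + dp for p, dp in zip(out[-1], d)])
    return out


# Toy trajectory (made-up values, chosen to be exactly representable as floats).
trajectory = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.25], [0.75, 0.25, 0.25]]
deltas = to_deltas(trajectory)
recovered = from_deltas(trajectory[0], deltas)
```

Predicting small deltas instead of absolute targets keeps the regression targets in a narrow range around zero, which is one common reason such actions tend to be "well-behaved"; whether that is the authors' exact motivation is an assumption.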
Boshi An
Chenyu Yang
Robert K. Katzschmann
An et al. (Wed,) studied this question.
www.synapsesocial.com/papers/696b2696d2a12237a9349e7a — DOI: https://doi.org/10.3929/ethz-c-000790221