What question did this study set out to answer?

This research aims to improve skill transfer in contact-rich robotic tasks by integrating visual and haptic inputs.

May 6, 2026

Visual-haptic fusion for contact-rich robot skill transfer via diffusion models

Key Points

Developed a vision-force fusion framework combining visual and haptic measurements
Employed diffusion-based iterative refinement using RGB-D cameras and force sensors
Utilized attention mechanisms for trajectory generation and contact constraints
Achieved an 87.3% success rate in skill transfer, outperforming Diffusion Policy (78.6%), Action Chunking Transformer (73.9%), and Behavior Cloning (62.3%)
Attention-based fusion scored 8.2 percentage points higher than linear weight combinations
Reported an inference time of 3.09 seconds, described as suitable for real-time control
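
The difference between attention-based fusion and a fixed linear combination of modalities can be illustrated with a minimal sketch. It is not the study's implementation: the feature vectors, the `query` vector, and the function names are hypothetical, and a trained model would use learned projections rather than raw features. The point is only that attention weights depend on the current observation, while linear weights stay fixed.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def linear_fusion(vision_feat, force_feat, w_vision=0.5):
    """Fixed linear weighting: the same weights for every observation."""
    return w_vision * vision_feat + (1.0 - w_vision) * force_feat

def attention_fusion(vision_feat, force_feat, query):
    """Scaled dot-product attention over two modality tokens.

    `query` is a hypothetical context vector; in a trained model it
    would come from a learned projection of the robot state.
    """
    tokens = np.stack([vision_feat, force_feat])        # shape (2, d)
    scores = tokens @ query / np.sqrt(query.shape[0])   # shape (2,)
    weights = softmax(scores)                           # sums to 1
    return weights @ tokens, weights

# Toy features (assumed values, for illustration only).
vision = np.array([1.0, 0.0, 0.0, 0.0])
force = np.array([0.0, 1.0, 0.0, 0.0])
in_contact_query = np.array([0.0, 4.0, 0.0, 0.0])

fused, weights = attention_fusion(vision, force, in_contact_query)
print("attention weights (vision, force):", weights)
print("linear fusion:", linear_fusion(vision, force))
```

With this toy query the force token receives the larger weight, which mimics a controller relying more on haptic input once contact is made.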

Abstract

Skill transfer for contact-rich manipulation in industrial robotics remains difficult because conventional vision-based systems must infer contact forces indirectly from visual observations. To address this limitation, this study developed a vision-force fusion framework that combines visual and haptic measurements through learned attention mechanisms to generate manipulation trajectories satisfying contact constraints. The approach employs diffusion-based iterative refinement conditioned on multimodal observations from RGB-D cameras and force sensors. Experimental validation used 387 demonstrations from diverse contact-rich assembly tasks using a collaborative robot with multimodal perception. The approach achieved an 87.3% success rate across target task variants, outperforming Diffusion Policy (78.6%), Action Chunking Transformer (73.9%), and Behavior Cloning (62.3%), a 25.0 percentage point improvement over the Behavior Cloning baseline. Ablation studies confirmed the necessity of multimodal fusion: vision-only input achieved 67.8% and force-only input 59.1%. Attention-based fusion scored 8.2 percentage points higher than linear weight combinations while maintaining a force-tracking root mean square error (RMSE) of 1.38 N. Cross-task generalization experiments showed consistent performance above 86% across geometric variations. Under sensor degradation, the approach maintained 78.4% success with force noise and 75.8% with vision impairment, with a reported 3.09-second inference time that the authors consider suitable for real-time control. These results establish that explicit integration of haptic measurements addresses limitations in vision-based force estimation, enabling more precise contact regulation for industrial assembly operations with tight tolerances and variable component geometries.
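
The abstract does not give the details of the diffusion-based iterative refinement, so the following is only a generic DDPM-style sampling loop under stated assumptions. The noise schedule, the step count, and the `oracle_noise_predictor` are all hypothetical. The oracle stands in for a learned network that would be conditioned on RGB-D and force observations; here it is given the target trajectory directly so that the loop can be run and checked.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values, not from the study)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def oracle_noise_predictor(x_t, t, alpha_bars, target):
    """Stand-in for a learned network eps(x_t, t, observation).

    It returns the exact noise that maps `target` to `x_t`, which a
    trained, observation-conditioned model would only approximate.
    """
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def refine_trajectory(target, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(target.shape)
    for t in reversed(range(num_steps)):
        eps = oracle_noise_predictor(x, t, alpha_bars, target)
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(target.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Hypothetical 8-waypoint, 3-D end-effector trajectory.
target_traj = np.linspace(0.0, 1.0, 24).reshape(8, 3)
refined = refine_trajectory(target_traj)
print("max abs error:", np.max(np.abs(refined - target_traj)))
```

Because the oracle is exact, the loop recovers the target trajectory; with a learned predictor the result would instead be a sample from the distribution of trajectories consistent with the observations.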
