Contact-rich manipulation in industrial robotics faces significant challenges in skill transfer because conventional vision-based systems must infer contact forces indirectly from visual observations. To address this limitation, this study developed a vision-force fusion framework that combines visual and haptic measurements through learned attention mechanisms to generate manipulation trajectories satisfying contact constraints. The approach employs diffusion-based iterative refinement conditioned on multimodal observations from RGB-D cameras and force sensors. Experimental validation used 387 demonstrations from diverse contact-rich assembly tasks performed with a collaborative robot equipped with multimodal perception. The approach achieved an 87.3% success rate across target task variants, outperforming Diffusion Policy (78.6%), Action Chunking Transformer (73.9%), and Behavior Cloning (62.3%), a 25.0 percentage point improvement over the Behavior Cloning baseline. Ablation studies confirmed the necessity of multimodal fusion: the vision-only variant achieved 67.8% and the force-only variant 59.1%. Attention-based fusion outperformed linear weighted combination by 8.2 percentage points while maintaining a force-tracking root mean square error (RMSE) of 1.38 N. Cross-task generalization experiments showed success rates consistently above 86% across geometric variations. Under sensor degradation, the approach maintained 78.4% success with force noise and 75.8% with vision impairment, while achieving a 3.09-second inference time suitable for real-time control. These results establish that explicit integration of haptic measurements addresses limitations in vision-based force estimation, enabling more precise contact regulation for industrial assembly operations with tight tolerances and variable component geometries.
Xing et al. (Sat,) studied this question.
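
The contrast the abstract draws between attention-based fusion and linear weighted combination can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 8-dimensional features, the single query vector, and the scaled dot-product scoring are all assumptions made for illustration. The only point shown is that linear fusion applies the same weights to every observation, whereas attention computes the modality weights from the observation itself.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def linear_fusion(vision_feat, force_feat, w_vision=0.5):
    # Fixed-weight combination: identical weights for every observation.
    return w_vision * vision_feat + (1.0 - w_vision) * force_feat

def attention_fusion(vision_feat, force_feat, query):
    # Input-dependent weights from scaled dot-product scores.
    # `query` is a hypothetical stand-in for a learned parameter.
    feats = np.stack([vision_feat, force_feat])        # shape (2, d)
    scores = feats @ query / np.sqrt(query.shape[0])   # shape (2,)
    weights = softmax(scores)                          # sums to 1
    return weights @ feats, weights

# Illustrative random features; real features would come from
# RGB-D and force-sensor encoders.
rng = np.random.default_rng(0)
d = 8
vision_feat = rng.normal(size=d)
force_feat = rng.normal(size=d)
query = rng.normal(size=d)

fused, weights = attention_fusion(vision_feat, force_feat, query)
```

With a zero query both modalities receive weight 0.5, so the attention output reduces to the equal-weight linear combination; any other query shifts weight toward whichever modality's feature aligns better with it.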