ABSTRACT This paper proposes an attention‐based convolutional neural network (CNN) for sensor‐free robotic arm control that improves six‐dimensional (6D) pose estimation and end‐effector operation in an end‐to‐end manner. Unlike traditional methods that rely on explicit feature engineering or sensor feedback, our approach embeds an attention mechanism in the convolutional backbone to enhance spatial awareness. The proposed localization sub‐module scores each prior regime through a weighted average of activation maps, allowing the network to focus on the most informative regions of the input. We also introduce a two‐phase training methodology that requires only image‐level annotations. In the first phase, the network learns to extract discriminative features from synthetic images, which are crucial for accurate 6D pose prediction. In the second phase, a reinforcement learning agent that uses the trained vision model as its sensory module is optimized with a sparse reward function to refine its action policy. Experimental evaluations in two virtual scenarios show that our method outperforms popular CNN‐based approaches in both accuracy and efficiency. Specifically, it improves task success rates by 52.9% and reduces position error by 72.3% compared to baseline models, demonstrating its effectiveness for sensor‐free robotic arm control.
Wu et al. (Wed,) studied this question.
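The abstract describes the localization sub‐module as scoring regions through a weighted average of activation maps. The following is a minimal illustrative sketch of that idea only, not the authors' implementation. It assumes that the per‐map weights come from a softmax over hypothetical logits, and that a region is an axis‐aligned rectangle scored by the mean of the combined map inside it. The names `weighted_activation_map`, `region_score`, and `channel_logits` are introduced here for illustration and do not appear in the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_activation_map(activation_maps, channel_logits):
    """Combine K activation maps (each H x W) into one H x W map.

    Assumption: the weights are a softmax over per-map logits, so they are
    non-negative and sum to 1, which makes the result a weighted average.
    """
    if len(activation_maps) != len(channel_logits):
        raise ValueError("need exactly one logit per activation map")
    weights = softmax(channel_logits)
    height = len(activation_maps[0])
    width = len(activation_maps[0][0])
    combined = [[0.0] * width for _ in range(height)]
    for weight, amap in zip(weights, activation_maps):
        for r in range(height):
            for c in range(width):
                combined[r][c] += weight * amap[r][c]
    return combined


def region_score(combined, region):
    """Mean of the combined map inside a half-open box (r0, r1, c0, c1).

    Assumption: rectangular regions and mean pooling are illustrative choices.
    """
    r0, r1, c0, c1 = region
    cells = [combined[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(cells) / len(cells)
```

With equal logits the two maps are averaged uniformly, so a region that is strongly activated in either map receives a higher score than the image as a whole. In a trained network the weights would be learned rather than fixed.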