Non-visual motor guidance is essential for expert hands-on tasks, where visual attention must remain focused on the workspace. However, conveying complex 3D spatial information through wearable haptics remains challenging because of limited resolution and sensory bandwidth. To address the limitations of single sensory modalities, this study proposes a hybrid multimodal interface that decomposes 3D translation into 2D horizontal vibrotactile cues and 1D vertical auditory cues. We investigated the optimal spatial reference frame for this system in a within-subjects VR user study (N = 12), comparing a Local reference frame (Wrist) against Global reference frames (Ankle, Torso-4, and Torso-8) during bimanual reaching and tracking tasks. Bayesian multilevel analysis revealed that the Global reference frames consistently outperformed the Local frame in both tracking accuracy and subjective workload. We attribute these gains primarily to eliminating the cognitive cost of the continuous mental rotation that hand-centered mappings require. Furthermore, while higher haptic resolution (Torso-8) improved subjective confidence, it did not substantially enhance task performance compared with the lower-resolution Global conditions. These findings suggest that decoupling spatial dimensions across auditory and tactile modalities using global reference frames is an effective strategy for 3D guidance, reducing cognitive load and enhancing coordination in bimanual tasks. They also highlight the need for future studies exploring other multimodal configurations.
Suzuki et al. (Sun,) studied this question.
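As an illustration of the decomposition summarized above, the following minimal sketch splits a 3D position error into a horizontal component, mapped to one of several evenly spaced tactors, and a vertical component, mapped to an auditory pitch. It is not the implementation used in the study. The function names, the y-up axis convention, the tactor indexing, the `frame_yaw` parameter, and the pitch mapping (`base_hz`, `octaves_per_metre`) are all assumptions made for illustration.

```python
import math


def decompose_error(hand, target, n_tactors=8, frame_yaw=0.0):
    """Split a 3D position error into a tactor index and a vertical error.

    Assumed conventions (hypothetical, not taken from the study):
    - hand and target are (x, y, z) tuples in metres, with y pointing up;
    - tactors are evenly spaced in the horizontal plane, index 0 at +z
      (forward), with indices increasing toward +x (right);
    - frame_yaw is the yaw of the reference frame in radians: 0.0 for a
      fixed global frame, or the current hand yaw for a hand-centered frame.
    """
    dx = target[0] - hand[0]
    dy = target[1] - hand[1]
    dz = target[2] - hand[2]

    # Bearing of the horizontal error, measured from +z toward +x and
    # expressed relative to the chosen reference frame.
    bearing = (math.atan2(dx, dz) - frame_yaw) % (2 * math.pi)

    # Snap the bearing to the nearest tactor; wrap so that bearings just
    # below a full turn map back to tactor 0.
    sector = 2 * math.pi / n_tactors
    tactor = int(round(bearing / sector)) % n_tactors

    horizontal_distance = math.hypot(dx, dz)
    return tactor, horizontal_distance, dy


def vertical_to_pitch(dy, base_hz=440.0, octaves_per_metre=1.0):
    """Map a signed vertical error to a tone frequency (assumed mapping)."""
    return base_hz * 2.0 ** (octaves_per_metre * dy)


if __name__ == "__main__":
    # Target one metre to the right and half a metre above the hand.
    tactor, dist, dy = decompose_error((0, 0, 0), (1, 0.5, 0), n_tactors=4)
    print(tactor, dist, vertical_to_pitch(dy))
```

With `frame_yaw` fixed at 0.0, the tactor index depends only on the world-frame error, which corresponds to a global mapping. Passing the current hand yaw instead makes the same world-frame error map to different tactors as the hand rotates, which is the kind of remapping a hand-centered frame requires the user to resolve mentally.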