Research has shown that enhanced perception and inference capabilities enable socially intelligent robots to make more informed decisions. In human‐aware interactions, a reliable understanding of communication cues, both verbal and non‐verbal, is essential. However, limited real‐world evidence exists on how multimodal perception can compensate for the shortcomings of single modalities. This study investigates the combined use of verbal and non‐verbal cues for orchestrating robot navigation and motion control. We integrate a human pose estimation pipeline for gesture and posture recognition with a virtual assistant pipeline enhanced by a local LLM for natural language understanding. The gesture module is designed for single‐user interaction, although the system is also tested in scenarios where a second participant is present. Participants in a living lab provided gestures, poses, and voice commands to a mobile service robot. The system uses LiDAR and an RGB‐D camera for trajectory tracking and gesture recognition, a microphone accompanied by a natural language processing pipeline, and a local LLM for interpreting spoken commands and generating navigation instructions. Performance is evaluated in terms of motion control accuracy and conflict resolution between modalities. Results demonstrate recognition accuracies of 95% for gesture‐based and 93% for voice‐based commands, with no conflicts during multimodal arbitration. These findings indicate that multimodal cues can enhance reliability, safety, efficiency, and robustness in noisy or occluded environments. Overall, the study provides an engineering approach and empirical evidence supporting multimodal perception and inference in human–robot interaction, highlighting its potential for developing autonomous, socially aware assistants in ubiquitous environments.
Aboki et al. studied this question.
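The abstract reports conflict resolution between the gesture and voice modalities but does not state the arbitration rule itself. As a rough illustration only, the sketch below shows one plausible scheme, assuming each recognizer emits a command with a confidence score. The `Command` type, the action labels, the `SAFETY_ACTIONS` set, and the `min_conf` threshold are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    """A recognized command from one modality (hypothetical structure)."""
    modality: str      # "gesture" or "voice"
    action: str        # illustrative labels such as "stop" or "follow"
    confidence: float  # recognizer score in [0, 1]


# Assumption: safety-critical actions override any other command.
SAFETY_ACTIONS = {"stop"}


def arbitrate(
    gesture: Optional[Command],
    voice: Optional[Command],
    min_conf: float = 0.6,
) -> Optional[Command]:
    """Pick one command from up to two modalities.

    Rule order (an assumed policy, not the authors' method):
    1. Discard commands below the confidence threshold.
    2. If any remaining command is a safety action, return it.
    3. Otherwise return the remaining command with the highest confidence.
    """
    candidates = [
        c for c in (gesture, voice)
        if c is not None and c.confidence >= min_conf
    ]
    if not candidates:
        return None  # nothing reliable enough to act on

    for c in candidates:
        if c.action in SAFETY_ACTIONS:
            return c

    return max(candidates, key=lambda c: c.confidence)
```

With this policy, a confident "follow" gesture paired with a spoken "stop" resolves to "stop", while two non-safety commands resolve to whichever recognizer is more confident. A deployed system would likely also need time-window alignment between the two input streams, which this sketch leaves out.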