What question did this study set out to answer?

This research explores how combining verbal and non-verbal cues can improve robot navigation and interaction.

April 30, 2026 · Open Access

Multimodal Human–Robot Interaction Using Human Pose Estimation and Local Large Language Models

Key Points

Utilized human pose estimation for gesture and posture recognition.
Integrated natural language processing with a local large language model for voice command interpretation.
Evaluated performance based on motion control accuracy in a living lab setting.
Achieved 95% accuracy in gesture recognition and 93% in voice command recognition.
No conflicts arose during multimodal arbitration between gesture and voice commands.
Findings suggest that multimodal approaches enhance reliability and efficiency in noisy or occluded environments.

Abstract

Research has shown that enhanced perception and inference capabilities enable socially intelligent robots to make more informed decisions. In human‐aware interactions, a reliable understanding of communication cues, both verbal and non‐verbal, is essential. However, limited real‐world evidence exists on how multimodal perception can compensate for the shortcomings of single modalities. This study investigates the combined use of verbal and non‐verbal cues for orchestrating robot navigation and motion control. We integrate a human pose estimation pipeline for gesture and posture recognition with a virtual assistant pipeline enhanced by a local LLM for natural language understanding. The gesture module is designed for single‐user interaction, although the system is also tested in scenarios where a second participant is present. Participants in a living lab provided gestures, poses, and voice commands to a mobile service robot. The system uses LiDAR and an RGB‐D camera for trajectory tracking and gesture recognition, a microphone accompanied by a natural language processing pipeline, and a local LLM for interpreting spoken commands and generating navigation instructions. Performance is evaluated in terms of motion control accuracy and conflict resolution between modalities. Results demonstrate recognition accuracies of 95% for gesture‐based and 93% for voice‐based commands, with no conflicts during multimodal arbitration. These findings indicate that multimodal cues can enhance reliability, safety, efficiency, and robustness in noisy or occluded environments. Overall, the study provides an engineering approach and empirical evidence supporting multimodal perception and inference in human–robot interaction, highlighting its potential for developing autonomous, socially aware assistants in ubiquitous environments.
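
The abstract reports that gesture and voice commands are arbitrated without conflicts but does not state the arbitration policy. As a minimal sketch of what such a step could look like, the Python below picks one command from the two modalities. Everything in it is an assumption for illustration and not the authors' method: the `Command` type, the `arbitrate` function, the confidence threshold, the time window, and the rule that a "stop" command always wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    """Hypothetical recognizer output; not a type from the paper."""
    modality: str      # "gesture" or "voice"
    action: str        # e.g. "stop", "follow", "go_to_kitchen"
    confidence: float  # recognizer confidence in [0, 1]
    timestamp: float   # time of recognition, in seconds


# Assumed tuning values, chosen only for this illustration.
SAFETY_ACTIONS = {"stop"}
MIN_CONFIDENCE = 0.6
FUSION_WINDOW_S = 1.5


def arbitrate(gesture: Optional[Command],
              voice: Optional[Command]) -> Optional[Command]:
    """Return the single command to execute, or None if nothing is usable."""
    # Discard missing or low-confidence inputs.
    candidates = [c for c in (gesture, voice)
                  if c is not None and c.confidence >= MIN_CONFIDENCE]
    if not candidates:
        return None

    # A safety action from either modality takes priority.
    for c in candidates:
        if c.action in SAFETY_ACTIONS:
            return c

    if len(candidates) == 1:
        return candidates[0]

    first, second = candidates
    # Commands far apart in time are treated as sequential: newest wins.
    if abs(first.timestamp - second.timestamp) > FUSION_WINDOW_S:
        return max(candidates, key=lambda c: c.timestamp)

    # Near-simultaneous commands (agreeing or not): trust the more confident one.
    return max(candidates, key=lambda c: c.confidence)
```

For example, a confident "stop" gesture would override a simultaneous "follow" voice command under these assumed rules, while a noisy environment that drops the voice confidence below the threshold would leave the gesture as the only candidate. This mirrors the abstract's point that one modality can compensate for the other.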
