Existing interactive AI (Artificial Intelligence) virtual human systems mostly rely on static face recognition, which makes it difficult to accurately recognize non-verbal information such as dynamic expressions, gestures, and postures. The result is delayed interactive responses and biased situational understanding. Traditional single-modal recognition methods are also limited in modeling temporal features and in fusing modalities. This paper proposes a multimodal recognition mechanism that integrates expressions, gestures, and postures, and builds a feature-integration framework based on a visual Transformer and a graph convolutional network to improve the virtual human’s recognition of complex visual signals. First, MobileFaceNet and Bi-LSTM (Bidirectional Long Short-Term Memory) jointly process multi-frame facial image sequences and extract dynamic expression trajectories, capturing both the continuity of expressions and small emotional fluctuations. Next, MediaPipe and HRNet (High-Resolution Network) are combined to extract hand keypoints and finger spacing, enabling high-precision recognition of complex gestures. A graph convolutional network then models the topology of the whole-body skeletal points, mapping posture expressions to interactive semantics. Finally, the expression, gesture, and posture features are integrated through a Transformer encoder, and a soft attention mechanism regulates the modality weights to drive the virtual human to generate context-adaptive feedback behaviors. Experiments show that the method reaches accuracies of 93.7%, 92.4%, and 94.0% on the expression, gesture, and posture recognition tasks, respectively, with an average mAP (mean Average Precision) of 93.4%, well above the 80.3% obtained by ResNet-50. Posture recognition maintains an accuracy of 88.6% in fast-motion scenarios, with an F1-score of up to 0.93, and the multimodal fusion module achieves a stability rate of 94.4%, supporting the method's cross-environment stability and real-time interaction capability.
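The soft attention step that regulates the modality weights can be illustrated with a minimal sketch. The abstract does not give the exact form of the fusion, so everything here is an assumption made for illustration: the function names `softmax` and `fuse_modalities`, the scaled dot-product scoring, the hand-written `query` vector (standing in for a learned context vector), and the toy four-dimensional features. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, query):
    """Soft-attention fusion of per-modality feature vectors.

    features: dict mapping modality name -> list of floats (same length each)
    query:    list of floats; a hypothetical stand-in for a learned context vector
    Returns (weights, fused): the attention weight per modality and the
    weighted sum of the modality features.
    """
    dim = len(query)
    for name, vec in features.items():
        if len(vec) != dim:
            raise ValueError(f"feature '{name}' has length {len(vec)}, expected {dim}")

    names = list(features)
    # Scaled dot-product score between the query and each modality feature.
    scores = [
        sum(q * x for q, x in zip(query, features[n])) / math.sqrt(dim)
        for n in names
    ]
    weights = softmax(scores)
    # The fused feature is the attention-weighted sum of the modality features.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy features (assumed values, not taken from the paper).
features = {
    "expression": [0.9, 0.1, 0.0, 0.2],
    "gesture": [0.2, 0.8, 0.1, 0.0],
    "posture": [0.1, 0.2, 0.7, 0.3],
}
query = [1.0, 0.0, 0.0, 0.0]  # a context that favors the first feature dimension

weights, fused = fuse_modalities(features, query)
```

With this query, the expression modality receives the largest weight because its feature aligns most closely with the context. A query of all zeros gives every modality the same weight, which reduces the fusion to a plain average.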
Guo et al. (Sun,) studied this question.