Traditional multimodal human–computer interaction models struggle to understand complex user intentions in dynamic scenarios, depend heavily on strong supervision signals, and process high-dimensional heterogeneous modal data slowly and with poor robustness. To address these challenges, this paper develops an artificial intelligence (AI)-powered multimodal human–computer interaction algorithm framework. First, it builds a unified latent space through cross-modal self-supervised contrastive learning, which aligns the modalities implicitly and mitigates semantic alignment issues. Second, it applies a Spatio-Temporal Graph Neural Network (ST-GNN) with graph attention mechanisms to exploit the spatiotemporal dependencies of multimodal behaviors, improving user intent recognition in dynamic scenarios. Third, to warm-start the human–computer interaction model, it combines two reinforcement learning techniques, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), in a hybrid environment to optimize both action selection and the agent's policy. Finally, it uses model-agnostic meta-learning (MAML) to enable fast user-personalized learning under low-sample conditions, and it jointly optimizes multi-source signals through a multimodal loss function to improve user intent recognition accuracy. Experimental results show that ST-GNN achieves an average accuracy of 96.22% in intent recognition tasks, while DQN-PPO reduces single-step decision time to 8.5 ms and memory usage in interactive tasks to 12.8%.
The algorithmic framework in this study integrates ST-GNN, DQN-PPO, and MAML. It significantly improves the accuracy of user intent recognition in dynamic contexts while also improving robustness and real-time system responsiveness, providing a technical basis for user interaction within digital media networks.
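To make the cross-modal contrastive alignment step more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired embeddings from two modalities. The abstract does not state which contrastive objective is used, so InfoNCE, the temperature value, the modality names (`z_gesture`, `z_speech`), and the synthetic data are all assumptions made for illustration only.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting candidates[i] as the match for anchors[i]."""
    a = [normalize(v) for v in anchors]
    c = [normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def symmetric_info_nce(z_a, z_b, temperature=0.1):
    """Average the loss over both matching directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))


# Hypothetical embeddings: 8 paired samples, 16 dimensions per modality.
random.seed(0)
z_gesture = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# The second modality is a slightly noisy copy, i.e. already well aligned.
z_speech = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in z_gesture]
# Rotating the rows by one breaks every pairing.
z_speech_misaligned = z_speech[1:] + z_speech[:1]

loss_aligned = symmetric_info_nce(z_gesture, z_speech)
loss_misaligned = symmetric_info_nce(z_gesture, z_speech_misaligned)
print(f"aligned pairs:    {loss_aligned:.4f}")
print(f"misaligned pairs: {loss_misaligned:.4f}")
```

In this sketch the loss is low when matching rows of the two modalities lie close together in the shared space and high when the pairing is broken, which is the signal a self-supervised encoder would be trained on without intent labels.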
Du et al. (Thu,) studied this question.