What question did this study set out to answer?

The aim is to develop an efficient framework for multimodal human-computer interaction that enhances user intent recognition.

March 28, 2026 | Open Access

Analysis of multimodal human–computer interaction algorithms in artificial intelligence-driven digital media networks

Key Points

Build a unified latent space through self-supervised contrastive learning.
Use a Spatio-Temporal Graph Neural Network (ST-GNN) for user intent recognition.
Merge Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) for action selection.
Implement model-agnostic meta-learning (MAML) for rapid user-personalized learning.
ST-GNN achieves 96.22% accuracy in intent recognition tasks.
DQN-PPO reduces decision time to 8.5 ms.
Memory usage in interactive tasks is decreased to 12.8%.
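
To make the first key point concrete, the sketch below shows one common way a contrastive objective can pull paired embeddings from two modalities into a shared latent space. It is an illustration only: the summary does not state which contrastive loss the authors use, so the symmetric InfoNCE form, the temperature value, and the toy `speech` and `gesture` embeddings are all assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] matches positives[i]; every other
    positives[j] in the batch serves as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # equals -log softmax(logits)[i]
    return total / len(a)

def symmetric_info_nce(mod_a, mod_b, temperature=0.1):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(mod_a, mod_b, temperature)
                  + info_nce(mod_b, mod_a, temperature))

# Hypothetical toy embeddings for three interaction events (not from the paper).
speech = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
gesture = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
gesture_shuffled = [gesture[1], gesture[2], gesture[0]]  # deliberately mispaired

aligned_loss = symmetric_info_nce(speech, gesture)
shuffled_loss = symmetric_info_nce(speech, gesture_shuffled)
print(f"aligned pairs:  {aligned_loss:.4f}")
print(f"shuffled pairs: {shuffled_loss:.4f}")
```

Correctly paired embeddings yield a lower loss than mispaired ones, which is the signal that lets an encoder learn alignment without labels.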

Abstract

Traditional multimodal human–computer interaction models struggle to understand complex user intentions in dynamic scenarios, depend heavily on strong supervision signals, and process high-dimensional heterogeneous modal data slowly and with poor robustness. To address these challenges, this paper develops an artificial intelligence (AI)-powered multimodal human–computer interaction algorithm framework. First, it builds a unified latent space through cross-modal self-supervised contrastive learning, which aligns the modalities implicitly and mitigates semantic alignment issues. Second, it applies a Spatio-Temporal Graph Neural Network (ST-GNN) to improve user intent recognition in dynamic scenarios, using graph attention mechanisms to exploit the spatiotemporal dependencies of multimodal behaviors. Third, to warm-start the human–computer interaction model, it merges two reinforcement learning techniques, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), in a hybrid environment to optimize action selection and the agent's policy. Finally, it implements model-agnostic meta-learning (MAML) to enable rapid user-personalized learning under low-sample conditions, and it jointly optimizes multi-source signals through a multimodal loss function to improve user intent recognition accuracy. Experimental results show that ST-GNN achieves an average accuracy of 96.22% in intent recognition tasks, while DQN-PPO reduces single-step decision time to 8.5 ms and memory usage in interactive tasks to 12.8%.
By combining ST-GNN, DQN-PPO, and MAML, the framework significantly improves the accuracy of user intent recognition in dynamic contexts, along with the robustness and real-time responsiveness of the system, and provides a technical basis for user interaction within digital media networks.
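
The MAML step mentioned in the abstract can be illustrated with a deliberately tiny example. The sketch below is not the authors' implementation: it assumes each "user" is a one-parameter regression task `y = slope * x`, and the slopes, learning rates, and sample sizes are hypothetical. With a single scalar parameter, the second-order MAML gradient can be written out exactly by the chain rule.

```python
import random

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """First derivative of mse with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of mse with respect to w (independent of w here)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def adapt(w, support, inner_lr):
    """Inner loop: one gradient step on a user's few support samples."""
    return w - inner_lr * grad(w, support)

def maml_meta_gradient(w, support, query, inner_lr):
    """Outer-loop gradient of the post-adaptation query loss w.r.t. the
    initial w. Chain rule: dL_q(w')/dw = L_q'(w') * (1 - inner_lr * L_s''(w))."""
    w_adapted = adapt(w, support, inner_lr)
    return grad(w_adapted, query) * (1.0 - inner_lr * hessian(support))

def sample_task(rng, slope, n=5):
    """Draw n support and n query points from one hypothetical user's task."""
    data = [(x, slope * x) for x in (rng.uniform(-1, 1) for _ in range(2 * n))]
    return data[:n], data[n:]

rng = random.Random(0)
user_slopes = [1.5, 2.0, 2.5, 3.0]   # assumed training users
w, inner_lr, outer_lr = 0.0, 0.5, 0.1

for _ in range(200):                  # meta-training
    meta_grad = 0.0
    for slope in user_slopes:
        support, query = sample_task(rng, slope)
        meta_grad += maml_meta_gradient(w, support, query, inner_lr)
    w -= outer_lr * meta_grad / len(user_slopes)

# A new user with only five support samples.
support, query = sample_task(rng, 2.2)
loss_before = mse(w, query)
loss_after = mse(adapt(w, support, inner_lr), query)
print(f"meta-learned init: {w:.3f}")
print(f"query loss before adaptation: {loss_before:.5f}")
print(f"query loss after one step:    {loss_after:.5f}")
```

The meta-learned initialization settles among the training users' slopes, so a single inner step on a handful of samples already lowers the loss for an unseen user. That is the low-sample personalization behavior the abstract attributes to MAML.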

Cite This Study

Du et al. studied this question.

synapsesocial.com/papers/69c770f78bbfbc51511e0d5f https://doi.org/10.1007/s10791-026-10074-4
