What question did this study set out to answer?

The research aims to enhance interactive behavior recognition in AR teaching systems by integrating multimodal data.

April 1, 2026

Research on intelligent recognition algorithm of AR enhanced teaching interaction based on multimodal data fusion

Key Points

The research aims to enhance interactive behavior recognition in AR teaching systems by integrating multimodal data.
Developed an algorithm framework for recognizing gestures, voices, and head movements.
Used a sliding window compensation algorithm for time alignment between modalities.
Extracted features from 3D gesture key points with a lightweight ST-GCN, from speech with an improved wav2vec 2.0 model, and from fused gyroscope and accelerometer data.
Implemented feature-level fusion with cross-modal attention and decision-level fusion using a multi-expert model.
Achieved 94.9% accuracy and a 92.3% F1 score with three-modality fusion.
Reduced the false alarm rate to 5.8%, outperforming single-modality recognition.
Performed well in medical anatomy, engineering disassembly, and scientific experiment scenes, but was limited in small-sample and multi-user interaction scenes.
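
The summary does not describe how the sliding window compensation algorithm works internally. As a rough illustration of the general idea only, the sketch below slides one stream against another within a bounded lag window and keeps the lag with the highest mean product of overlapping samples. The function names `best_lag` and `align`, the scoring rule, and the toy signals are all assumptions made for this example, not the authors' method.

```python
def best_lag(reference, other, max_lag):
    """Return the lag in [-max_lag, max_lag] that best matches the two streams.

    The score for a lag is the mean product of overlapping samples
    (an assumed, simple similarity measure).
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Pair reference[i] with other[i + lag] wherever both samples exist.
        pairs = [
            (reference[i], other[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(other)
        ]
        if not pairs:
            continue
        score = sum(a * b for a, b in pairs) / len(pairs)
        if score > best_score:
            best, best_score = lag, score
    return best


def align(reference, other, lag):
    """Pair reference[i] with other[i + lag], dropping unmatched samples."""
    return [
        (reference[i], other[i + lag])
        for i in range(len(reference))
        if 0 <= i + lag < len(other)
    ]


# Toy example: the second stream is the first one delayed by two samples.
gesture = [0, 0, 1, 3, 1, 0, 0, 0]
head_motion = [0, 0, 0, 0, 1, 3, 1, 0]
lag = best_lag(gesture, head_motion, max_lag=3)
print(lag)  # 2
print(align(gesture, head_motion, lag))
```

In a real system the streams would have different sampling rates and would need resampling to a common clock before a lag search like this one; that step is omitted here.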

Abstract

Interactive behavior recognition is a bottleneck in existing AR teaching systems: single-modality data struggles to meet the needs of complex dynamic interaction, and the temporal and spatial correlations among multimodal data are not modeled in depth. To address this, this study constructs an algorithm framework that integrates multimodal data such as gestures, voice, and head movements. The AR device integrates multiple sensors that collect data cooperatively, and a sliding window compensation algorithm is used to achieve time alignment. For feature extraction, the 3D coordinates of gesture key points are extracted with MediaPipe to construct a spatiotemporal graph, from which a lightweight spatiotemporal graph convolutional network (ST-GCN) extracts spatiotemporal features; speech features are extracted with an improved wav2vec 2.0 model; and gyroscope and accelerometer data are fused to compute continuous attitude parameters. A two-branch parallel fusion structure is designed: feature-level fusion uses a cross-modal attention gating mechanism, and decision-level fusion is based on a multi-expert mixture model. An LSTM network dynamically generates the gating weights. Furthermore, ST-GCN is combined with a Transformer semantic encoder to model the fused multimodal features in depth. Experiments were carried out on the self-built AR teaching dataset AMIB-6, which covers 6 typical teaching behaviors and contains 120 hours of multimodal data from 50 subjects. The results show that three-modality fusion reaches an accuracy of 94.9% and an F1 score of 92.3% and reduces the false alarm rate to 5.8%, significantly outperforming single-modality recognition. Ablation experiments verify the effectiveness of key modules such as the cross-modal attention mechanism. Scene adaptability analysis shows that the model performs well in medical anatomy, engineering disassembly, and scientific experiment scenes, but it is limited by recognition ambiguity caused by cross-talk in small-sample and multi-user interaction scenes. This study provides a new approach to the intelligent recognition of AR teaching interaction behavior and helps improve the interactive performance and user experience of AR teaching systems.
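
The gating idea behind the fusion stage can be illustrated with a minimal sketch: each modality gets a relevance score, a softmax turns the scores into gate weights, and the fused feature is the weighted sum of the per-modality feature vectors. This is a simplified stand-in, not the paper's architecture. In the described system the scores would come from learned cross-modal attention and an LSTM, whereas here they are hand-picked numbers, and the names `softmax` and `gated_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, scores):
    """Fuse equal-length modality feature vectors with softmax gate weights.

    features: one feature vector per modality (all the same length).
    scores:   one relevance score per modality (assumed given here).
    Returns the fused vector and the gate weights.
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights


# Toy example with three modalities: gesture, speech, head movement.
gesture = [1.0, 0.0]
speech = [0.0, 1.0]
head = [1.0, 1.0]
fused, weights = gated_fusion([gesture, speech, head], scores=[2.0, 0.5, 0.5])
print(weights)  # the gesture modality receives the largest weight
print(fused)
```

The same weighted-sum pattern applies at the decision level if the feature vectors are replaced by per-expert class probability vectors.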


Cite This Study

Zhang et al. (Sun,) studied this question.

synapsesocial.com/papers/69ccb6ce16edfba7beb8893b https://doi.org/10.1049/icp.2026.0175
