Interactive behavior recognition in existing AR teaching systems faces two bottlenecks: single-modality data cannot meet the needs of complex dynamic interaction, and the temporal and spatial correlations of multimodal data are not modeled in depth. To address them, this study constructs an algorithm framework that integrates gesture, speech, and head-pose data. The AR device integrates multiple sensors to collect data cooperatively, and a sliding-window compensation algorithm is used to achieve time alignment. For feature extraction, the 3D coordinates of gesture key points are extracted with MediaPipe to construct a spatiotemporal graph, from which a lightweight spatiotemporal graph convolutional network (ST-GCN) extracts spatiotemporal features; speech features are extracted with an improved wav2vec 2.0 model; and gyroscope and accelerometer data are fused to calculate continuous attitude parameters. A dual-path parallel fusion structure is designed: feature-level fusion uses a cross-modal attention gating mechanism, and decision-level fusion is based on a mixture-of-experts model. An LSTM network dynamically generates the gating weights. Furthermore, ST-GCN is combined with a Transformer semantic encoder to model the fused multimodal features in depth. Experiments were carried out on the self-built AR teaching dataset AMIB-6, which contains 6 typical teaching behaviors and 120 hours of multimodal data from 50 subjects. The results show that three-modality fusion reaches an accuracy of 94.9% and an F1 score of 92.3%, and reduces the false alarm rate to 5.8%, significantly outperforming single-modality recognition. Ablation experiments verify the effectiveness of key modules such as the cross-modal attention mechanism.
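The time-alignment step can be illustrated with a small sketch. The abstract does not specify how the sliding-window compensation algorithm works, so the code below is only an assumption: it matches each reference timestamp (for example, a gesture frame) to the nearest sample of another sensor stream, and accepts the match only if it lies inside a tolerance window. The function name `align_to_reference`, the sampling rates, and the window size are hypothetical.

```python
from bisect import bisect_left


def align_to_reference(ref_times, stream_times, stream_values, window):
    """Match each reference timestamp to the nearest stream sample.

    Hypothetical helper, not the paper's algorithm. `stream_times` must be
    sorted in ascending order. A match is accepted only when it lies within
    +/- `window` of the reference time; otherwise None is stored.
    """
    aligned = []
    for t in ref_times:
        i = bisect_left(stream_times, t)
        # The nearest sample is either just before or just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_times)]
        best = min(candidates, key=lambda j: abs(stream_times[j] - t), default=None)
        if best is not None and abs(stream_times[best] - t) <= window:
            aligned.append(stream_values[best])
        else:
            aligned.append(None)
    return aligned


# Assumed timestamps in milliseconds: ~30 Hz gesture frames, 100 Hz IMU samples.
gesture_t = [0, 33, 67, 100]
imu_t = list(range(0, 80, 10))          # 0, 10, ..., 70
imu_v = [f"imu{k}" for k in range(8)]   # placeholder sample payloads

matched = align_to_reference(gesture_t, imu_t, imu_v, window=5)
print(matched)
```

Frames with no sample inside the window are left as `None`, so a later stage can interpolate or drop them; which of the two the original system does is not stated.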
The scene adaptability analysis shows that the model performs well in medical anatomy, engineering disassembly, and scientific experiment scenes, but that it is limited in small-sample and multi-user interaction scenes by recognition ambiguity caused by cross-talk. This study offers a new approach to the intelligent recognition of interaction behavior in AR teaching, which helps improve the interactive performance and user experience of AR teaching systems.
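The decision-level fusion described above, a mixture of experts with dynamically generated gating weights, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each modality is treated as one expert that outputs a class distribution, and the gate scores, which the paper obtains from an LSTM network, are passed in here as plain numbers. The names `fuse_decisions` and `gate_logits` and all numeric values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(expert_probs, gate_logits):
    """Weighted mixture of per-modality class distributions.

    expert_probs: dict mapping modality name -> list of class probabilities.
    gate_logits:  dict mapping modality name -> unnormalized gate score
                  (assumed here; in the paper an LSTM produces the weights).
    Returns (fused distribution, normalized gate weights).
    """
    names = list(expert_probs)
    weights = softmax([gate_logits[n] for n in names])
    n_classes = len(expert_probs[names[0]])
    fused = [0.0] * n_classes
    for w, name in zip(weights, names):
        for c, p in enumerate(expert_probs[name]):
            fused[c] += w * p
    return fused, dict(zip(names, weights))


# Made-up outputs of three experts over three behavior classes.
probs = {
    "gesture": [0.7, 0.2, 0.1],
    "speech": [0.2, 0.6, 0.2],
    "head_pose": [0.5, 0.3, 0.2],
}
# Made-up gate scores that favor the gesture expert.
logits = {"gesture": 1.0, "speech": 0.0, "head_pose": 0.0}

fused, weights = fuse_decisions(probs, logits)
predicted_class = fused.index(max(fused))
print(fused, weights, predicted_class)
```

Because the gate weights sum to one and each expert outputs a valid distribution, the fused vector is itself a distribution, so the predicted behavior is simply its argmax.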
Zhang et al. (Sun,) studied this question.