Multimodal data in virtual reality (VR) education are heterogeneous and semantically fragmented, which makes learner states difficult to model accurately. To address this, this paper proposes a network that combines cross-modal contrastive learning with dynamic graph attention fusion. The method first temporally encodes each modality, including eye-tracking, speech, pose, and interaction logs. Cross-modal contrastive learning then aligns semantically related multimodal segments in a unified latent space. Next, a heterogeneous graph is constructed with time steps as nodes and dynamic inter-modality correlations as edges, and a modality-aware graph attention mechanism adaptively aggregates the contribution of each modality at each time point. Finally, a graph neural network generates a fused representation that drives the educational state discrimination task. In experiments on a self-built VR education multimodal dataset, the proposed model achieves an accuracy of 87.6% and an F1 score of 0.912 on cognitive load level classification. These results indicate that the model supports accurate and interpretable modeling of learner states in VR education scenarios.
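The cross-modal alignment step can be illustrated with a minimal sketch. The abstract does not specify the loss, so the symmetric InfoNCE objective below is an assumption, as are the function name `info_nce`, the temperature value, the embedding sizes, and the synthetic "gaze" and "speech" embeddings; it is not the authors' implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of segment embeddings.

    a, b: arrays of shape (n, d). Row i of `a` and row i of `b` are assumed
    to come from the same time segment and form a positive pair; all other
    row combinations in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (n, n) similarity matrix

    def cross_entropy_diag(z):
        # Numerically stable row-wise log-softmax; the target is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Hypothetical embeddings: 8 segments, 16 dimensions per modality.
rng = np.random.default_rng(0)
gaze = rng.normal(size=(8, 16))
speech_aligned = gaze + 0.01 * rng.normal(size=(8, 16))   # matched segments
speech_shuffled = np.roll(speech_aligned, 1, axis=0)      # mismatched segments

aligned_loss = info_nce(gaze, speech_aligned)
shuffled_loss = info_nce(gaze, speech_shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is low when embeddings of the same segment from two modalities lie close together and high when they are mismatched, which is the property a contrastive alignment stage relies on.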
Xi et al. (Wed,) studied this question.