To address modal heterogeneity, temporal asynchrony, and cognitive adaptation imbalance in multimodal real-time interaction, a CLT-driven multimodal real-time fusion architecture is proposed. Experiments on the HoloAssist dataset show that the proposed architecture achieves an interaction intention prediction accuracy of 95.2% ± 1.3%, which is 3.5 percentage points higher than that of the AlignMamba model. The end-to-end delay is 0.18 s ± 0.02 s, and the alignment delay is as low as 0.028 s. The subjective cognitive load score is 3.2 ± 0.8, significantly better than that of the baseline model. Ablation experiments confirm that each core module is crucial to the performance improvement, and the model is highly robust in scenarios with modality loss and noise interference. This research provides support for the practical implementation of real-time multimodal interaction technology.
Li et al. (Thu,) studied this question.