English-speaking anxiety significantly impairs learners' communication, which makes automated diagnosis important. Existing multimodal approaches are limited in both deep feature extraction and modality fusion. This study proposes MSATC, a co-attention-based multimodal fusion model. On the speech side, the S-ABHC model extracts deep acoustic features from MFCCs, spectrograms, and raw waveforms using BiGRU and HuBERT; on the text side, RoBERTa captures sentiment. The co-attention module enables bidirectional, adaptive interaction between the two feature streams, producing discriminative joint representations. Experiments on IEMOCAP show a weighted accuracy of 77.41% and an unweighted accuracy of 78.66%, outperforming existing models.
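The bidirectional interaction described above can be illustrated with a minimal NumPy sketch of a generic co-attention step. This is not the MSATC implementation: the bilinear affinity matrix, the mean pooling, the function name `co_attention`, and all shapes (5 speech frames, 3 text tokens, feature size 8) are assumptions chosen for illustration, and random arrays stand in for the S-ABHC and RoBERTa features.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def co_attention(speech, text, weight):
    """Generic co-attention between two feature sequences (illustrative only).

    speech: (n_s, d) acoustic features, text: (n_t, d) text features,
    weight: (d, d) bilinear matrix (learnable in a real model, fixed here).
    """
    # Affinity between every speech frame and every text token: (n_s, n_t).
    affinity = speech @ weight @ text.T

    # Direction 1: each speech frame attends over the text tokens.
    speech_to_text = softmax(affinity, axis=1)      # (n_s, n_t)
    # Direction 2: each text token attends over the speech frames.
    text_to_speech = softmax(affinity.T, axis=1)    # (n_t, n_s)

    # Context vectors: text summarized per frame, speech summarized per token.
    speech_ctx = speech_to_text @ text              # (n_s, d)
    text_ctx = text_to_speech @ speech              # (n_t, d)

    # Assumed fusion: mean-pool each attended sequence and concatenate.
    joint = np.concatenate([speech_ctx.mean(axis=0), text_ctx.mean(axis=0)])
    return joint, speech_to_text, text_to_speech


# Hypothetical stand-ins for the acoustic and text features.
rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(5, 8))
text_feats = rng.normal(size=(3, 8))
bilinear = rng.normal(size=(8, 8))

joint_repr, s2t, t2s = co_attention(speech_feats, text_feats, bilinear)
print(joint_repr.shape, s2t.shape, t2s.shape)
```

Each row of the two attention maps sums to one, so every speech frame is re-expressed as a weighted mix of text tokens and vice versa. In a trained model the bilinear matrix would be learned and the joint vector would feed a classifier.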
Zhou et al. (Sun,) studied this question.