What question did this study set out to answer?

The research aims to improve speech emotion recognition by integrating multimodal data and enhancing emotional representation through PAD annotations.

April 4, 2026 · Open Access

PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset

Key Points

Constructed the STEM-E2VA dataset with four data modalities: articulatory kinematics, acoustics, glottal signals, and videos.
Utilized PAD continuous annotation alongside discrete emotion categories to enrich emotional representation.
Employed a multimodal supervised contrastive fusion network with a PAD-enhanced hybrid contrastive loss function.
Implemented a GRU–Transformer network for effective temporal feature extraction.
Achieved 85.47% accuracy in discrete sentiment recognition on the STEM-E2VA dataset.
Significantly reduced RMSE for PAD dimension predictions, indicating improved accuracy.
Demonstrated strong generalization capabilities on the IEMOCAP dataset.
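
The key points report RMSE for PAD predictions but do not say how it is aggregated. As a minimal sketch, assuming PAD denotes the pleasure, arousal, dominance triple and that RMSE is computed separately for each dimension, the metric could look like this. The function name `pad_rmse` and the tuple layout are illustrative assumptions, not taken from the paper.

```python
import math


def pad_rmse(predicted, annotated):
    """Per-dimension RMSE between predicted and annotated (P, A, D) triples.

    Assumption: each item is a 3-tuple ordered as (pleasure, arousal, dominance).
    """
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    names = ("pleasure", "arousal", "dominance")
    result = {}
    for i, name in enumerate(names):
        # Squared error on one dimension, averaged over all samples.
        squared = [(p[i] - a[i]) ** 2 for p, a in zip(predicted, annotated)]
        result[name] = math.sqrt(sum(squared) / len(squared))
    return result
```

Calling `pad_rmse(predictions, annotations)` returns a dictionary with one RMSE value per dimension, so a reduction can be checked dimension by dimension rather than only in aggregate.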

Abstract

Speech emotion recognition still faces three challenges: single-modal information has limited representational capacity, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods must still contend with structural differences between modalities and with cross-sample alignment. This study addresses these challenges from both the data and the model perspective. On the data side, we constructed STEM-E2VA, a Chinese multimodal database that synchronously collects four modalities: articulatory kinematics, acoustics, glottal signals, and video. The database covers seven discrete emotion categories and adds continuous PAD annotation. Combining discrete and continuous dimensional annotations lets it distinguish strong from weak emotions that share the same discrete label. To address bias in the PAD annotations, we used the SCL-90 psychological questionnaire to analyze annotators’ cognitive and emotional perceptions, thereby helping to ensure data reliability. On the model side, we propose a PAD-aware multimodal supervised contrastive fusion network. It uses a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment, and it combines a cross-attention mechanism with a GRU–Transformer network for temporal feature extraction to achieve deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments show that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. It also generalizes well to IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.
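
The abstract does not give the form of the PAD-enhanced hybrid contrastive loss. The sketch below shows one plausible reading, not the paper's formulation: a standard supervised contrastive loss in which each same-label positive is weighted by how close its PAD annotation is to the anchor's. The Gaussian weighting, the `temperature` and `sigma` parameters, and the function name `pad_weighted_supcon` are all assumptions made for illustration.

```python
import math


def _normalize(v):
    # Assumes a non-zero embedding vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pad_weighted_supcon(embeddings, labels, pads, temperature=0.1, sigma=1.0):
    """Supervised contrastive loss with PAD-distance weights on positives.

    Hypothetical sketch: positives share the anchor's discrete label, and each
    positive j gets weight exp(-||pad_i - pad_j||^2 / sigma).
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: _dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all other samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        weights = {
            j: math.exp(-sum((a - b) ** 2 for a, b in zip(pads[i], pads[j])) / sigma)
            for j in positives
        }
        weight_sum = sum(weights.values())
        # Weighted average of the positives' negative log-probabilities.
        total += -sum(weights[j] * (logits[j] - log_denom) for j in positives) / weight_sum
        anchors += 1
    return total / anchors if anchors else 0.0
```

Under this assumption, positives with similar PAD values pull the anchor harder than positives that share the label but differ in intensity. Putting embeddings from several modalities into the same batch would let positives span modalities, which is one way to cover both intra-modal and inter-modal alignment; whether the paper does this is not stated in the abstract.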

Cite This Study

Duan et al. studied this question.

synapsesocial.com/papers/69d0af83659487ece0fa57db https://doi.org/10.3390/mti10040038
