Speech emotion recognition models still fall short of satisfactory performance because of the complexity of emotions. A common problem in most previous studies is that certain emotions are severely misclassified. In this article, we propose a novel framework that integrates a cascaded attention network with an adversarial joint loss strategy for speech emotion recognition. It aims to resolve these confusions by placing more emphasis on the emotions that are difficult to classify correctly. First, we extract log-Mels together with their deltas and delta-deltas as 3D features to effectively reduce the interference of external factors. Next, we introduce a cascaded attention network to extract effective emotional features: spatiotemporal attention selectively locates the targeted emotional regions in the input features, and within these regions self-attention with head fusion captures the long-distance dependencies of temporal features. Finally, we propose an adversarial joint loss strategy that distinguishes highly similar emotional embeddings by means of hard triplets generated in an adversarial fashion. To evaluate the proposed method, we perform experiments on the IEMOCAP, CASIA, and EMODB corpora. The experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches on all three datasets.
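The 3D feature step described in the abstract (log-Mels plus their deltas and delta-deltas) can be illustrated with a minimal sketch. It assumes a log-Mel spectrogram is already available as an array of shape (n_mels, n_frames) and uses the common regression-based delta formula with edge padding. The function names `delta` and `stack_3d_features` and the window half-width of 2 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def delta(feat: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time (last) axis.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    Frames beyond the edges are filled by repeating the first/last frame.
    The half-width of 2 is an assumed default, not a value from the paper.
    """
    n_frames = feat.shape[-1]
    pad_spec = [(0, 0)] * (feat.ndim - 1) + [(width, width)]
    padded = np.pad(feat, pad_spec, mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros(feat.shape, dtype=np.float64)
    for n in range(1, width + 1):
        ahead = padded[..., width + n : width + n + n_frames]
        behind = padded[..., width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_3d_features(log_mel: np.ndarray) -> np.ndarray:
    """Stack log-Mel, delta, and delta-delta into shape (3, n_mels, n_frames)."""
    d1 = delta(log_mel)   # first-order temporal difference
    d2 = delta(d1)        # second-order temporal difference
    return np.stack([log_mel, d1, d2], axis=0)


if __name__ == "__main__":
    # Synthetic stand-in for a real log-Mel spectrogram (40 bands, 300 frames).
    rng = np.random.default_rng(0)
    fake_log_mel = rng.normal(size=(40, 300))
    features = stack_3d_features(fake_log_mel)
    print(features.shape)
```

The resulting three-channel array can be treated like an image with three channels, which is one plausible way such features are fed to a convolutional or attention-based front end.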
Yang Liu
Haoqin Sun
Wenbo Guan
IEEE/ACM Transactions on Audio Speech and Language Processing
Chinese Academy of Sciences
Institute of Automation
Qingdao University of Science and Technology
Liu et al. studied this question.
www.synapsesocial.com/papers/6a01bc41897643a80dcb0437 — DOI: https://doi.org/10.1109/taslp.2023.3245401