The emergence of large language models (LLMs) and their broad capabilities has opened up applications in many fields, including speech emotion recognition (SER). Despite advances in SER methods and the abundance of speech data, obtaining speech labeled with emotions remains difficult because human annotation is costly. In this study, we propose using LLMs to annotate emotional speech, investigating the use of conversation-sequence transcriptions and the incorporation of textual descriptors of acoustic features into the prompt. We also examine the use of the resulting annotations as training and augmentation data. Our experiments on the IEMOCAP dataset show that emotional speech annotation with LLMs can outperform human annotation, with possibly lower annotation costs. An SER model trained with the LLM annotations, either as the entire training set or as augmentation data, reaches performance close to that of state-of-the-art SER methods.
Santoso et al. studied this question.
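The prompt design mentioned in the abstract, which combines a conversation-sequence transcription with textual acoustic descriptors, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Utterance` fields, the pitch and energy thresholds, the four-label set, and the prompt wording are all hypothetical, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set (an assumption; the abstract does not list the labels).
EMOTIONS = ["angry", "happy", "neutral", "sad"]


@dataclass
class Utterance:
    """One conversational turn with illustrative acoustic statistics."""
    speaker: str
    text: str
    mean_pitch_hz: float    # assumed to be precomputed by some feature extractor
    mean_energy_db: float   # assumed to be precomputed by some feature extractor


def describe_level(value: float, low: float, high: float) -> str:
    """Map a numeric feature to a coarse textual descriptor."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def acoustic_description(u: Utterance) -> str:
    """Turn acoustic statistics into text; thresholds are illustrative guesses."""
    pitch = describe_level(u.mean_pitch_hz, 120.0, 220.0)
    energy = describe_level(u.mean_energy_db, -30.0, -15.0)
    return f"{pitch} pitch, {energy} energy"


def build_prompt(context: List[Utterance], target: Utterance) -> str:
    """Assemble an annotation prompt from preceding turns and the target turn."""
    lines = ["Conversation so far:"]
    for u in context:
        lines.append(f"{u.speaker}: {u.text}")
    lines.append("")
    lines.append(f"Target utterance ({target.speaker}): {target.text}")
    lines.append(f"Acoustic description: {acoustic_description(target)}")
    lines.append("Answer with one label from: " + ", ".join(EMOTIONS))
    return "\n".join(lines)
```

The string returned by `build_prompt` would then be sent to whichever LLM is used for annotation, and the returned label stored alongside the audio as a training or augmentation target.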