We study emotion-aware speech-to-text translation (ST) through the lens of generative error correction (GER) with large language models (LLMs). First, we improve the translation of emotional speech by adopting the GER paradigm: an LLM is fine-tuned to generate the translation from the decoded N-best hypotheses. Next, we incorporate emotion and sentiment labels into LLM fine-tuning so that the model can take the emotional content into account. Moreover, we introduce Describe-then-Translate (DtT), a simple yet effective rationale-style supervision in which the GER model first predicts the emotion in a brief natural-language sentence and then generates the translation, which aligns with the LLM’s language-modeling objective. We conduct experiments on the English-Chinese BMELD dataset with different N-best generation strategies and GER models. The results show that combining GER with emotion/sentiment labels is effective and that DtT improves over label-only emotion supervision.
Yang et al. (Thu,) studied this question.