This study proposes an emotion recognition and generation model for cello performance, aimed at intelligent music education and emotional interaction. The model fuses the self-supervised Waveform-based Language Model (WavLM) with performance dynamics features (volume, rhythm, glissando) and adds a three-stage generation module based on diffusion transformers, which improves both emotion recognition and audio generation. Combined with a discriminator and a joint training mechanism, the model achieves a 5.7% improvement in accuracy and a 0.069 increase in Macro-F1 on the Database for Emotional Analysis in Music (DEAM) dataset. The generation module outperforms existing models on Mean Opinion Score (MOS, 4.12), Perceptual Evaluation of Speech Quality (PESQ, 3.48), and Fréchet Audio Distance (FAD, 2.74), with an emotional expression accuracy of 81.6%. The discriminator module achieves a Macro-F1 of 0.762 and a mean squared error (MSE) of 0.0207. The joint training strategy significantly improves generation quality, with a Generated Quality Index (GQI) of 0.39. These results indicate that multimodal fusion and diffusion modeling effectively improve both the understanding of musical emotion and the quality of generated audio.
Yang Liu studied this question.
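For readers unfamiliar with the two discriminator metrics reported in the abstract, the following is a minimal sketch of how Macro-F1 and MSE are conventionally computed. It is not the study's evaluation code: the function names `macro_f1` and `mse`, the emotion labels, and the numeric values are hypothetical and chosen only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def mse(targets, predictions):
    """Mean squared error between two equal-length sequences of numbers."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical emotion labels and continuous scores, for illustration only
true_labels = ["happy", "sad", "calm", "happy"]
pred_labels = ["happy", "sad", "happy", "happy"]
print(macro_f1(true_labels, pred_labels))
print(mse([0.5, -0.5], [0.4, -0.3]))
```

Because Macro-F1 averages per-class scores without weighting by class frequency, a rarely occurring emotion class that is consistently misclassified lowers the score as much as a common one, which is why it is often reported alongside accuracy.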