November 28, 2025

A Multimodal Deep Learning Model for Optimizing Music Emotion Recognition Through Temporal and Semantic Feature Integration

Abstract

Music emotions are subjective and diverse, which leads to inconsistent labeling. Traditional methods are limited in how well they handle audio time-frequency features and long-term dependencies. Current music emotion recognition (MER) work focuses on a single modality, either audio or lyrics text, and therefore cannot consider signal features and semantic features together, which results in low recognition accuracy. This paper uses deep learning (DL) to improve the accuracy of MER. A hybrid model architecture that combines a convolutional neural network (CNN) with a bidirectional long short-term memory network (Bi-LSTM) is built on multimodal information in order to assess the emotional qualities of music from several perspectives. First, high-level multimodal features are extracted from the audio and the lyrics text, and their temporal information is modeled; then, a multi-scale convolution kernel strategy is used to improve the model's ability to capture frequency-domain features; finally, an attention mechanism dynamically weights the key time steps to enhance the accuracy of MER. According to the experimental data, the accuracy of the proposed model is higher than that of complex multi-scale CNN models by 0.92%, 8.36%, 1.92%, and 8.16%, respectively. Combining DL, time-series modeling, and multimodal feature extraction can improve the accuracy of MER and provides a new reference for the emotional analysis of music.
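
The attention step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes a simple dot-product scoring against a query vector followed by a softmax over time steps. The names `attention_pool`, `hidden_states`, and `query` are hypothetical and used only for illustration; in the described architecture, the hidden states would come from the Bi-LSTM and the query would be a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by its relevance to `query` and sum.

    hidden_states: list of T vectors (one per time step), each of length D.
    query: vector of length D (assumed here; a learned parameter in practice).
    Returns (context, weights): the weighted sum of the hidden states
    and the attention weight assigned to each time step.
    """
    # Dot-product score for every time step (an assumed scoring function).
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum across time steps.
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


if __name__ == "__main__":
    # Three time steps with two features each; the third step scores highest.
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    context, weights = attention_pool(states, query=[1.0, 1.0])
    print(weights)
    print(context)
```

Time steps whose hidden state aligns more closely with the query receive larger weights, so they contribute more to the pooled vector that a classifier would then consume. This is one common way to realize "dynamically weighting key time steps", not necessarily the one used in the paper.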
