Los puntos clave no están disponibles para este artículo en este momento.
This paper presents Herrmann-1 1 , a multimodal framework to generate background music tailored to movie scenes, by integrating state-of-the-art vision, language, music, and speech processing models. Our pipeline begins by extracting visual and speech information from a movie scene, performing emotional analysis on it, and converting these into descriptive texts. Then, GPT-4 translates these high-level descriptions into low-level music conditions. Finally, these text-based music conditions guide a text-to-music model to generate music that resonates with input movie scenes. Comprehensive objective and subjective evaluations attest to the high synthesis quality, congruence, and superiority of our pipeline.
Haseeb et al. (Mon,) studied this question.