This study presents a streamlined pipeline for generating a musical soundtrack from an image. The pipeline first extracts the relevant features from the image: its context, expressed as a text caption, and the sentiment it portrays. Meta's music generation model, MusicGen, then generates a soundtrack from these extracted features. The study also introduces a new "pace" model, trained on a manually annotated dataset of around 8,000 images, which extracts the "pace" or "tempo" of the soundtrack from the image. According to a survey we conducted, the pace model proved to be a necessary and impactful addition to the pipeline. We discuss the methods used, along with the underlying architectures, and demonstrate their effectiveness in creating a suitable auditory experience. The pipeline can also generate a slideshow with music from a set of similar input images. The experiment highlights potential applications in multimedia, immersive storytelling, and more.
Vasireddy et al. studied this question.
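The abstract does not say how the caption, sentiment, and pace are combined before they reach MusicGen. As a minimal sketch, assuming the three features are simply merged into one text prompt, the step could look like the following. The `ImageFeatures` class, the `build_music_prompt` function, the prompt template, and the example feature values are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ImageFeatures:
    """Hypothetical container for the three features the abstract names."""
    caption: str    # text caption describing the image content
    sentiment: str  # sentiment label, e.g. "calm" or "tense" (assumed label set)
    pace: str       # pace label, e.g. "slow" or "fast" (assumed label set)


def build_music_prompt(features: ImageFeatures) -> str:
    """Merge the extracted features into a single text prompt.

    The template below is an illustrative assumption, not the authors' wording.
    """
    # Drop surrounding whitespace and a trailing period so the sentence reads cleanly.
    caption = features.caption.strip().rstrip(".")
    return (
        f"A {features.pace}-paced, {features.sentiment} soundtrack "
        f"for a scene showing {caption}."
    )


example = ImageFeatures(
    caption="a lighthouse on a stormy coast.",
    sentiment="tense",
    pace="fast",
)
print(build_music_prompt(example))
```

In a setup like this, the resulting string would be passed as the text condition to a text-to-music model such as MusicGen. Keeping prompt assembly in its own function makes it easy to compare prompts built with and without the pace term.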