June 10, 2024 | Open Access

Multi Modal Fusion for Video Retrieval based on CLIP Guide Feature Alignment


Abstract

With the rise of short-video platforms, a large amount of video data is generated daily. These videos vary in quality and are not well tagged. How to fully utilize the multimodal information in videos, bridge the differences between modalities, and achieve precise retrieval is a major challenge currently facing the field of video retrieval. This paper presents a novel approach to multimodal video retrieval that aims to boost search precision by incorporating visual, textual, and audio information through the CLIP model and T5. To tackle the problem of retrieving pertinent content from extensive, untagged video repositories, we propose a method that fuses multimodal data through innovative feature extraction and alignment techniques. Our method achieves performance close to the current state of the art on the MSR-VTT benchmark, demonstrating its effectiveness in improving search accuracy.
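The abstract does not describe the fusion or alignment steps in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each video already has visual, audio, and text embeddings projected into one shared space (as a CLIP-style encoder might provide), fuses them by a weighted average, ranks videos by cosine similarity to a text query, and scores the ranking with Recall@K, the metric commonly reported on MSR-VTT. All function names, the averaging scheme, and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(modal_embeddings, weights=None):
    """Hypothetical late fusion: weighted average of unit-length modality vectors."""
    if weights is None:
        weights = [1.0] * len(modal_embeddings)
    fused = [0.0] * len(modal_embeddings[0])
    for emb, w in zip(modal_embeddings, weights):
        for i, x in enumerate(l2_normalize(emb)):
            fused[i] += w * x
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rank_videos(query_emb, video_embs):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine(query_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(query_embs, video_embs, ground_truth, k):
    """Fraction of queries whose correct video appears in the top k results."""
    hits = sum(
        1
        for q, gt in zip(query_embs, ground_truth)
        if gt in rank_videos(q, video_embs)[:k]
    )
    return hits / len(query_embs)


# Toy data (made up): three videos, each with [visual, audio, text] embeddings.
videos = [
    fuse([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]),
    fuse([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),
    fuse([[0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]),
]
# One text query per video; ground_truth[i] is the index of the matching video.
queries = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
ground_truth = [0, 1, 2]

print("Recall@1:", recall_at_k(queries, videos, ground_truth, k=1))
```

In a real system the per-modality vectors would come from trained encoders, and the fusion weights would typically be learned rather than fixed; the uniform average here is just the simplest choice that keeps the example self-contained.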


Cite This Study

Wu et al. (2024) studied this question.

synapsesocial.com/papers/68e65761b6db6435875e5d14 https://doi.org/10.1145/3664524.3675369

