July 19, 2024

Dual-encoder-based image-text fusion algorithm


Abstract

Many sectors face the challenge of representing knowledge effectively in files that contain multiple images closely tied to text, and of enabling models to understand the relationship between those images and the text. Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) both acquire an understanding of the image-text relationship through large-scale pre-training. CLIP considers not only images and their related text but also contrasts images with large amounts of irrelevant text, which improves its ability to generalize the relationship between images and related text. BLIP strengthens its understanding of complex image-text relationships by pre-training and fine-tuning on matched image-text pairs. This paper presents an image-text fusion algorithm based on CLIP and BLIP that gives an accurate and consistent picture of image-text relevance by making full use of CLIP's image-text generalization capacity and BLIP's capacity for understanding complex image-text relationships.
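
The abstract does not state how the two models' outputs are combined, so the following is only a minimal sketch of one plausible score-level fusion, not the authors' method. It assumes that, for one image and several candidate texts, CLIP-style cosine similarities and BLIP-style image-text matching probabilities have already been computed elsewhere. The function names, the weight `alpha`, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw similarity scores into a distribution over candidate texts."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_relevance(clip_sims, blip_match_probs, alpha=0.5, temperature=0.01):
    """Blend two per-text relevance signals for a single image.

    clip_sims:        cosine similarities from a contrastive dual encoder
                      (assumed input, one value per candidate text).
    blip_match_probs: matching probabilities in [0, 1] from an image-text
                      matching head (assumed input, same order as clip_sims).
    alpha:            hypothetical weight on the contrastive signal.
    """
    if len(clip_sims) != len(blip_match_probs):
        raise ValueError("both score lists must cover the same candidate texts")
    if not clip_sims:
        raise ValueError("at least one candidate text is required")

    # Contrastive scores are only meaningful relative to each other,
    # so normalize them across the candidate texts.
    clip_dist = softmax(clip_sims, temperature)

    # Rescale the matching probabilities so both signals share one scale.
    total = sum(blip_match_probs)
    if total <= 0:
        raise ValueError("matching probabilities must not all be zero")
    blip_dist = [p / total for p in blip_match_probs]

    # Assumed fusion rule: a convex combination of the two distributions.
    return [alpha * c + (1 - alpha) * b for c, b in zip(clip_dist, blip_dist)]


# Illustrative, made-up scores for three candidate captions of one image.
fused = fuse_relevance([0.31, 0.24, 0.22], [0.90, 0.40, 0.10])
best_caption_index = fused.index(max(fused))
```

A convex combination keeps the fused scores summing to one, which makes them easy to compare across images. Other rules, such as multiplying the two signals or learning the weight, would be equally reasonable under the same assumptions.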


Cite This Study

Xia et al. studied this question.

synapsesocial.com/papers/68e5fb7ab6db64358758fa7a https://doi.org/10.1117/12.3035185
