An intuitive way to search for images is to use queries composed of an image and a complementary text. While the first provides rich context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image. Current approaches typically combine the features of the two elements of the query into a single representation, which can be compared to the ones of the potential target images. Our work aims to shed new light on the task by looking at it through the prism of two related frameworks: text-to-image and image-to-image retrieval. Taking inspiration from them, we exploit the specific relation of each query element with the target image and derive light-weight attention mechanisms that mediate between the two complementary modalities. We validate our approach on several retrieval benchmarks, querying with images and their free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy training or large architectures as in previous works.
Delmas et al. studied this question.
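To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a fused single-representation score with a score that treats the query as a text-to-image match plus an image-to-image match, both modulated by the text. Everything specific here is an assumption made for illustration: the function names (`fused_score`, `two_view_score`, `gate`, `rank`), the element-wise sum used for fusion, the sigmoid gate standing in for a learned attention module, and the plain sum of the two scores. Features are toy lists of floats rather than encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gate(text_feat):
    """Hypothetical stand-in for a learned attention module:
    an element-wise sigmoid of the text features (no trained weights)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in text_feat]


def fused_score(ref_img, text, target_img):
    """Baseline style: merge both query elements into one vector
    (assumed here to be an element-wise sum) and compare it to the target."""
    query = [r + t for r, t in zip(ref_img, text)]
    return cosine(query, target_img)


def two_view_score(ref_img, text, target_img):
    """Two-view style: score the target against each query element separately,
    letting the text decide which feature dimensions matter."""
    attn = gate(text)
    gated_target = [a * x for a, x in zip(attn, target_img)]
    gated_ref = [a * x for a, x in zip(attn, ref_img)]
    text_to_image = cosine(text, gated_target)        # does the target show what the text asks for?
    image_to_image = cosine(gated_ref, gated_target)  # does the target keep what the reference shows?
    return text_to_image + image_to_image             # assumed combination: plain sum


def rank(ref_img, text, gallery, score_fn):
    """Return gallery keys sorted from best to worst match."""
    return sorted(
        gallery,
        key=lambda name: score_fn(ref_img, text, gallery[name]),
        reverse=True,
    )


if __name__ == "__main__":
    reference = [1.0, 0.0, 0.0]   # toy feature of the example image
    modifier = [0.0, 1.0, 0.0]    # toy feature of the text modifier
    gallery = {"a": [1.0, 1.0, 0.0], "b": [0.0, 0.0, 1.0]}
    print(rank(reference, modifier, gallery, fused_score))
    print(rank(reference, modifier, gallery, two_view_score))
```

In this toy setup both scoring functions place candidate `a` first, since it shares a dimension with the reference image and one with the text. The sketch only shows the structure: the second function keeps the two query elements separate instead of merging them before comparison.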