Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but accurate transformer-based model via distillation and re-ranking. Finally, we validate our approach on the Flickr30K image dataset, where we show an increase in inference speed by several orders of magnitude while having results competitive to the state of the art. We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
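The Fast-then-Slow re-ranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: an exhaustive dot product stands in for approximate nearest neighbour search, `slow_score` is a hypothetical placeholder for a cross-attention model, the toy vectors and scores are invented for illustration, and the distillation step is not shown.

```python
from typing import Callable, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fast_then_slow(
    query_vec: Sequence[float],
    item_vecs: List[Sequence[float]],
    slow_score: Callable[[int], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Shortlist k items with a cheap score, then re-rank them with a costly one."""
    # Fast stage: score every item against the query embedding.
    # (Assumption: brute force here; a real system would use an ANN index.)
    fast_scores = [(i, dot(query_vec, v)) for i, v in enumerate(item_vecs)]
    shortlist = sorted(fast_scores, key=lambda p: p[1], reverse=True)[:k]

    # Slow stage: the expensive scorer runs only on the k shortlisted items.
    reranked = [(i, slow_score(i)) for i, _ in shortlist]
    return sorted(reranked, key=lambda p: p[1], reverse=True)


# Toy data (hypothetical): four precomputed item embeddings and one query.
item_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
query_vec = [1.0, 0.0]

# Hypothetical stand-in for the slow model's scores, keyed by item index.
toy_slow = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

result = fast_then_slow(query_vec, item_vecs, lambda i: toy_slow[i], k=2)
# result == [(1, 0.9), (0, 0.2)]
```

In this toy run the slow scorer is called on only two of the four items, which is where the speed-up comes from. It also shows the trade-off: item 2 has the highest slow score but is never re-ranked, because the fast stage did not shortlist it.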
Miech et al. studied this question.