Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning


Abstract

Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.
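The decision described in the abstract (attend to image regions, or fall back on the visual sentinel) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no equations, so the sketch assumes one common way to realize such a gate, namely a single softmax over the region scores plus one extra score for the sentinel, with the sentinel's share of the probability mass acting as the gate. The names `adaptive_context`, `region_scores`, `sentinel_score`, and `beta` are hypothetical, and in a trained model the scores would come from learned layers on the decoder state rather than being passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix k region feature vectors with one sentinel vector.

    Assumption (not stated in the abstract): the gate is the sentinel's
    weight in a joint softmax over k region scores plus one sentinel score.
    All inputs are hypothetical placeholders for learned quantities.
    """
    # One softmax over the k region scores followed by the sentinel score.
    weights = softmax(list(region_scores) + [sentinel_score])
    region_weights, beta = weights[:-1], weights[-1]

    # Start from the sentinel's contribution, then add the weighted regions.
    dim = len(sentinel)
    context = [beta * sentinel[d] for d in range(dim)]
    for weight, feat in zip(region_weights, region_feats):
        for d in range(dim):
            context[d] += weight * feat[d]
    return context, beta, region_weights


# Two toy image regions and a toy sentinel vector (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]

# A "visual" step: region 0 scores high and the sentinel scores low.
visual_ctx, visual_beta, _ = adaptive_context(regions, [2.0, 0.0], sentinel, -5.0)

# A "non-visual" step: the sentinel score dominates the region scores.
plain_ctx, plain_beta, _ = adaptive_context(regions, [0.0, 0.0], sentinel, 5.0)
```

Under these assumptions, a `beta` near 1 means the step relies almost entirely on the sentinel (the language-model side), while a `beta` near 0 means the context is built almost entirely from image regions. Because the region weights and `beta` come from one softmax, they always sum to 1.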

Cite This Study

Lu et al. (2017) studied this question.

synapsesocial.com/papers/69d722af8a0e2c5879bef682 https://doi.org/10.1109/cvpr.2017.345

Also Consider

Synapse has enriched 2 closely related papers on similar clinical questions. Consider them for comparative context:
