Existing attention models for image captioning typically extract only word-level attention information; that is, the attention mechanism extracts local attention information from the image to generate the current word. We propose an image captioning approach based on self-attention that uses image features more effectively. The self-attention mechanism extracts sentence-level attention information, which carries a richer visual representation of the image. Furthermore, we propose a double attention model that combines sentence-level and word-level attention information to better simulate the human perception system. We apply supervision and optimization at the intermediate stage of the model to address overfitting and information interference, and we apply reinforcement learning in two-stage training to optimize the model's evaluation metrics. Finally, we evaluate our model on the MSCOCO dataset. The experimental results show that our approach generates more accurate and richer captions and outperforms many state-of-the-art image captioning approaches on various evaluation metrics.
Wei et al. studied this question.
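The distinction between sentence-level and word-level attention can be illustrated with a minimal sketch. This is not the authors' architecture, which the abstract does not specify. It assumes that the image is represented as a list of region feature vectors, that sentence-level attention is plain scaled dot-product self-attention over those regions followed by mean pooling, that word-level attention scores each region against the decoder's current hidden state, and that the two contexts are combined by concatenation. Learned projection matrices are omitted, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def sentence_level_attention(regions):
    """Self-attention: every region attends to every region.

    Assumption: queries, keys, and values are the raw region features
    (no learned projections), and the attended regions are mean-pooled
    into one global context vector.
    """
    scale = math.sqrt(len(regions[0]))
    attended = []
    for query in regions:
        weights = softmax([dot(query, key) / scale for key in regions])
        attended.append(weighted_sum(weights, regions))
    dim = len(attended[0])
    return [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]


def word_level_attention(hidden, regions):
    """Attention conditioned on the decoder state for the current word."""
    weights = softmax([dot(hidden, region) for region in regions])
    return weighted_sum(weights, regions), weights


def double_attention(hidden, regions):
    """Assumption: the two contexts are combined by concatenation."""
    sentence_ctx = sentence_level_attention(regions)
    word_ctx, weights = word_level_attention(hidden, regions)
    return sentence_ctx + word_ctx, weights


# Four toy region features of dimension 3 and one toy decoder state.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
hidden = [0.0, 2.0, 0.0]
context, word_weights = double_attention(hidden, regions)
```

In this sketch the sentence-level context does not depend on the decoder state, so it could be computed once per image, while the word-level context is recomputed at every decoding step. Whether the original model shares this property is not stated in the abstract.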