Key points are not available for this paper at this time.
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Building similarity graph...
Analyzing shared references across papers
Loading...
Junhua Mao
University of Science and Technology of China
Wei Xu
Southwest University
Yi Yang
South China Agricultural University
Building similarity graph...
Analyzing shared references across papers
Loading...
Mao et al. (Sat,) studied this question.
synapsesocial.com/papers/6a0b3948393ef274532e19c1 — DOI: https://doi.org/10.48550/arxiv.1410.1090