October 4, 2014Open Access

Explain Images with Multimodal Recurrent Neural Networks

Key Points

Key points are not available for this paper at this time.

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Junhua Mao

University of Science and Technology of China

Wei Xu

Southwest University

Yi Yang

South China Agricultural University

Actions

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Explain Images with Multimodal Recurrent Neural Networks

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Actions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study