Key points are not available for this paper at this time.
In this paper, we propose a multimodal setting in real-world scenarios based on weighting and meta-learning combination methods that integrate the output probabilities obtained from text and visual classifiers. While the classifier built on the concatenation of text and visual features may worsen the results, the model described in this paper can increase classification accuracy to over 6%. Typically, text or images are used in classification; however, ambiguity in either text or image may reduce the performance. This leads to combine text and image of an object or a concept in a multimodal approach to enhance the performance. In our approach, a text classifier is trained on Bag of Words and a visual classifier is trained on features extracted through a Deep Convolutional Neural Network. We created a new dataset of real-world texts and images called Ferramenta. Some of the images and related texts in this dataset contain ambiguities, which is an ideal situation to test a multimodal approach. Experimental results reported on Ferramenta and PASCAL VOC2007 datasets indicate that the combination methods described performs better in a multimodal setting.
Gallo et al. (Wed,) studied this question.