Traditional relation extraction methods usually rely on text alone, although information from other modalities such as images and video can improve text relation extraction. To address the heterogeneity between multi-modal data, MRECL, a multi-modal relation extraction model based on contrastive learning, is proposed. First, feature representations of the image and the text are obtained with ViT and BERT models. Second, contrastive learning is introduced to align the image and text features, reducing modality heterogeneity before fusion. Next, the image serves as prefix guidance information and is fused with the text information through an attention mechanism. Finally, a Softmax layer computes the probability of each relation class and outputs the predicted relation. Experimental results show that MRECL reaches precision, recall, and F1 values of 83.23%, 82.19%, and 82.70%, respectively, on the MNRE dataset, outperforming the current SOTA model.
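The abstract does not give the exact form of the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the use of plain Python lists in place of ViT and BERT outputs are all assumptions made for illustration, not details taken from MRECL.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE loss. Row i of image_feats and
    # row i of text_feats are assumed to come from the same image-text pair;
    # every other row in the batch acts as a negative.
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should pick out its own text.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings standing in for ViT and BERT outputs (assumed values).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image embedding matches its own text and large when the pairs are swapped. Minimizing such a loss pulls paired image and text features together before the attention-based fusion step.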
Zhao et al. studied this question.