August 13, 2024

Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning

Abstract

Data now comes in many modalities, such as text, images, audio, and video. This multimodal data carries rich information, but it also raises the central question of multimodal classification: how can data from different modalities be used effectively for accurate classification? Because modalities differ in their characteristics and structure, fusing them effectively is challenging. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC), which better integrates information from different modalities. On the one hand, existing multimodal fusion methods ignore intra- and inter-modality relationships and leave some information in the modalities unexploited, resulting in unsatisfactory classification performance. To address this insufficient interaction of modality information, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities, which improves the representational power of the model. On the other hand, information quality differs among modalities, and some modalities may carry misleading or ambiguous information. Treating every modality equally can introduce modality perceptual noise, which reduces multimodal classification performance. We therefore use modality-collaborative learning to filter misleading information, alleviate the quality gap among modalities, align modality information with the high-quality, effective modalities, enhance unimodal information, and obtain better fused multimodal information, which improves the model's discriminative ability.
Comparative experiments on two benchmark datasets for image-text classification, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods and even state-of-the-art (SOTA) multimodal classification methods. Ablation experiments verify the effectiveness of the cross-attention module, the multimodal contextual attention network, and modality-collaborative learning. Hyper-parameter validation experiments show that different fusion calculation methods lead to different results, and they identify the most effective feature tensor calculation method. We also conducted qualitative experiments: compared with the original model, our proposed model identifies the expected results in the vast majority of cases. The code is available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. The CrisisMMD dataset is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC Food-101 dataset is available at https://visiir.isir.upmc.fr/.
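
To make the cross-attention idea in the abstract concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between text and image features, in which one modality supplies the queries and the other supplies the keys and values. It is not the authors' implementation: the function names, feature dimensions, random inputs, and the choice to share projection matrices across both directions are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic scaled dot-product cross-attention (illustrative, hypothetical helper).

    queries: (n_q, d_model) features of the attending modality
    context: (n_c, d_model) features of the attended modality
    Returns the attended features (n_q, d_head) and the weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy inputs: all sizes and values are assumptions, not taken from the paper.
rng = np.random.default_rng(0)
d_model, d_head = 16, 8
text_tokens = rng.normal(size=(5, d_model))    # 5 text token features
image_patches = rng.normal(size=(9, d_model))  # 9 image patch features
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

# Text attends to image, and image attends to text.
# Sharing the projections across both directions is a simplification.
text_ctx, text_to_image = cross_attention(text_tokens, image_patches, w_q, w_k, w_v)
image_ctx, image_to_text = cross_attention(image_patches, text_tokens, w_q, w_k, w_v)
```

Each row of `text_to_image` is a probability distribution over image patches for one text token, so a token can draw on the image regions most relevant to it. A trained model would learn separate projections per direction and typically use several attention heads.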
