What question did this study set out to answer?

This research aims to address the learning imbalance in multimodal learning by introducing a new multilabel objective.

April 6, 2026

Modality-Mix Learning: Promoting Multimodal Learning Through Multilabel Objective

Key Points

This research aims to address the learning imbalance in multimodal learning by introducing a new multilabel objective.
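It proposes modality-mix learning (MM learning), which generates modality-mixed samples by combining modalities of different samples with varied labels.
It introduces a bilevel learning scheme: standard learning first captures general features and selects samples with strong predictions, then MM learning is applied to the selected samples.
It transforms the single label of each modality-mixed sample into a probability vector representing multilabel information.
According to the experimental results, MM learning significantly boosts different fusion strategies and methods across diverse multimodal datasets.
The experiments also show improved robustness of multimodal networks.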
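
The mixing step and the probability-vector label can be illustrated with a minimal sketch. It is not the authors' implementation: the two-modality setup, the random pairing of samples, the function names, and the `weight_a` mixing weight are all assumptions made for illustration, since this summary does not state how the label proportions are set.

```python
import math
import random


def one_hot(label, num_classes):
    """Return a one-hot probability vector for an integer class label."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def make_modality_mixed_batch(mod_a, mod_b, labels, num_classes,
                              weight_a=0.5, seed=None):
    """Pair modality A of sample i with modality B of a randomly chosen sample j.

    Returns (mixed_a, mixed_b, targets). Each target is a probability vector
    that puts weight_a on the label of the modality-A donor and
    1 - weight_a on the label of the modality-B donor.
    weight_a is a hypothetical parameter, not taken from the source.
    """
    rng = random.Random(seed)
    partner = list(range(len(labels)))
    rng.shuffle(partner)  # partner[i] donates modality B to mixed sample i

    targets = []
    for i, j in enumerate(partner):
        y_a = one_hot(labels[i], num_classes)
        y_b = one_hot(labels[j], num_classes)
        targets.append([weight_a * p + (1.0 - weight_a) * q
                        for p, q in zip(y_a, y_b)])
    mixed_b = [mod_b[j] for j in partner]
    return list(mod_a), mixed_b, targets


def soft_cross_entropy(logits, target):
    """Cross-entropy between one soft target vector and softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_norm) for t, z in zip(target, logits))
```

When two paired samples share a label, the target collapses back to a one-hot vector, so the soft cross-entropy reduces to the standard single-label loss in that case.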

Abstract

Multimodal fusion provides a comprehensive way to understand the world by integrating data from different sources. However, some studies argue that, because of optimization imbalance, some modalities cannot be fully learned during multimodal learning. They attempt to balance the modalities by controlling their learning processes but overlook the learning objective as an essential factor. A uniform objective for all modalities prevents the network from sufficiently exploiting the discriminative information in each modality. Therefore, we propose a new multimodal learning method, namely, modality-mix learning (MM learning), which aims to promote sufficient learning of each modality via a designed multilabel objective. MM learning generates modality-mixed samples by combining modalities of different samples with varied labels, transforming the single label of a sample into a probability vector representing multilabel information. These modality-mixed samples are then fed into the network, which is trained to recognize the varying proportions of multilabel information. In addition, we introduce a bilevel learning scheme, in which the network is first trained with standard learning to capture general features and select samples with strong predictions, followed by MM learning on the selected samples to further optimize the under-explored modality. MM learning forces different objective information to be learned from different modalities, avoiding the insufficient learning of modalities caused by a uniform learning objective. The experimental results show that the method significantly boosts different fusion strategies and methods on diverse multimodal datasets and also improves the robustness of multimodal networks.
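
The sample-selection step of the bilevel scheme might look like the sketch below. The abstract says only that samples with strong prediction are selected after standard learning, so the criterion used here (softmax probability on the true label at or above a threshold), the default `threshold` of 0.8, and the function names are assumptions for illustration rather than the authors' procedure.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_strong_samples(logits_per_sample, labels, threshold=0.8):
    """Return indices whose softmax probability on the true label meets threshold.

    Both the criterion and the default threshold are assumptions for
    illustration; the source does not define "strong prediction".
    """
    selected = []
    for index, (logits, label) in enumerate(zip(logits_per_sample, labels)):
        if softmax(logits)[label] >= threshold:
            selected.append(index)
    return selected
```

Under these assumptions, the logits would come from the network after the first, standard-learning stage, and the returned indices would define the subset on which modality-mixed samples are then built.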
