December 3, 2025Open Access

Method of graph fusion of multimodal data using multitask learning

Key Points

Improved performance in multi-label classification was achieved using multimodal fusion techniques.
Experimental results showed significant advantages over baseline models through effective feature integration.
Developed a hierarchical multimodal fusion network that adjusts loss functions dynamically for better classification outcomes.
Findings highlight the flexibility and modularity of the neural network architectures utilized in the proposed model.

Abstract

In real-world conditions, data typically contain multiple modalities and may have non-exclusive labels. A key stage of multimodal learning is the process of multimodal fusion, as it enables the integration of features from different sources into a unified vector space. This allows the classifier to utilize the constructed integrated vector to produce the final prediction. At the same time, traditional multimodal fusion methods rarely take into account cross-modal interactions, which play an essential role in uncovering dependencies between modalities and in constructing a shared space of their integrated representation. In this paper, we propose a conceptual framework for multimodal fusion with the use of multi-task learning. It is aimed at modeling a joint integrated representation space for all cross-modal interactions and adaptively tuning the loss functions of individual tasks in order to achieve optimal performance. The developed model employs a novel hierarchical multimodal fusion network that captures cross-modal interactions across all modality combinations and dynamically allocates weight coefficients for each pair depending on the specific data sample. In addition, a new multi-task learning approach is introduced to address multi-label classification challenges by automatically adjusting the training process both at the task level and at the sample level. Experimental results demonstrate that the proposed conceptual framework outperforms baseline models as well as several state-of-the-art methods. Furthermore, the flexibility and modularity of the proposed components of multimodal fusion and dynamic multi-task learning are showcased, making them applicable to various types of neural network architectures.

Read Full Paperexternally

Demander à l'IA

Bookmark

View Full Paper