What does this research mean for the field?

The proposed multimodal semantic understanding model significantly outperforms the MAG baseline in honorific recognition, style classification, and sentiment analysis, achieving F1-scores of 0.89, 0.85, and 0.91, respectively.

What question did this study set out to answer?

The aim is to enhance Japanese multimodal semantic understanding, specifically addressing complex phenomena like honorifics.

March 13, 2026 | Open Access

Japanese multimodal semantic understanding model based on transformer architecture and multi-task Bayesian federated learning

Key Points

The aim is to enhance Japanese multimodal semantic understanding, specifically addressing complex phenomena like honorifics.
Developed a fusion architecture combining a Transformer backbone with multi-task Bayesian federated learning.
Constructed a task relevance matrix for dynamic edge weight assignment to facilitate knowledge transfer.
Introduced a hierarchical federated learning mechanism for local model training and global aggregation.
Implemented adaptive learning rate adjustments and early stopping for optimization.
Utilized a cross-modal contrastive loss function based on Kullback-Leibler divergence.
Achieved F1-scores of 0.89, 0.85, and 0.91 in honorific recognition, style classification, and sentiment analysis respectively.
In low-resource conditions, with only 10% of the training data, improved AUC by 0.17 over FedNLP.
Under high-noise settings, achieved an accuracy improvement of 13–18 percentage points over FedNLP.
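
The local-training and global-aggregation step listed above can be illustrated with a minimal sketch. This summary does not specify the paper's parameter fusion strategy, so the sample-size-weighted average below (in the style of FedAvg) is an assumption, and the names `aggregate`, `client_params`, and `client_sizes` are hypothetical.

```python
from typing import Dict, List

# A model is represented here as a mapping from parameter name to a flat list of floats.
Params = Dict[str, List[float]]


def aggregate(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Fuse client models with a sample-size-weighted average (FedAvg-style).

    Assumption: this stands in for the paper's parameter fusion strategy,
    which the summary does not describe.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    fused: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        # Each coordinate is averaged across clients, weighted by local sample count.
        fused[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return fused


# Two hypothetical clients; the second holds three times as much data.
clients = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
]
global_params = aggregate(clients, client_sizes=[1, 3])
```

In this toy run the fused weights sit closer to the second client's values because it contributes more samples. A hierarchical variant would apply the same averaging once per group of clients and again across groups.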

Abstract

Existing Japanese multimodal semantic understanding methods often rely on simple feature concatenation, which limits their ability to handle complex semantic phenomena such as honorifics and style. To overcome these limitations, this paper proposes a fusion architecture integrating a Transformer backbone with multi-task Bayesian federated learning. First, a task relevance matrix is constructed to dynamically assign edge weights, enabling effective cross-modal knowledge transfer. Second, a hierarchical federated learning mechanism is introduced, in which multi-task models are trained locally on clients and then aggregated globally through a parameter fusion strategy. Finally, adaptive learning rate adjustment and early stopping are employed to optimize convergence, while a cross-modal contrastive loss function based on Kullback–Leibler (KL) divergence enhances the accuracy of semantic alignment. Experimental results demonstrate that the proposed model achieves F1-scores of 0.89, 0.85, and 0.91 in honorific recognition, style classification, and sentiment analysis, respectively, significantly outperforming the MAG baseline. In low-resource scenarios with only 10% of the training data, the model achieves an AUC improvement of 0.17 over Federated Natural Language Processing (FedNLP). Under high-noise conditions (noise intensity = 0.6), the proposed method achieves an accuracy 13–18 percentage points higher than FedNLP, effectively addressing the challenges of multimodal semantic understanding under complex linguistic features.
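
To make the KL-divergence idea in the abstract concrete, the sketch below computes a symmetric KL alignment term between a text representation and an image representation. The abstract does not give the exact form of the loss, so the softmax normalization, the symmetric averaging, the `temperature` parameter, and all function names here are assumptions. The sketch covers only the alignment term and omits the negative pairs that a full contrastive loss would use.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl_alignment(
    text_logits: List[float], image_logits: List[float], temperature: float = 1.0
) -> float:
    """Average of KL(p || q) and KL(q || p); zero when the two modalities agree."""
    p = softmax(text_logits, temperature)
    q = softmax(image_logits, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical feature scores for one text/image pair.
same = symmetric_kl_alignment([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = symmetric_kl_alignment([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

The term is zero when the two distributions match and grows as they diverge, which is the property a semantic alignment objective relies on.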

Cite This Study

Shao et al. studied this question.

synapsesocial.com/papers/69b3ab0002a1e69014ccba1c https://doi.org/10.1007/s42452-026-08539-8
