Existing Japanese multimodal semantic understanding methods often rely on simple feature concatenation, which limits their ability to handle complex semantic phenomena such as honorifics and style. To overcome these limitations, this paper proposes a fusion architecture integrating a Transformer backbone with multi-task Bayesian federated learning. First, a task relevance matrix is constructed to dynamically assign edge weights, enabling effective cross-modal knowledge transfer. Second, a hierarchical federated learning mechanism is introduced, in which multi-task models are trained locally on clients and then aggregated globally through a parameter fusion strategy. Finally, adaptive learning rate adjustment and early stopping are employed to optimize convergence, while a cross-modal contrastive loss function based on Kullback–Leibler (KL) divergence enhances the accuracy of semantic alignment. Experimental results demonstrate that the proposed model achieves F1-scores of 0.89, 0.85, and 0.91 in honorific recognition, style classification, and sentiment analysis, respectively, significantly outperforming the MAG baseline. In low-resource scenarios with only 10% of the training data, the model achieves an AUC improvement of 0.17 over Federated Natural Language Processing (FedNLP). Under high-noise conditions (noise intensity = 0.6), the proposed method achieves an accuracy 13–18 percentage points higher than FedNLP, effectively addressing the challenges of multimodal semantic understanding under complex linguistic features.
Shao et al. studied this question.
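As a rough illustration of two mechanisms named in the abstract, the sketch below shows one plausible form of a KL-divergence-based cross-modal alignment term and of global aggregation of client parameters. The abstract does not specify either formula, so everything here is an assumption: the symmetric KL form, the sample-count weighting (a FedAvg-style average swapped in for the unspecified parameter fusion strategy), and all function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl_alignment(text_logits, other_logits):
    """Hypothetical alignment term: symmetric KL between the softmax
    distributions that two modalities produce for the same input."""
    p = softmax(text_logits)
    q = softmax(other_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def weighted_average(client_params, client_sizes):
    """FedAvg-style aggregation (an assumption, not the paper's fusion rule):
    average each parameter across clients, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


if __name__ == "__main__":
    # Identical logits give zero alignment loss; disagreeing logits give a positive one.
    print(symmetric_kl_alignment([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(symmetric_kl_alignment([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    # Two clients with flat parameter vectors; the second holds three times as much data.
    print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In this sketch the alignment term is zero only when both modalities assign the same distribution, and the aggregated parameters lean toward clients with more local data. A Bayesian or hierarchical variant, as the abstract describes, would replace the plain weighted mean with whatever fusion rule the paper defines.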