Multimodal fusion is susceptible to modality imbalance, in which dominant modalities overshadow weak ones, leading to biased learning and suboptimal fusion, especially under incomplete-modality conditions. To address this problem, we introduce an Equilibrium Deviation Metric (EDM) that quantifies this imbalance, and we verify both theoretically and empirically that the order in which modalities are optimized plays a critical role in approaching equilibrium. In particular, we demonstrate that an EDM-ranked weak-to-strong schedule achieves the tightest convergence bound among all possible ordering strategies. Leveraging these insights, we design an alternating strategy that dynamically prioritizes under-optimized modalities, together with a modality-mapping layer for feature alignment and a memory module for information filtering and inheritance. Our framework is compatible with both conventional and MLLM-based backbones. It achieves new state-of-the-art (SOTA) results on four benchmarks (e.g., +3.36% on CREMA-D, +3.51% on Kinetics-400) and remains robust under missing-modality conditions. These findings highlight the value of modality scheduling, offering a principled alternative to conventional joint training.
Shi et al. studied this question.
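The weak-to-strong scheduling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the EDM, so the score below (each modality's gap to the best unimodal validation accuracy) is only an illustrative stand-in, not the paper's metric. The function names `proxy_deviation` and `weak_to_strong_schedule` and the accuracy values are hypothetical.

```python
from typing import Dict, List


def proxy_deviation(unimodal_acc: Dict[str, float]) -> Dict[str, float]:
    """Gap between each modality's accuracy and the best modality's accuracy.

    ASSUMPTION: this is a placeholder imbalance score, NOT the paper's EDM.
    A larger gap means the modality is more under-optimized.
    """
    best = max(unimodal_acc.values())
    return {name: best - acc for name, acc in unimodal_acc.items()}


def weak_to_strong_schedule(unimodal_acc: Dict[str, float]) -> List[str]:
    """Order modalities from most to least under-optimized (weak to strong)."""
    deviation = proxy_deviation(unimodal_acc)
    # Largest deviation first, so the weakest modality is optimized first.
    return sorted(deviation, key=lambda name: deviation[name], reverse=True)


# Hypothetical per-modality validation accuracies, for illustration only.
unimodal_acc = {"audio": 0.62, "video": 0.41, "text": 0.55}
schedule = weak_to_strong_schedule(unimodal_acc)
print(schedule)  # ['video', 'text', 'audio']
```

In an alternating training loop, such a schedule would typically be recomputed at intervals (for example, once per epoch) so that the ordering tracks whichever modality is currently lagging; how often the paper's method re-ranks is not stated in the abstract.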