The mixture of multimodal training data significantly affects the performance of multimodal large language models, yet manually tuning data mixtures is inefficient, computationally expensive, and frequently suboptimal because of complex, nonlinear inter-modal interactions. Determining data-mixture hyperparameters in an efficient and principled manner has therefore become a bottleneck for progress in the field. This study establishes DMPredictor, a scalable, learnable framework that treats multimodal data-mixture design as a regression-based hyperparameter-optimization problem and automates the selection of effective training data mixtures. DMPredictor is trained on data-mixture samples derived from hundreds of small proxy models (2M parameters), each trained on 1B tokens sampled under a different data mixture. The framework incorporates alignment-aware smoothing and quality reweighting, enabling diverse exploration of the multimodal data-mixture space while avoiding distribution collapse. DMPredictor produces accurate performance forecasts and identifies near-optimal data mixtures. The predicted optimal mixture surpasses human-designed baselines on diverse benchmarks, achieving +2.7% on MMMU, +6.4% on TextVQA, and +195.2 on MME. Moreover, relying on small proxy models and a small token budget substantially reduces the cost of mixture optimization. The proposed approach offers a robust, computationally efficient pathway for optimizing mixtures of multimodal training data, addressing the critical challenge of training-data heterogeneity.
Shang et al. (Mon,) studied this question.
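The regression-based view of mixture selection can be illustrated with a minimal sketch. It is not the DMPredictor implementation: the abstract does not specify the regressor, the features, or the search procedure, so the quadratic ridge model, the three-source mixture, the Dirichlet sampling, and the function `toy_proxy_score` (a synthetic stand-in for training a proxy model and evaluating it) are all assumptions made purely for illustration. Alignment-aware smoothing and quality reweighting are not modeled here.

```python
import numpy as np

def quadratic_features(w):
    """Map mixture weights to [1, w_i, w_i * w_j (i <= j)] features,
    so a linear model can capture simple pairwise interactions."""
    w = np.atleast_2d(w)
    n, k = w.shape
    iu, ju = np.triu_indices(k)
    pairwise = w[:, iu] * w[:, ju]
    return np.hstack([np.ones((n, 1)), w, pairwise])

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression: solve (X^T X + lam * I) theta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict(theta, w):
    """Predicted benchmark score for one or more mixtures."""
    return quadratic_features(w) @ theta

def toy_proxy_score(w, rng):
    """HYPOTHETICAL stand-in for 'train a small proxy model on mixture w
    and measure its score'. The peak location is arbitrary, not real data."""
    target = np.array([0.5, 0.3, 0.2])
    noise = rng.normal(0.0, 0.01, size=w.shape[0])
    return 1.0 - np.sum((w - target) ** 2, axis=-1) + noise

rng = np.random.default_rng(0)

# 1) Sample mixtures on the simplex and "evaluate" one proxy per mixture.
train_w = rng.dirichlet(np.ones(3), size=200)
train_y = toy_proxy_score(train_w, rng)

# 2) Fit the performance predictor on (mixture, score) pairs.
theta = fit_ridge(quadratic_features(train_w), train_y)

# 3) Search many candidate mixtures and keep the best predicted one.
candidates = rng.dirichlet(np.ones(3), size=20000)
best_w = candidates[np.argmax(predict(theta, candidates))]
```

The point of the design is that step 3 costs only predictor evaluations, so the candidate pool can be far larger than the number of proxy models actually trained in step 1.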