June 17, 2026 | Open Access

Optimize Multimodal Data Mixture for Pre-Training with Loss Regression

Key Points

The study aims to optimize multimodal data mixtures efficiently for training multimodal large language models.
It establishes DMPredictor, a scalable, learnable framework that treats data-mixture design as a regression-based hyperparameter-optimization problem.
DMPredictor is trained on data-mixture samples from hundreds of small proxy models (2M parameters), each trained on 1B tokens.
Alignment-aware smoothing and quality reweighting enable diverse exploration of the data-mixture space while avoiding distribution collapse.
The predicted optimal mixture achieves gains of +2.7% on MMMU, +6.4% on TextVQA, and +195.2 on MME.
Using small proxies and a small number of tokens reduces the complexity of mixture optimization.
DMPredictor's predicted mixture outperforms human-designed baselines on diverse benchmarks.

Abstract

The mixture of multimodal training data significantly affects the performance of multimodal large language models, yet manually tuning data mixtures is inefficient, computationally expensive, and frequently suboptimal because of complex, nonlinear inter-modal interactions. Determining data-mixture hyperparameters in an efficient and principled manner has therefore become a bottleneck for progress in the field. This study establishes a scalable, learnable framework, DMPredictor, that treats multimodal data-mixture design as a regression-based hyperparameter-optimization problem and automates the selection of effective training data mixtures. DMPredictor is trained on data-mixture samples derived from hundreds of small proxy models (2M parameters), each trained on 1B tokens sampled under a different data mixture. The framework incorporates alignment-aware smoothing and quality reweighting, enabling diverse exploration of the multimodal data-mixture space while avoiding distribution collapse. DMPredictor produces accurate performance forecasts and identifies near-optimal data mixtures. The predicted optimal mixture surpasses human-designed baselines on diverse benchmarks, achieving +2.7% on MMMU, +6.4% on TextVQA, and +195.2 on MME. Moreover, using small proxies and a small number of tokens substantially reduces the complexity of mixture optimization. The proposed approach offers a robust, computationally efficient pathway for optimizing mixtures of multimodal training data, addressing the critical challenge of training-data heterogeneity.
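
The general idea of loss regression over data mixtures can be illustrated with a minimal sketch. This is not DMPredictor itself: the abstract does not specify the regressor, the sampling scheme, or the details of alignment-aware smoothing and quality reweighting, so none of those are reproduced here. The sketch assumes three hypothetical data domains, uniform Dirichlet sampling of mixtures, a synthetic `proxy_loss` function standing in for "train a small proxy model and measure its loss", and a quadratic ridge regressor chosen purely for illustration.

```python
import random

# Hypothetical domain names; the source does not list the actual data sources.
DOMAINS = ["caption", "ocr", "text"]

def sample_mixture(rng, k, alpha=1.0):
    """Draw mixture weights (non-negative, summing to 1) from a symmetric Dirichlet."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def proxy_loss(w, rng=None):
    """Synthetic stand-in for training a proxy model on mixture w (assumption).

    The loss is lowest at an arbitrary, made-up mixture [0.5, 0.3, 0.2].
    """
    target = [0.5, 0.3, 0.2]
    loss = 2.0 + sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    if rng is not None:
        loss += rng.gauss(0.0, 0.01)  # simulated run-to-run noise
    return loss

def features(w):
    """Quadratic features: the weights and their squares."""
    return list(w) + [wi * wi for wi in w]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ridge(xs, ys, lam=1e-6):
    """Fit ridge regression through the normal equations (X^T X + lam I) c = X^T y."""
    d = len(xs[0])
    a = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(a, b)

def predict(coef, w):
    """Predicted loss for mixture w under the fitted regressor."""
    return sum(c * f for c, f in zip(coef, features(w)))

rng = random.Random(0)

# Step 1: sample mixtures and record the (simulated) proxy loss for each.
train_mixtures = [sample_mixture(rng, len(DOMAINS)) for _ in range(200)]
train_losses = [proxy_loss(w, rng) for w in train_mixtures]

# Step 2: fit a regressor that maps mixture weights to loss.
coef = fit_ridge([features(w) for w in train_mixtures], train_losses)

# Step 3: search a large candidate pool using only the cheap regressor.
candidates = [sample_mixture(rng, len(DOMAINS)) for _ in range(5000)]
best = min(candidates, key=lambda w: predict(coef, w))

uniform = [1.0 / len(DOMAINS)] * len(DOMAINS)
print("predicted-best mixture:", dict(zip(DOMAINS, (round(w, 3) for w in best))))
print("noise-free loss at best:", round(proxy_loss(best), 4))
print("noise-free loss at uniform:", round(proxy_loss(uniform), 4))
```

The point of the design is that the expensive step (training proxies) runs only on the sampled mixtures, while the search over thousands of candidate mixtures queries only the fitted regressor. In a real setting, `proxy_loss` would be replaced by actual proxy training runs, and the regressor and sampler would follow the paper rather than the illustrative choices made here.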
