Iterative self-training of language models is a promising route to self-improving artificial intelligence systems, but it is often hindered by the fundamental challenge of “model collapse.” Existing research indicates that models suffer catastrophic performance degradation and diversity collapse when recursively trained on their own increasingly homogenized synthetic data. Although some data-selection approaches attempt to mitigate this issue by enhancing diversity, they predominantly rely on static strategies and lack a feedback mechanism that adapts in real time to the evolving model state and data distribution. To address this limitation, we propose a dynamic data selection framework called “DCES” (dynamic center-edge sampling). We conducted extensive experiments on iterative self-training tasks across multiple model architectures. The results show that our framework significantly outperforms baselines in perplexity (PPL) and loss across various models and test sets. The framework also mitigates the degradation of Expected Calibration Error (ECE) and entropy metrics, preventing mode collapse. Our findings indicate that an adaptive system that curates data based on training feedback is pivotal for maintaining the dynamic balance of data distributions and achieving sustainable AI self-evolution. This work provides a systematic methodology for realizing this goal.
Zhu et al. (Thu,) studied this question.