High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches that rely on relevance-based aggregation are computationally inefficient. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on the HELMET and RULER benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language model training.
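The retrieve-and-concatenate step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize`, `bm25_scores`, `build_sample`), the whitespace tokenizer used for counting tokens, and the `k1`/`b` defaults are all assumptions chosen for illustration. It scores documents against a topic string with a standard Okapi BM25 formula and greedily concatenates the best matches until a token budget (128K by default) is reached.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization stands in for a real LLM tokenizer (assumption).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_sample(topic, docs, max_tokens=128_000):
    """Concatenate the best-matching documents within a token budget."""
    scores = bm25_scores(topic, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    picked, used = [], 0
    for i in ranked:
        if scores[i] <= 0:
            break  # remaining documents share no term with the topic
        n = len(tokenize(docs[i]))
        if used + n > max_tokens:
            continue  # skip documents that would overflow the budget
        picked.append(docs[i])
        used += n
    return "\n\n".join(picked), used


if __name__ == "__main__":
    corpus = [
        "deep learning for image classification",
        "a recipe for tomato soup with basil",
        "learning rate schedules in deep networks",
    ]
    sample, used = build_sample("deep learning", corpus)
    print(used, "tokens")
    print(sample)
```

In this sketch the topic string plays the role of the query; a production pipeline would likely use an indexed BM25 engine and the target model's tokenizer to measure sample length.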
J. Jia
Xing Wu
Chaochen Gao
Jia et al. studied this question.
www.synapsesocial.com/papers/68de5da783cbc991d0a20b41 — DOI: https://doi.org/10.48550/arxiv.2509.15568