What type of study is this?

This is an experimental study.

October 2, 2025

Open Access

LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Key Points

LiteLong reduces computational costs while synthesizing high-quality long-context data for LLMs.
In experiments, LiteLong achieved competitive performance on HELMET and Ruler benchmarks for long-context tasks.
LiteLong organizes topics with the BISAC book classification system and retrieves documents for each topic with BM25 to generate diverse training samples.
The method supports integration with other long-dependency enhancement techniques to improve language model training.

Abstract

High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.
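The retrieval-and-concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`bm25_scores`, `build_sample`), the whitespace tokenizer, the BM25 parameters `k1=1.5` and `b=0.75`, and counting the token budget in whitespace tokens are all assumptions made for illustration. It scores a small corpus against a topic string with Okapi BM25 and concatenates the best-matching documents until the budget is reached.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_sample(topic, docs, max_tokens=128_000):
    """Concatenate the highest-scoring documents for a topic within a token budget."""
    scores = bm25_scores(topic, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        if scores[i] <= 0:
            break  # remaining documents share no terms with the topic
        n_tokens = len(tokenize(docs[i]))
        if used + n_tokens > max_tokens:
            continue  # skip documents that would exceed the budget
        chosen.append(docs[i])
        used += n_tokens
    return "\n\n".join(chosen)
```

In this sketch a topic such as `"roses gardening"` would be passed to `build_sample` along with the candidate corpus. A production pipeline would presumably use an indexed BM25 engine and the target model's tokenizer rather than the in-memory scoring shown here.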
