What question did this study set out to answer?

Introduce and evaluate update objects for language model training, aiming to enhance efficiency without relying solely on text tokens.

May 16, 2026 · Open Access

Update Objects for Language Model Training: Replaying Training Progress Instead of Tokens

Key points

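The study introduces update objects for language model training and evaluates whether reusing training progress, rather than replaying text tokens alone, makes training more efficient.
It examines two concrete objects in small decoder-only Transformers: sparse checkpoint delta replay objects and optimizer transaction capsules.
Controls include raw token replay, gradient-alignment-selected replay, checkpoint distillation, no-source continuation, warm-start upper bounds, and shuffled-object ablations.
Experiments span two model widths, two text shards, compression sweeps, and a five-seed d96 paper-core gate.
Sparse checkpoint deltas improved durable held-out cross-entropy progress per paid unit by 2.45–2.90× over the best non-object replay control.
Optimizer transaction capsules outperformed raw replay by 1.78× in the same d96 five-seed gate.
Among the secondary objects, only optimizer transaction capsules held up; mean-gradient, layerwise-gradient, alignment-weighted, and sign-consensus capsules failed or remained weak.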

Abstract

Large language model training still treats text tokens as the main reusable unit of learning. If a training interval has already transformed parameters from θ_t to θ_{t+k}, then the interval also emits a causal object: the state transition that the optimizer actually bought. This paper introduces update objects, compact artifacts that replay useful training movement without replaying all source tokens. I study two concrete objects in small conventional decoder-only Transformers: sparse checkpoint delta replay objects and optimizer transaction capsules. All target arms start from the same θ_t and are compared against raw token replay, gradient-alignment-selected replay, checkpoint distillation, no-source continuation, warm-start upper-bound controls, and shuffled-object ablations. Sparse checkpoint deltas are the strongest result: across two model widths, two text shards, compression sweeps, pessimistic build-cost accounting, and a five-seed d96 paper-core gate, they improve durable held-out cross-entropy progress per paid unit by 2.45–2.90× over the best non-object replay control. A more selective secondary object also survives: optimizer transaction capsules beat raw replay by 1.78× in the same d96 five-seed paper-core gate, while mean-gradient, layerwise-gradient, alignment-weighted, and sign-consensus capsules fail or remain weak. These results do not claim production-scale speedup. They support a narrower thesis: previous training runs can produce reusable transition objects that are empirically different from data selection, distillation, checkpoint warm starts, or parameter-efficient adaptation.
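To make the idea of a sparse checkpoint delta concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's method: the abstract does not say how entries are selected, stored, or replayed, so the global top-magnitude selection rule, the `keep_fraction` and `scale` parameters, and the function names `build_sparse_delta` and `replay_sparse_delta` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Toy stand-ins: a "model" is a dict of named flat parameter lists.
Params = Dict[str, List[float]]
# A sparse delta stores (index, change) pairs per parameter name.
SparseDelta = Dict[str, List[Tuple[int, float]]]


def build_sparse_delta(theta_t: Params, theta_tk: Params,
                       keep_fraction: float) -> SparseDelta:
    """Keep the largest-magnitude entries of (theta_tk - theta_t).

    Assumption: entries are ranked globally by absolute change. The
    source abstract does not specify the actual selection rule.
    """
    entries = []
    for name, start in theta_t.items():
        end = theta_tk[name]
        for i, (a, b) in enumerate(zip(start, end)):
            entries.append((abs(b - a), name, i, b - a))
    n_keep = max(1, int(len(entries) * keep_fraction))
    entries.sort(key=lambda e: e[0], reverse=True)
    delta: SparseDelta = {}
    for _, name, i, change in entries[:n_keep]:
        delta.setdefault(name, []).append((i, change))
    return delta


def replay_sparse_delta(theta: Params, delta: SparseDelta,
                        scale: float = 1.0) -> Params:
    """Return a copy of theta with the stored changes added back in."""
    out = {name: list(values) for name, values in theta.items()}
    for name, updates in delta.items():
        for i, change in updates:
            out[name][i] += scale * change
    return out


# Toy checkpoints standing in for theta_t and theta_{t+k}.
theta_t = {"w": [0.0, 0.0, 0.0, 0.0], "b": [1.0, 1.0]}
theta_tk = {"w": [0.5, -0.01, 0.02, -2.0], "b": [1.0, 1.25]}

delta = build_sparse_delta(theta_t, theta_tk, keep_fraction=0.5)
replayed = replay_sparse_delta(theta_t, delta)
```

With `keep_fraction=0.5`, three of the six parameter changes are stored, and replaying them from `theta_t` recovers the large movements while dropping the small ones. Whether such a truncated transition preserves held-out progress is the empirical question the paper evaluates; this toy example does not test it.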

Cite This Study

Julio Jose Lena studied this question.

synapsesocial.com/papers/6a080ae2a487c87a6a40cf59
https://doi.org/10.5281/zenodo.20186220
