What question did this study set out to answer?

This work aims to determine which generative modeling paradigm achieves lower risk under a fixed compute budget as training resources increase.

April 25, 2026 · Open Access

SCALE-X: A Compute-Optimal Crossover Framework for Unifying Autoregressive and Diffusion Generative Models at the Scaling Limit

Key Points

Compares autoregressive (AR) and diffusion models through a matched-compute excess-risk formulation rather than by equating AR negative log-likelihood with diffusion denoising or score-matching objectives.
Models both families with a common asymptotic power-law template in training compute.
Includes simulation tables computed from explicit synthetic scaling laws, illustrating three regimes: AR-favored, diffusion-favored, and crossover.
Crossover behavior between AR and diffusion models depends on the relative scaling exponents, the prefactors, the asymptotic losses, and properties of the data distribution.
Small gaps in scaling exponents can imply very large crossover budgets, even when the high-compute asymptotic slope favors diffusion.
Crossover estimates are highly sensitive when exponents are close, so practical claims should report uncertainty intervals rather than a single point estimate.

Abstract

SCALE-X technical report / preprint. Autoregressive (AR) and diffusion models are frequently compared using different objectives, architectures, and evaluation metrics, complicating asymptotic comparison. This paper studies a narrower question: as training compute grows, which paradigm achieves lower risk under a fixed compute budget, and under what assumptions can a crossover occur? A matched-compute excess-risk formulation is adopted rather than directly equating AR negative log-likelihood with diffusion denoising or score-matching objectives. Both families are modeled through a common asymptotic template, Eₖ(C) ~ Aₖ · C^(-αₖ), k ∈ {AR, Diff}, where C denotes training compute and Eₖ(C) is the excess population risk above the paradigm-specific asymptotic floor. Crossover behavior depends on the relative exponents αₖ, prefactors Aₖ, and asymptotic losses. Three claims are made. First, crossover is only well defined after fixing the comparison object and compute accounting. Second, scaling exponents should depend on properties of the data distribution: AR is expected to benefit from low conditional entropy and strong sequential structure, whereas diffusion is expected to benefit from smooth score fields and low intrinsic geometric complexity. Third, crossover estimation is highly sensitive when exponents are close, so practical claims should report uncertainty intervals rather than a single point estimate. To make the framework concrete, simulation tables computed from explicit synthetic scaling laws are included, illustrating three regimes: AR-favored, diffusion-favored, and crossover. Results show that small exponent gaps can imply very large crossover budgets even when the higher-compute asymptotic slope favors diffusion. The goal is not to declare a universal winner, but to provide a sharper and more credible framework for studying AR-versus-diffusion scaling under controlled assumptions.

Existing OSF archival DOI: 10.17605/OSF.IO/N4ZMS; existing OSF archival page: https://osf.io/n4zms/. Files include the technical report PDF and the LaTeX source tarball when available.
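The sensitivity claim in the abstract can be illustrated with a minimal sketch. It assumes the comparison is made on excess risk only (asymptotic floors are ignored), writes the scaling exponent as `alpha` (an assumed name), and uses purely hypothetical prefactors and exponents that are not taken from the report. Setting the two power laws equal, A_AR · C^(-alpha_AR) = A_Diff · C^(-alpha_Diff), and solving for C gives the crossover budget C* = (A_Diff / A_AR)^(1 / (alpha_Diff - alpha_AR)).

```python
import math


def excess_risk(compute, prefactor, exponent):
    """Excess risk under the power-law template: A * C^(-alpha)."""
    return prefactor * compute ** (-exponent)


def crossover_compute(a_ar, alpha_ar, a_diff, alpha_diff):
    """Compute budget C* at which the two excess-risk curves intersect.

    Returns None when the exponents are equal, because the curves are then
    parallel in log-log space and either never cross or coincide everywhere.
    """
    gap = alpha_diff - alpha_ar
    if gap == 0:
        return None
    return (a_diff / a_ar) ** (1.0 / gap)


# Hypothetical parameters, chosen only for illustration (not from the report):
# diffusion starts with a larger prefactor but has a slightly steeper slope.
A_AR, ALPHA_AR = 1.0, 0.30
A_DIFF = 4.0

# Shrink the exponent gap and watch the crossover budget grow.
for alpha_diff in (0.35, 0.32, 0.31):
    c_star = crossover_compute(A_AR, ALPHA_AR, A_DIFF, alpha_diff)
    print(f"alpha_diff={alpha_diff:.2f}  log10(C*)={math.log10(c_star):.1f}")
```

With these made-up numbers, narrowing the exponent gap from 0.05 to 0.01 moves the crossover from roughly 10^12 to roughly 10^60 compute units, even though diffusion has the steeper slope in every case. Because the gap appears in the denominator of the exponent of C*, a small estimation error in either exponent translates into a large error in the crossover budget, which is why an interval is more informative than a point estimate.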

Cite This Study

Haopeng Jin (Mon,) studied this question.

synapsesocial.com/papers/69ec5b3d88ba6daa22dacc41 https://doi.org/10.5281/zenodo.19712506