Existing generative image transformers follow a two-stage paradigm: the first stage learns a codebook that encodes images into discrete codes via vector quantization, and the second stage generates images based on the learned codebook. However, existing methods ignore the fact that information density varies naturally across image regions, and they encode fixed-size regions into fixed-length codes regardless of content. Important regions are therefore under-encoded and unimportant ones redundantly encoded, which degrades both generation quality and speed. To address this challenge, we propose an information-density-based framework for variable-length image coding and generation. In the first stage, our Dynamic Quantization VAE++ (DQVAE++) performs information-adaptive encoding by assigning variable-length codes to image regions according to their information densities, yielding more accurate and robust code representations. In the second stage, the Dynamic Generative Image Transformer (DGiT) enables information-adaptive image generation in both autoregressive (AR) and non-autoregressive (NAR) modes. For AR generation, DGiT-AR generates images from coarse-grained regions (smooth areas with fewer codes) to fine-grained regions (detailed areas with more codes). It does so through a novel stacked-transformer architecture that alternately models the position and content of image codes, together with a novel heterogeneous embedding scheme that distinguishes codes of different granularities. For NAR generation, DGiT-NAR introduces a novel information-prioritized mask scheduling mechanism that generates key structural regions with higher information density first, so that global structure is modeled coherently before local details are synthesized.
Comprehensive experiments on unconditional and conditional image generation validate the superiority of our proposed variable-length coding in both effectiveness and efficiency.
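The core idea of information-adaptive coding and information-prioritized scheduling can be illustrated with a minimal sketch. It is not the DQVAE++ or DGiT implementation: the abstract does not specify how information density is measured, so Shannon entropy of pixel values is used here as a stand-in, and the names `region_entropy`, `allocate_codes`, `unmask_order`, the threshold, and the code counts (1 and 4) are all hypothetical.

```python
import math
from collections import Counter


def region_entropy(pixels):
    """Shannon entropy (in bits) of the pixel values in one region.

    Assumption: entropy stands in for the paper's information density.
    """
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate_codes(regions, threshold=1.0, coarse=1, fine=4):
    """Give `fine` codes to regions above the entropy threshold, else `coarse`.

    The threshold and the two code counts are illustrative values only.
    """
    return [fine if region_entropy(r) > threshold else coarse for r in regions]


def unmask_order(regions):
    """Region indices sorted from highest to lowest entropy.

    Mimics an information-prioritized schedule for NAR generation:
    dense regions are produced first, smooth regions last.
    """
    return sorted(range(len(regions)), key=lambda i: -region_entropy(regions[i]))


# Three toy regions, each a flat list of four pixel values.
regions = [
    [0, 0, 0, 0],        # flat region: one distinct value
    [0, 0, 255, 255],    # two values, evenly split
    [0, 85, 170, 255],   # four distinct values
]

code_lengths = allocate_codes(regions)
order = unmask_order(regions)
print(code_lengths, order)
```

In this toy setting the total number of codes depends on image content rather than on image size alone, which is the property the variable-length scheme relies on. The autoregressive variant described above would traverse regions in the opposite direction, from coarse to fine.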
毛泽普 et al. (Thu,) studied this question.