What question did this study set out to answer?

The study asks whether encoding image regions adaptively, according to their information density, can improve both the quality and the speed of image generation.

January 22, 2026

Toward Accurate Image Generation via Dynamic Generative Image Transformer

Key Points

Developed Dynamic Quantization VAE++ for variable-length coding based on information density.
Implemented Dynamic Generative Image Transformer for autoregressive and non-autoregressive image generation.
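Used a stacked-transformer architecture that alternately models the position and content of image codes, with a heterogeneous embedding scheme to distinguish codes of different granularities.
Established an information-prioritized mask scheduling mechanism for non-autoregressive generation that generates information-dense regions first.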
Demonstrated improved image quality and generation speed compared to existing fixed-length encoding methods.
Validated effectiveness and efficiency in experiments on unconditional and conditional image generation.
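
The variable-length coding idea in the points above can be illustrated with a toy sketch. This is not the DQVAE++ method itself, whose details this summary does not give. Everything here is an assumption made for illustration: Shannon entropy of the pixel histogram stands in for "information density", and the names `region_entropy` and `allocate_codes`, the 4x4 region size, the 1.0-bit threshold, and the 1-versus-4 code counts are all hypothetical.

```python
import math
from collections import Counter


def region_entropy(pixels):
    """Shannon entropy (in bits) of the pixel-value histogram of one region.

    Assumption: entropy is only a stand-in for information density.
    """
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate_codes(image, region=4, threshold=1.0, coarse=1, fine=4):
    """Map each region's (row, col) index to a hypothetical code count.

    Regions whose entropy exceeds `threshold` get `fine` codes,
    all others get `coarse` codes.
    """
    height, width = len(image), len(image[0])
    alloc = {}
    for top in range(0, height, region):
        for left in range(0, width, region):
            pixels = [
                image[y][x]
                for y in range(top, min(top + region, height))
                for x in range(left, min(left + region, width))
            ]
            dense = region_entropy(pixels) > threshold
            alloc[(top // region, left // region)] = fine if dense else coarse
    return alloc


# Toy 4x8 grayscale image: the left half is flat, the right half is varied.
image = [[0] * 4 + [4 * y + x for x in range(4)] for y in range(4)]
alloc = allocate_codes(image)
total_codes = sum(alloc.values())
```

In this toy input the flat region receives one code and the varied region receives four, so the image uses five codes in total rather than the eight that a fixed four-codes-per-region scheme would spend.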

Abstract

Existing generative image transformers follow a two-stage generation paradigm, where the first stage learns a codebook to encode images into discrete codes via vector quantization, and the second stage completes the image generation based on the learned codebook. However, existing methods ignore the naturally varying information densities across different image regions and indiscriminately encode fixed-size regions into fixed-length codes, resulting in insufficient encoding in important regions and redundant encoding in unimportant ones, which degrades both the image generation quality and speed. To address this challenge, we propose a novel information-density-based variable-length image coding and generation framework. In the first stage, our Dynamic Quantization VAE++ (DQVAE++) performs information-adaptive encoding by assigning variable-length codes to image regions according to their information densities, yielding more accurate and robust code representations. In the second stage, the Dynamic Generative Image Transformer (DGiT) enables information-adaptive image generation in both autoregressive and non-autoregressive manners. Specifically, for autoregressive (AR) generation, DGiT-AR generates images autoregressively from coarse-grained regions (smooth areas with fewer codes) to fine-grained regions (detailed areas with more codes). This is accomplished through a novel stacked-transformer architecture that alternately models the position and content of image codes, and a novel heterogeneous embedding scheme to distinguish codes of different granularities. Similarly, for non-autoregressive (NAR) generation, DGiT-NAR introduces a novel information-prioritized mask scheduling mechanism that prioritizes the generation of key structural regions with higher information density. This allows global structures to be modeled coherently first, followed by more effective synthesis of local details. Comprehensive experiments on unconditional and conditional image generation validate the superiority of our proposed variable-length coding in both effectiveness and efficiency.
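
The information-prioritized mask scheduling described for NAR generation can be pictured with a small sketch. It is not the DGiT-NAR procedure, which the abstract does not specify. Two assumptions are made purely for illustration: a per-position density score is already available, and the number of positions still masked after step t of T follows a cosine curve, floor(N * cos(pi/2 * t/T)). The function name `unmask_schedule` and the example scores are hypothetical.

```python
import math


def unmask_schedule(densities, steps):
    """Return, for each step, the positions to unmask at that step.

    Positions are revealed in order of decreasing density, so the
    highest-density positions are generated first. The cosine curve
    that sets how many positions stay masked is an assumption.
    """
    order = sorted(range(len(densities)), key=lambda i: densities[i], reverse=True)
    n = len(order)
    schedule = []
    revealed = 0
    for t in range(1, steps + 1):
        if t < steps:
            masked_after = math.floor(n * math.cos(math.pi / 2 * t / steps))
        else:
            masked_after = 0  # the final step reveals everything that is left
        target = n - masked_after
        schedule.append(order[revealed:target])
        revealed = target
    return schedule


# Hypothetical density scores for eight code positions.
densities = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6]
schedule = unmask_schedule(densities, steps=4)
```

With these scores the first step reveals only the densest position and the last step fills in the three least dense ones, which mirrors the structure-first, detail-later ordering the abstract describes.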
