What question did this study set out to answer?

The research aims to develop a more efficient approach to visual generation by reducing computational costs associated with diffusion models.

January 18, 2026

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Key points

Proposes the Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both the timestep and spatial dimensions.
Introduces Timestep-wise Dynamic Width (TDW), which adapts model width conditioned on the generation timestep.
Designs a Spatial-wise Dynamic Token (SDT) strategy that avoids redundant computation at unnecessary spatial locations.
Extends DyDiT beyond diffusion to flow matching, and to video generation and text-to-image generation.
Investigates parameter-efficient training through timestep-based dynamic LoRA (TD-LoRA) to avoid the cost of full fine-tuning.
Reduces the FLOPs of DiT-XL by 51% with fewer than 3% additional fine-tuning iterations.
Yields a 1.73× realistic speedup on hardware.
Achieves a competitive FID score of 2.07 on ImageNet.
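
To put the reported FLOPs reduction and hardware speedup side by side, the short sketch below converts a fractional FLOPs reduction into the speedup one would see if latency scaled perfectly with FLOPs, and compares it with the measured figure. The function names are illustrative and do not come from the paper or its code.

```python
def ideal_speedup(flops_reduction: float) -> float:
    """Speedup expected if latency were exactly proportional to FLOPs."""
    if not 0.0 <= flops_reduction < 1.0:
        raise ValueError("flops_reduction must be in [0, 1)")
    return 1.0 / (1.0 - flops_reduction)


def speedup_efficiency(flops_reduction: float, measured_speedup: float) -> float:
    """Fraction of the ideal speedup that the measured speedup recovers."""
    return measured_speedup / ideal_speedup(flops_reduction)


# Figures quoted in the summary: 51% fewer FLOPs, 1.73x speedup on hardware.
ideal = ideal_speedup(0.51)                  # roughly 2.04
efficiency = speedup_efficiency(0.51, 1.73)  # roughly 0.85
print(f"ideal {ideal:.2f}x, measured 1.73x, efficiency {efficiency:.0%}")
```

Under this simple proportionality assumption, the measured speedup recovers roughly 85% of the ideal one. The remainder is not explained in the summary; overheads that FLOPs do not capture, such as memory traffic and routing, are typical candidates.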

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet.
The code is available at https://github.com/alibaba-damo-academy/DyDiT.
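
The two mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes, purely for illustration, that TDW is realized by a small router mapping a timestep embedding to one keep/skip decision per attention head, and that SDT is realized by a router scoring each token and letting skipped tokens bypass a block. All names, weights, and the 0.5 threshold are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head_mask_for_timestep(t_emb, router_w, router_b, threshold=0.5):
    """Timestep-conditioned width (assumed form): one decision per head.

    The mask depends only on the timestep embedding, so it could be
    computed once per timestep and shared across all samples.
    """
    scores = [
        sum(w * v for w, v in zip(row, t_emb)) + b
        for row, b in zip(router_w, router_b)
    ]
    return [sigmoid(s) >= threshold for s in scores]


def token_mask(tokens, router_w, router_b, threshold=0.5):
    """Spatial token selection (assumed form): one decision per token."""
    return [
        sigmoid(sum(w * v for w, v in zip(router_w, tok)) + router_b) >= threshold
        for tok in tokens
    ]


def apply_block_to_selected(tokens, keep, block):
    """Run `block` on kept tokens only; skipped tokens pass through unchanged."""
    return [block(tok) if k else tok for tok, k in zip(tokens, keep)]


# Toy inputs: a 2-dim timestep embedding, 3 heads, and 3 tokens of width 2.
t_emb = [1.0, -1.0]
heads = head_mask_for_timestep(
    t_emb,
    router_w=[[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]],
    router_b=[0.0, 0.0, 0.0],
)

tokens = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
keep = token_mask(tokens, router_w=[3.0, 0.0], router_b=0.0)
out = apply_block_to_selected(tokens, keep, block=lambda tok: [2.0 * v for v in tok])

print(heads)  # [True, False, True]
print(keep)   # [True, False, True]
print(out)    # [[2.0, 0.0], [-1.0, 0.0], [1.0, 1.0]]
```

In this toy run the second head and the second token are skipped, so neither the head's attention computation nor the block's work on that token would need to be performed. How the real routers are trained and where they sit in each block is documented in the linked repository, not here.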
