Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior perfor mance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demon strating that our method can also accelerate flow-matching based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional f ine-tuning iterations, our approach reduces the FLOPs of DiT XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
Zhao et al. (Thu,) studied this question.