What type of study is this?

This is a quantitative study.

October 20, 2025 | Open Access

Scaling Diffusion Transformers Efficiently via μP

Key Points

DiT-XL-2-μP with a transferred learning rate converges 2.9 times faster than the original DiT-XL-2.
Systematic validation shows that μP transfers hyperparameters robustly across diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT.
Under μP, PixArt-α and MMDiT outperform their baselines at a small tuning cost: 5.5% of one training run for PixArt-α and 3% of the consumption by human experts for MMDiT-18B.
The results establish μP as a principled and efficient framework for scaling diffusion Transformers.

Abstract

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (μP) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether the μP of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard μP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that the μP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing μP methodologies. Leveraging this result, we systematically demonstrate that DiT-μP enjoys robust HP transferability. Notably, DiT-XL-2-μP with a transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of μP on text-to-image generation by scaling PixArt-α from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under μP outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-α and 3% of the consumption by human experts for MMDiT-18B. These results establish μP as a principled and efficient framework for scaling diffusion Transformers.
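
To make the idea of HP transfer concrete, the following is a minimal sketch of how a learning rate and initialization tuned on a small proxy model might be rescaled for a wider model. It assumes one common formulation of μP for Adam-type optimizers and is not taken from the paper; the names `MuPRule`, `mup_rule`, and `width_mult`, and the numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MuPRule:
    init_std: float  # standard deviation for weight initialization
    lr: float        # per-layer learning rate for an Adam-type optimizer


def mup_rule(kind: str, width_mult: float, base_lr: float, base_std: float) -> MuPRule:
    """Rescale HPs tuned at the proxy width for a model `width_mult` times wider.

    Assumed scaling (one common μP formulation for Adam, an illustration only):
      input weights : init std unchanged,            lr unchanged
      hidden weights: init std ~ 1/sqrt(width_mult), lr ~ 1/width_mult
      output weights: init std ~ 1/width_mult,       lr ~ 1/width_mult
    """
    if width_mult <= 0:
        raise ValueError("width_mult must be positive")
    if kind == "input":
        return MuPRule(init_std=base_std, lr=base_lr)
    if kind == "hidden":
        return MuPRule(init_std=base_std / width_mult ** 0.5, lr=base_lr / width_mult)
    if kind == "output":
        return MuPRule(init_std=base_std / width_mult, lr=base_lr / width_mult)
    raise ValueError(f"unknown layer kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical values: HPs tuned on a proxy, target model is 4x wider.
    for kind in ("input", "hidden", "output"):
        print(kind, mup_rule(kind, width_mult=4.0, base_lr=1e-3, base_std=0.02))
```

Under these assumed rules, the base learning rate is searched once on the small proxy and the per-layer values for the large model follow from the width ratio alone, which is why the tuning cost can be a small fraction of a full training run.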
