Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference because sampling requires several steps. In this paper, we propose DiMO, a novel approach that distills masked diffusion models into a one-step generator. DiMO addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes the model's output logits in an on-policy framework with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher's training distribution. We show DiMO's effectiveness on both class-conditional and text-conditional image generation, achieving performance competitive with multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
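The two ingredients named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation. The mask id, the `random_ratio` parameter, the uniform choice of replacement tokens, and the use of a forward KL divergence are all assumptions made for illustration, since the abstract does not specify the initialization scheme or the divergence. `init_tokens` and `token_level_kl` are hypothetical helper names.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the mask token (assumption)


def init_tokens(seq_len, vocab_size, random_ratio, rng):
    """Start from an all-mask sequence and replace a fraction of positions
    with uniformly random tokens, so the one-step input carries some entropy
    while staying mostly masked (an assumed scheme, not the paper's)."""
    tokens = [MASK_ID] * seq_len
    n_random = round(random_ratio * seq_len)
    for pos in rng.sample(range(seq_len), n_random):
        tokens[pos] = rng.randrange(vocab_size)
    return tokens


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_level_kl(p_logits, q_logits):
    """Mean over token positions of KL(p || q), where p and q are the
    per-token categorical distributions defined by the two logit sets."""
    total = 0.0
    for pl, ql in zip(p_logits, q_logits):
        log_p, log_q = log_softmax(pl), log_softmax(ql)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(p_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    x_init = init_tokens(seq_len=16, vocab_size=1024, random_ratio=0.25, rng=rng)
    print("initial tokens:", x_init)

    # Toy logits for 2 token positions over a 3-entry vocabulary.
    teacher_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]
    auxiliary_logits = [[1.0, 1.0, 0.0], [0.5, 0.0, 2.0]]
    print("token-level KL:", token_level_kl(teacher_logits, auxiliary_logits))
```

In a distillation loop, a divergence of this kind would be computed between teacher and auxiliary-model predictions at each token position and used to update the one-step generator's logits. How the paper combines these terms is not stated in the abstract.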
Yuanzhi Zhu
Xi Wang
Stéphane Lathuilière
Zhu et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e62de1a8c0c6d458740161 — DOI: https://doi.org/10.48550/arxiv.2503.15457