Optimizing Inference in Large Language Diffusion Mixture-of-Experts via Hardware-Aware Kernels

This work addresses the critical performance bottlenecks in diffusion-based Mixture-of-Experts (MoE) models, specifically focusing on the Large Language Diffusion with Masking (LLaDA) architecture. Due to the iterative nature of the denoising process, standard MoE implementations suffer from significant host-device synchronization overhead and fragmented memory access. We propose FastLLaDAMoE, an optimized framework that utilizes a Sort-Compute-Scatter pipeline and expert weight stacking to ensure contiguous GPU memory access. Experimental evaluations on NVIDIA A100 hardware demonstrate a 1.89x reduction in CUDA execution time and a 1.93x improvement in memory bandwidth utilization while maintaining full numerical parity with the baseline. By transitioning the MoE forward pass from a memory-bound, CPU-bottlenecked state to a hardware-saturated regime, this work makes large-scale iterative alignment (e.g., GRPO) computationally feasible for diffusion-based language models.
Alexey Manakonov (Thu,) studied this question.
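
The abstract does not spell out the Sort-Compute-Scatter pipeline, so the following is only a minimal CPU sketch of the general idea in NumPy, not the FastLLaDAMoE kernels themselves. It assumes top-1 routing and a single linear layer per expert; the function names and the `w_stacked` layout (`[num_experts, d_in, d_out]`) are hypothetical choices made for illustration. Tokens are sorted by expert id so that each expert reads one contiguous slice, the per-expert products are computed against a stacked weight tensor, and the results are scattered back to the original token order.

```python
import numpy as np


def moe_masked_baseline(x, expert_ids, w_stacked):
    """Reference path: one boolean-masked gather and matmul per expert."""
    out = np.zeros((x.shape[0], w_stacked.shape[2]), dtype=x.dtype)
    for e in range(w_stacked.shape[0]):
        mask = expert_ids == e
        if mask.any():
            out[mask] = x[mask] @ w_stacked[e]
    return out


def moe_sort_compute_scatter(x, expert_ids, w_stacked):
    """Illustrative Sort-Compute-Scatter pass (top-1 routing assumed).

    x:          [num_tokens, d_in] token activations
    expert_ids: [num_tokens] integer expert index per token
    w_stacked:  [num_experts, d_in, d_out] expert weights in one array
    """
    num_experts = w_stacked.shape[0]

    # Sort: reorder tokens so each expert's tokens are contiguous.
    order = np.argsort(expert_ids, kind="stable")
    x_sorted = x[order]
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # Compute: each expert multiplies one contiguous slice of tokens.
    y_sorted = np.empty((x.shape[0], w_stacked.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        lo, hi = offsets[e], offsets[e + 1]
        y_sorted[lo:hi] = x_sorted[lo:hi] @ w_stacked[e]

    # Scatter: undo the sort so outputs line up with the input order.
    out = np.empty_like(y_sorted)
    out[order] = y_sorted
    return out
```

On a GPU, the same structure would let the segment offsets be computed on the device and the per-expert loop be replaced by a grouped or batched matrix multiply over the stacked weights, which is one plausible way to avoid the host-device synchronization the abstract describes; that mapping is an assumption here rather than something the source states.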