What question did this study set out to answer?

The research aims to enhance the performance of the LLaDA-MoE architecture by reducing bottlenecks during inference.

February 21, 2026 | Open Access

Hardware-Saturated Denoising: Accelerating LLaDA-MoE via Permuted Expert Dispatch with benchmark data for gsm8k

Key Points

Proposed FastLLaDAMoE framework to optimize inference processes
Utilized Sort-Compute-Scatter pipeline for efficient GPU memory management
Implemented expert weight stacking for contiguous memory access
Conducted experiments on NVIDIA A100 hardware
Achieved a 1.89x reduction in CUDA execution time
Improved memory bandwidth utilization by 1.93x
Maintained full numerical parity with baseline

Abstract

Optimizing Inference in Large Language Diffusion Mixture-of-Experts via Hardware-Aware Kernels

This work addresses the critical performance bottlenecks in diffusion-based Mixture-of-Experts (MoE) models, specifically focusing on the Large Language Diffusion with Masking (LLaDA) architecture. Due to the iterative nature of the denoising process, standard MoE implementations suffer from significant host-device synchronization overhead and fragmented memory access. We propose FastLLaDAMoE, an optimized framework that utilizes a Sort-Compute-Scatter pipeline and expert weight stacking to ensure contiguous GPU memory access.

Experimental evaluations on NVIDIA A100 hardware demonstrate a 1.89x reduction in CUDA execution time and a 1.93x improvement in memory bandwidth utilization while maintaining full numerical parity with the baseline. By transitioning the MoE forward pass from a memory-bound, CPU-bottlenecked state to a hardware-saturated regime, this work makes large-scale iterative alignment (e.g., GRPO) computationally feasible for diffusion-based language models.
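The abstract names a Sort-Compute-Scatter pipeline and expert weight stacking without giving implementation details. The following is a minimal CPU-side NumPy sketch of the general idea, not the FastLLaDAMoE kernels themselves. It assumes top-1 routing and a single linear layer per expert, both simplifications that the source does not state, and the function names `moe_baseline` and `moe_sort_compute_scatter` are hypothetical.

```python
import numpy as np


def moe_baseline(x, expert_ids, weights):
    """Reference path: one boolean-mask gather per expert (fragmented access).

    x:          (num_tokens, d_in) token activations
    expert_ids: (num_tokens,) integer expert index per token (top-1, assumed)
    weights:    (num_experts, d_in, d_out) stacked expert weights
    """
    out = np.zeros((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for e in range(weights.shape[0]):
        mask = expert_ids == e
        if mask.any():
            out[mask] = x[mask] @ weights[e]
    return out


def moe_sort_compute_scatter(x, expert_ids, weights):
    """Sketch of sort -> compute -> scatter over stacked expert weights."""
    num_experts = weights.shape[0]

    # Sort: a stable permutation that groups tokens routed to the same expert,
    # so each expert reads one contiguous slice of the activation buffer.
    perm = np.argsort(expert_ids, kind="stable")
    x_sorted = x[perm]
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # Compute: one matmul per contiguous segment. Because the weights live in
    # a single (num_experts, d_in, d_out) array, selecting an expert is an
    # index into one buffer rather than a lookup across separate tensors.
    y_sorted = np.empty((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        lo, hi = offsets[e], offsets[e + 1]
        y_sorted[lo:hi] = x_sorted[lo:hi] @ weights[e]

    # Scatter: undo the permutation so outputs return to token order.
    out = np.empty_like(y_sorted)
    out[perm] = y_sorted
    return out
```

Calling both functions on the same random inputs and comparing the results with `np.allclose` is a small-scale analogue of the parity check against the baseline. On a GPU, the per-expert loop would typically be replaced by a grouped or batched matrix multiply over the sorted segments; that step is outside the scope of this sketch.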
