What question did this study set out to answer?

This research aims to improve the performance of LLaDA-MoE during generation by addressing CPU-bound bottlenecks.

March 21, 2026 | Open Access

Hardware-Aware MoE Inference for Diffusion LLM (LLADA MoE): An Ablation Study of Custom Triton Kernel

Key Points

Developed a custom 2D fused Triton kernel that changes the parallelization strategy for expert computation.
Shifted parallelization from token-centric partitioning to each expert's internal intermediate dimension.
Fused the entire pipeline so that data flow is managed directly on the GPU, replacing CPU-side sorting with hardware-level accumulation.
Achieved a 6x improvement in throughput, exceeding 300 tokens per second.
Reduced fixed dispatch costs by a factor of 20.
Decreased CPU-side overhead by approximately 50% compared to native implementations.
Maintained a stable VRAM footprint of around 14 GiB with minimal fluctuations.
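
One way to see why the partitioning change matters is to count the independent work tiles a GPU can schedule per expert under each scheme. The sketch below is only back-of-the-envelope arithmetic: the tile sizes, the intermediate dimension, and the function names are hypothetical values chosen for illustration, not figures from the paper, and the "token-centric" case is simplified to a grid split over tokens alone.

```python
import math


def token_centric_tiles(num_tokens: int, block_tokens: int) -> int:
    """Tiles per expert when the launch grid is split over tokens only (simplified)."""
    return math.ceil(num_tokens / block_tokens)


def dimension_centric_tiles(
    num_tokens: int, block_tokens: int, intermediate_dim: int, block_inter: int
) -> int:
    """Tiles per expert for a 2D grid: token tiles x intermediate-dimension tiles."""
    return math.ceil(num_tokens / block_tokens) * math.ceil(intermediate_dim / block_inter)


# Hypothetical numbers, NOT taken from the paper.
TOKENS_PER_EXPERT = 32   # inside the "10 to 50 tokens" skinny-matrix range
BLOCK_TOKENS = 64        # assumed token tile size
INTERMEDIATE_DIM = 1024  # assumed expert intermediate width
BLOCK_INTER = 64         # assumed intermediate-dimension tile size

token_tiles = token_centric_tiles(TOKENS_PER_EXPERT, BLOCK_TOKENS)
dim_tiles = dimension_centric_tiles(
    TOKENS_PER_EXPERT, BLOCK_TOKENS, INTERMEDIATE_DIM, BLOCK_INTER
)

print(f"token-centric tiles per expert:     {token_tiles}")
print(f"dimension-centric tiles per expert: {dim_tiles}")
```

Under these assumed sizes, a skinny expert batch yields a single tile when split by tokens but sixteen tiles when the intermediate dimension is also partitioned, which is the kind of extra parallelism the key points describe.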

Abstract

This research addresses the severe CPU-bound dispatch bottlenecks that cripple LLaDA-MoE performance during the generation phase. The core solution is a custom 2D fused Triton kernel that shifts the parallelization strategy from standard token-centric partitioning to the expert's internal intermediate dimension. This ensures high GPU utilization even in "skinny matrix" scenarios where experts receive only 10 to 50 tokens, which normally leads to massive hardware under-utilization. By fusing the entire pipeline and moving data-flow management directly to the GPU hardware, the kernel replaces slow CPU-side sorting with fast hardware-level accumulation. The benchmarks demonstrate a shift from host-bound to execution-bound processing, resulting in a 6x throughput improvement that exceeds 300 tokens per second. Fixed dispatch costs were reduced by a factor of 20, and the CPU-side overhead was cut roughly in half compared to native implementations. Despite these significant performance gains, the VRAM footprint remained identical to the baseline, holding steady at approximately 14 GiB with less than 1% fluctuation. This confirms that hardware-aware kernel optimization can virtually eliminate architectural overhead without sacrificing memory stability or generation quality on LLaDA-MoE.
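
The idea of partitioning over the intermediate dimension and accumulating partial results can be checked numerically on a toy expert. The following is a minimal pure-Python sketch, not the paper's Triton kernel: it assumes a simplified two-matrix expert with a ReLU activation (the real model's expert layout and activation may differ), and all function names are hypothetical. Because the activation is elementwise, each slice of the intermediate dimension contributes an independent partial output, and summing those partials (here a plain `+=`, standing in for on-GPU accumulation such as atomic adds, which is an assumption) reproduces the unpartitioned result.

```python
import random


def matmul(a, b):
    """Plain-Python product of a (m x k) and b (k x n), both lists of rows."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(m):
    """Elementwise ReLU (assumed activation, for illustration only)."""
    return [[max(0.0, v) for v in row] for row in m]


def expert_reference(x, w_up, w_down):
    """Unpartitioned toy expert: relu(x @ w_up) @ w_down."""
    return matmul(relu(matmul(x, w_up)), w_down)


def expert_tiled(x, w_up, w_down, block_inter):
    """Same expert, computed one intermediate-dimension tile at a time."""
    tokens, inter, hidden = len(x), len(w_down), len(w_down[0])
    out = [[0.0] * hidden for _ in range(tokens)]
    for start in range(0, inter, block_inter):
        stop = min(start + block_inter, inter)
        up_tile = [row[start:stop] for row in w_up]  # hidden x tile
        down_tile = w_down[start:stop]               # tile x hidden
        partial = matmul(relu(matmul(x, up_tile)), down_tile)
        for i in range(tokens):
            for j in range(hidden):
                out[i][j] += partial[i][j]  # stands in for on-GPU accumulation
    return out


random.seed(0)
TOKENS, HIDDEN, INTER = 12, 8, 32  # toy sizes, not the model's real dimensions
x = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(TOKENS)]
w_up = [[random.uniform(-1, 1) for _ in range(INTER)] for _ in range(HIDDEN)]
w_down = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(INTER)]

ref = expert_reference(x, w_up, w_down)
tiled = expert_tiled(x, w_up, w_down, block_inter=8)
max_err = max(abs(r - t) for rr, tr in zip(ref, tiled) for r, t in zip(rr, tr))
print(f"max abs difference between reference and tiled: {max_err:.2e}")
```

The tiles never depend on one another, so on a GPU each could run as a separate program even when only a handful of tokens are routed to the expert. The only shared state is the output buffer being accumulated into.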

Cite This Study

Aleksei Manakonov studied this question.

synapsesocial.com/papers/69be38da6e48c4981c6797f3 https://doi.org/10.5281/zenodo.19116469
